NLP Analysis of Frankenstein

Cranes at CVR Q23

NLP Analysis on Frankenstein (Full Book Text)

Dataset:

Use the file frankenstein.txt containing the full text of Mary Shelley's novel.

Part A – Preprocessing and Basic Statistics

Read the full text from frankenstein.txt.
Remove Gutenberg headers and footers (if present).
Convert text to lowercase and remove punctuation.
Tokenize into words and compute:
- Total number of words
- Number of unique words
- Average word length
- Total number of sentences
- Average sentence length (in words)
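
One possible starting point for Part A is sketched below using only the standard library. The START/END marker patterns, the regex tokenizer, and the naive sentence splitter are assumptions for illustration; check them against your copy of the file, and note that a library tokenizer will give somewhat different counts.

```python
import re

# Marker patterns commonly seen in Project Gutenberg files (assumed; verify in your copy).
START_RE = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.I | re.S)
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.I | re.S)


def strip_gutenberg(raw):
    """Return the text between the START and END markers, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def tokenize(text):
    """Lowercase the text and keep alphabetic words (apostrophes inside words survive)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def basic_stats(text):
    """Compute the five Part A statistics with a naive sentence splitter."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n_words = max(len(words), 1)          # guard against empty input
    n_sentences = max(len(sentences), 1)
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "total_sentences": len(sentences),
        "avg_sentence_length": len(words) / n_sentences,
    }
```

To run it on the book, read the file with `open("frankenstein.txt", encoding="utf-8").read()`, pass the result through `strip_gutenberg`, and print the dictionary returned by `basic_stats`.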

Part B – Vocabulary and Frequency

Find and display the top 20 most frequent words (after removing stopwords).
List 15 words that appear only once (hapax legomena).
Extract and display the 10 longest words in the text.
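
The three Part B tasks can all be derived from a single word count. The sketch below assumes a tiny hand-written stopword set (hypothetical; substitute a fuller list such as the one shipped with NLTK) and a simple regex tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); use a fuller list in practice.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "my", "was",
             "that", "it", "is", "with", "had", "but"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def vocabulary_report(text, top_n=20, hapax_n=15, longest_n=10):
    """Return top content words, a sample of hapax legomena, and the longest words."""
    counts = Counter(tokenize(text))
    content = Counter({w: c for w, c in counts.items() if w not in STOPWORDS})
    hapax = sorted(w for w, c in counts.items() if c == 1)
    longest = sorted(counts, key=lambda w: (-len(w), w))  # longest first, ties alphabetical
    return {
        "top_words": content.most_common(top_n),
        "hapax": hapax[:hapax_n],
        "longest": longest[:longest_n],
    }
```

The hapax list here is sorted alphabetically and truncated, so it shows the first 15 alphabetically rather than a random sample; either choice satisfies the task.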

Part C – Named Entity and Character Analysis

Use POS tagging or NER to extract proper nouns (names, places).
Count how many times each major character appears (e.g., Victor, Frankenstein, monster, Elizabeth).
Find 5 sentences in which both "Victor" and "monster" appear.
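
A tagger-free sketch of Part C is shown below. It is not POS tagging or NER: proper nouns are approximated by a capitalization heuristic (capitalized words that do not open a sentence), which is an assumption and will miss names at sentence starts. The function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def proper_noun_candidates(text):
    """Heuristic stand-in for NER: capitalized words that are not sentence-initial."""
    counts = Counter()
    for sentence in split_sentences(text):
        words = re.findall(r"[A-Za-z]+", sentence)
        for word in words[1:]:
            if word[0].isupper() and word != "I":
                counts[word] += 1
    return counts


def count_mentions(text, names):
    """Case-insensitive whole-word counts for each name or term."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {name: counts[name.lower()] for name in names}


def cooccurring_sentences(text, term_a, term_b, limit=5):
    """Return up to `limit` sentences containing both terms as whole words."""
    pat_a = re.compile(rf"\b{re.escape(term_a)}\b", re.I)
    pat_b = re.compile(rf"\b{re.escape(term_b)}\b", re.I)
    hits = [s for s in split_sentences(text) if pat_a.search(s) and pat_b.search(s)]
    return hits[:limit]
```

For example, `cooccurring_sentences(text, "Victor", "monster")` returns at most five matching sentences; if fewer exist in the text, the list is simply shorter.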

Part D – Thematic and Chapter Analysis

Split the novel by chapter (using the keyword "Chapter").
For each chapter:
- Count the number of words
- Find top 5 words (excluding stopwords)
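
A minimal sketch of chapter splitting follows. It assumes each chapter begins with a heading line such as "Chapter 1" or "CHAPTER IV"; the stopword set is a small illustrative placeholder. Headings followed by no text (for example, table-of-contents entries) are dropped via the hypothetical `min_words` parameter.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption); use a fuller list in practice.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "my", "we", "our",
             "is", "am", "was", "were", "by", "up", "it", "that", "with"}

# A heading is assumed to be a line such as "Chapter 1" or "CHAPTER IV".
CHAPTER_RE = re.compile(r"^[ \t]*chapter[ \t]+(?:\d+|[ivxlc]+)\b.*$", re.I | re.M)


def split_chapters(text, min_words=1):
    """Return (heading, body) pairs, skipping headings with fewer than min_words of body."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        if len(body.split()) >= min_words:
            chapters.append((match.group(0).strip(), body))
    return chapters


def chapter_summary(text, top_n=5):
    """Word count and top content words for each chapter."""
    rows = []
    for heading, body in split_chapters(text):
        words = re.findall(r"[a-z]+(?:['’][a-z]+)?", body.lower())
        content = Counter(w for w in words if w not in STOPWORDS)
        rows.append({"chapter": heading, "words": len(words), "top": content.most_common(top_n)})
    return rows
```

If the file contains a table of contents, the last entry in it may swallow everything up to the first real chapter heading, so inspect the first few results and raise `min_words` or trim the front matter as needed.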
Track emotion using a lexicon or sentiment library (e.g., the NRC Emotion Lexicon or TextBlob):
- Compute emotion/sentiment scores per chapter
- Plot the sentiment progression across chapters
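
The scoring idea can be sketched with a toy polarity lexicon. The two word sets below are invented for illustration and are far too small for real use; replace them with NRC categories or a library score. The plotting helper imports matplotlib lazily, so the scoring functions work without it.

```python
import re

# Toy polarity lexicon (an assumption for illustration only); replace with a real lexicon.
POSITIVE = {"joy", "love", "happy", "hope", "delight", "beautiful", "kind"}
NEGATIVE = {"misery", "horror", "despair", "death", "wretched", "fear", "grief"}


def sentiment_score(text):
    """(positive hits - negative hits) / total words, in the range [-1, 1]."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    positive = sum(w in POSITIVE for w in words)
    negative = sum(w in NEGATIVE for w in words)
    return (positive - negative) / len(words)


def score_chapters(chapter_texts):
    """One score per chapter, in reading order."""
    return [sentiment_score(body) for body in chapter_texts]


def plot_scores(scores):
    """Line plot of sentiment by chapter number (requires matplotlib)."""
    import matplotlib.pyplot as plt  # optional dependency, imported only when plotting

    plt.plot(range(1, len(scores) + 1), scores, marker="o")
    plt.xlabel("Chapter")
    plt.ylabel("Sentiment score")
    plt.title("Sentiment progression across chapters")
    plt.show()
```

Dividing by chapter length keeps long chapters from dominating the plot; raw hit counts would mostly track chapter size.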
Implement KWIC (Keyword-in-Context) for terms like:
- “life”, “death”, “science”, “creation”
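
KWIC can be implemented as a sliding window over the token list. The sketch below is one simple layout, assuming a window measured in words and a right-aligned left context; the `window` and `limit` parameters are illustrative choices.

```python
import re


def kwic(text, keyword, window=5, limit=10):
    """Return up to `limit` lines showing `keyword` with `window` words on each side."""
    tokens = re.findall(r"\w+(?:['’]\w+)?", text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>40} [{token}] {right}")
            if len(lines) == limit:
                break
    return lines
```

Calling `print("\n".join(kwic(text, "life")))` for each of the four terms produces a small concordance. Punctuation is dropped from the context because the tokenizer keeps only word characters.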

Part E – Stylometric and Dialogue Analysis

Compute the type-token ratio (unique words / total words).
Compare the ratio across early and later chapters.
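
Type-token ratio falls as a text gets longer, so comparing halves of unequal length can mislead. The sketch below assumes "early" and "later" mean the first and second half of the chapter list, and truncates both to the same number of tokens before computing the ratio; both choices are assumptions, not requirements of the task.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic words."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def type_token_ratio(words):
    """Unique words divided by total words (0.0 for empty input)."""
    return len(set(words)) / len(words) if words else 0.0


def compare_early_late(chapter_texts):
    """TTR of the first vs. second half of the chapters, on equal-sized samples."""
    mid = len(chapter_texts) // 2
    early = tokenize(" ".join(chapter_texts[:mid]))
    late = tokenize(" ".join(chapter_texts[mid:]))
    n = min(len(early), len(late))  # equalize sample size before comparing
    return type_token_ratio(early[:n]), type_token_ratio(late[:n])
```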
Detect passive voice constructions (e.g., “was created”, “had been abandoned”).
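
A rough regex heuristic for passives is sketched below: a form of "be" (optionally preceded by has/have/had, optionally followed by an -ly adverb) and then a word that looks like a past participle. The irregular participle list is a short illustrative sample. Expect false positives such as "was indeed" or "is often"; a dependency parser would be more reliable.

```python
import re

# Auxiliary: a form of "be", optionally as "has/have/had been".
AUX = r"(?:(?:has|have|had)\s+been|am|is|are|was|were|be|been|being)"
# Participle guess: -ed / -en endings plus a few irregular forms (illustrative, not exhaustive).
PARTICIPLE = (r"(?:\w+ed|\w+en|made|done|found|left|told|known|born|"
              r"brought|felt|lost|heard|thought)")
PASSIVE_RE = re.compile(rf"\b{AUX}\s+(?:\w+ly\s+)?{PARTICIPLE}\b", re.I)


def find_passives(text):
    """Return every span that matches the passive-voice heuristic."""
    return [match.group(0) for match in PASSIVE_RE.finditer(text)]
```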
Extract all direct speech (lines within quotation marks).
Compare sentiment of spoken dialogue vs. narration.
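
Direct speech can be separated from narration by matching paired double quotation marks, straight or curly. This sketch assumes every opening quote has a closing quote; in texts where a multi-paragraph speech reopens the quote at each paragraph without closing it, the pairing will drift and needs manual checking.

```python
import re

# Matches “curly” or "straight" double-quoted spans; single quotes are ignored on purpose
# because they are ambiguous with apostrophes.
QUOTE_RE = re.compile(r'“([^”]*)”|"([^"]*)"')


def split_dialogue(text):
    """Return (list of quoted spans, remaining narration with quotes removed)."""
    dialogue = [curly or straight for curly, straight in QUOTE_RE.findall(text)]
    narration = re.sub(r"\s+", " ", QUOTE_RE.sub(" ", text)).strip()
    return dialogue, narration
```

Once the text is split, the same sentiment scorer can be applied to `" ".join(dialogue)` and to `narration` so the two scores are directly comparable.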

Part F – Advanced NLP (Optional)

Apply topic modeling (LDA) to extract 5 major topics from the text.
Use TF-IDF and cosine similarity to compare Chapter 1 with Chapter 10.
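
For the TF-IDF comparison, the IDF weights should come from all chapters, not just the two being compared; with only two documents, IDF carries almost no information. The sketch below is a from-scratch version using raw term counts and a common smoothed IDF, ln((1 + N) / (1 + df)) + 1; a library vectorizer may normalize differently and give slightly different numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each document counts once per term
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in counts.items()} for counts in term_counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Given a list `chapter_texts` in reading order, `vectors = tfidf_vectors(chapter_texts)` followed by `cosine_similarity(vectors[0], vectors[9])` compares Chapter 1 with Chapter 10.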
Generate a summary of the entire book using:
- Extractive method (TextRank)
- Abstractive method (Transformer model)
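
For the extractive option, a simplified TextRank can be written without external libraries. The sketch below builds a sentence graph weighted by word overlap (a set-based variant of the normalized overlap measure used in TextRank), runs a fixed number of PageRank-style iterations, and returns the top sentences in their original order. The damping factor and iteration count are conventional defaults, not tuned values. It is quadratic in the number of sentences, so summarizing per chapter is more practical than feeding in the whole book at once.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared words divided by the sum of log set sizes (0.0 for very short sentences)."""
    if len(a) < 2 or len(b) < 2:  # log(1) == 0 would give a zero denominator
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_summary(text, n_sentences=3, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences, kept in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    weights = [[similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```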
Build a binary classifier to distinguish narration vs. dialogue using linguistic features.
Identify the most common POS tag sequences (e.g., tag bigrams or trigrams) and compare them between narration and dialogue.

Output Requirements

Print statistics and sample outputs clearly.
Use visualizations for frequencies, emotions, and comparisons.
Include sample quotes or snippets in contextual tasks.