Cranes at CVR Q23
NLP Analysis on Frankenstein (Full Book Text)
Dataset:
Use the file frankenstein.txt containing the full text of Mary Shelley's novel.
Part A – Preprocessing and Basic Statistics
- Read the full text from frankenstein.txt.
- Remove Gutenberg headers and footers (if present).
- Convert text to lowercase and remove punctuation.
- Tokenize into words and compute:
  - Total number of words
  - Number of unique words
  - Average word length
  - Total number of sentences
  - Average sentence length (in words)
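As a starting point for Part A, here is a minimal sketch using only the Python standard library. The marker patterns assume the usual Project Gutenberg "*** START OF ..." and "*** END OF ..." lines, the sentence splitter is a naive punctuation-based rule, and all function names (`strip_gutenberg`, `tokenize`, `split_sentences`, `basic_stats`) are illustrative choices, not a required interface.

```python
import re

# Assumed marker format; adjust if your copy of the file differs.
START_RE = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.I | re.S
)
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.I)


def strip_gutenberg(text):
    """Keep only the text between the START and END markers, when present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


def tokenize(text):
    """Lowercase and keep runs of a-z (with an optional inner apostrophe)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace (over-splits 'Mr. X')."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def basic_stats(text):
    """Compute the five statistics requested in Part A."""
    words = tokenize(text)
    sentences = split_sentences(text)
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "total_sentences": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

# Usage (assumes frankenstein.txt is in the working directory):
#   raw = open("frankenstein.txt", encoding="utf-8").read()
#   print(basic_stats(strip_gutenberg(raw)))
```

The regex tokenizer lowercases and drops punctuation in one step; a library tokenizer (for example from NLTK or spaCy) would handle abbreviations and accented characters more carefully.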
Part B – Vocabulary and Frequency
- Find and display the top 20 most frequent words (after removing stopwords).
- List 15 words that appear only once (hapax legomena).
- Extract and display the 10 longest words in the text.
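The three Part B tasks can all be derived from one frequency table. The sketch below assumes `words` is the lowercase token list from Part A; the `STOPWORDS` set is a deliberately tiny placeholder (a fuller list, such as the one shipped with NLTK, would normally be used), and `vocabulary_report` is a hypothetical helper name.

```python
from collections import Counter

# Placeholder stopword list for illustration only; replace with a fuller one.
STOPWORDS = {
    "the", "and", "of", "to", "a", "i", "in", "my", "was", "that",
    "me", "with", "had", "but", "which", "he", "you", "it", "his", "as",
}


def vocabulary_report(words, top_n=20, hapax_n=15, longest_n=10):
    """Return (top frequent non-stopwords, hapax legomena, longest words)."""
    counts = Counter(words)
    # Frequency ranking ignores stopwords.
    content = Counter({w: c for w, c in counts.items() if w not in STOPWORDS})
    top = content.most_common(top_n)
    # Hapax legomena: words that occur exactly once, listed alphabetically.
    hapaxes = sorted(w for w, c in counts.items() if c == 1)[:hapax_n]
    # Longest distinct words; ties are broken alphabetically.
    longest = sorted(counts, key=lambda w: (-len(w), w))[:longest_n]
    return top, hapaxes, longest
```

Sorting the hapax list alphabetically is only one way to pick 15 of them; a random sample would be an equally valid reading of the task.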
Part C – Named Entity and Character Analysis
- Use POS tagging or NER to extract proper nouns (names, places).
- Count how many times each major character appears (e.g., Victor, Frankenstein, monster, Elizabeth).
- Find 5 sentences where both "Victor" and "monster" appear together.
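For Part C, a POS tagger or NER model (for example from NLTK or spaCy) is the intended tool for proper nouns. The sketch below swaps in a plain capitalization heuristic so that it runs without extra downloads; treat `proper_noun_candidates` as a rough stand-in, not as real tagging. The mention counter and sentence search use whole-word, case-insensitive matching. All function names are illustrative.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def proper_noun_candidates(text):
    """Heuristic: capitalized words that are not the first word of a sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        for tok in tokens[1:]:  # skip the sentence-initial word
            if tok[0].isupper() and tok != "I":
                counts[tok] += 1
    return counts


def character_counts(text, names):
    """Count whole-word, case-insensitive mentions of each name."""
    lowered = text.lower()
    return {
        name: len(re.findall(rf"\b{re.escape(name.lower())}\b", lowered))
        for name in names
    }


def cooccurring_sentences(text, terms, limit=5):
    """Return up to `limit` sentences that contain every term in `terms`."""
    hits = []
    for sentence in split_sentences(text):
        low = sentence.lower()
        if all(re.search(rf"\b{re.escape(t.lower())}\b", low) for t in terms):
            hits.append(sentence)
            if len(hits) == limit:
                break
    return hits
```

The heuristic misses names that open a sentence and picks up any other capitalized word, so its output should be checked against a real tagger before being reported.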
Part D – Thematic and Chapter Analysis
- Split the novel by chapter (using the keyword "Chapter").
- For each chapter:
  - Count the number of words
  - Find top 5 words (excluding stopwords)
- Track emotion using a lexicon or sentiment library (e.g., the NRC Emotion Lexicon or TextBlob):
  - Compute emotion/sentiment scores per chapter
  - Plot the sentiment progression across chapters
- Implement KWIC (Keyword-in-Context) for terms such as “life”, “death”, “science”, and “creation”.
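The chapter split, a per-chapter score, and KWIC from Part D might be sketched as follows. This assumes chapter headings sit on their own line as "Chapter" plus a single number or numeral; the `POSITIVE` and `NEGATIVE` word sets are toy placeholders standing in for a real lexicon such as NRC, and the function names are illustrative.

```python
import re

# Assumes headings like "Chapter 1" or "CHAPTER IV." alone on a line.
CHAPTER_RE = re.compile(r"^\s*chapter\s+\w+\.?\s*$", re.I | re.M)

# Toy lexicon for illustration only; substitute NRC or a library scorer.
POSITIVE = {"joy", "happy", "love", "hope", "delight", "happiness"}
NEGATIVE = {"death", "misery", "horror", "despair", "fear", "wretched"}


def split_chapters(text):
    """Return the non-empty text chunks that follow each chapter heading."""
    parts = CHAPTER_RE.split(text)
    return [p.strip() for p in parts[1:] if p.strip()]


def tokenize(text):
    """Lowercase and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_score(words):
    """(positive hits - negative hits) / number of words."""
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def kwic(words, keyword, window=4):
    """List each occurrence of `keyword` with `window` words on either side."""
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}".strip())
    return lines
```

Mapping `lexicon_score` over the tokenized chapters gives one value per chapter, which can then be passed to any plotting library as a line chart. If the file contains a table of contents that also lists chapter headings, those lines will match the pattern too, so check the chunks (and where the opening letters end up) before trusting chapter numbers.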
Part E – Stylometric and Dialogue Analysis
- Compute the type-token ratio (unique words / total words).
- Compare the ratio across early and later chapters.
- Detect passive voice constructions (e.g., “was created”, “had been abandoned”).
- Extract all direct speech (lines within quotation marks).
- Compare sentiment of spoken dialogue vs. narration.
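The stylometric tasks in Part E can be approximated with regular expressions. In the sketch below, the passive-voice pattern is a rough heuristic (a form of "be" followed by a word ending in "-ed"), so it misses irregular participles such as "was seen" and can produce false positives such as "was indeed". The quote pattern assumes each quotation opens and closes with matching marks. Function names are illustrative.

```python
import re

# Heuristic: auxiliary "be" (optionally "has/have/had been") + word ending in -ed.
PASSIVE_RE = re.compile(
    r"\b(?:(?:has|have|had)\s+been|am|is|are|was|were|be|been|being)\s+\w+ed\b",
    re.I,
)

# Matches text inside curly or straight double quotes.
QUOTE_RE = re.compile(r'“([^”]+)”|"([^"]+)"')


def type_token_ratio(words):
    """Unique words divided by total words."""
    return len(set(words)) / len(words) if words else 0.0


def find_passives(text):
    """Return every span matched by the passive-voice heuristic."""
    return [m.group(0) for m in PASSIVE_RE.finditer(text)]


def split_dialogue(text):
    """Return (list of quoted passages, text with the quoted passages removed)."""
    dialogue = [m.group(1) or m.group(2) for m in QUOTE_RE.finditer(text)]
    narration = QUOTE_RE.sub(" ", text)
    return dialogue, narration
```

Because the type-token ratio falls as a text gets longer, comparing early and later chapters is fairer on samples of equal size (for example, the first 2,000 tokens of each). The two outputs of `split_dialogue` can be fed to the same sentiment scorer to compare dialogue with narration.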
Part F – Advanced NLP (Optional)
- Apply topic modeling (LDA) to extract 5 major topics from the text.
- Use TF-IDF and cosine similarity to compare Chapter 1 with Chapter 10.
- Generate a summary of the entire book using:
  - Extractive method (TextRank)
  - Abstractive method (Transformer model)
- Build a binary classifier to distinguish narration vs. dialogue using linguistic features.
- Identify the most common POS-tag sequences (e.g., tag bigrams or trigrams) and compare them between narration and dialogue.
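For the TF-IDF comparison in Part F, a library such as scikit-learn is the usual choice; the sketch below is a small pure-Python version so the arithmetic is visible. It assumes `docs` is a list of token lists, one per chapter, and uses a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1. The names `tfidf_vectors` and `cosine` are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} TF-IDF vector per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Fitting the vectors on all chapters and then calling `cosine(vectors[0], vectors[9])` compares Chapter 1 with Chapter 10 (assuming the list is in chapter order). Computing IDF over only those two chapters would make the weights far less informative.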
Output Requirements
- Print statistics and sample outputs clearly.
- Use visualizations for frequencies, emotions, and comparisons.
- Include sample quotes or snippets in contextual tasks.