Cranes at CVR Q23

NLP Analysis on Frankenstein (Full Book Text)

Dataset:

Use the file frankenstein.txt, which contains the full text of Mary Shelley's novel.


Part A – Preprocessing and Basic Statistics
  1. Read the full text from frankenstein.txt.
  2. Remove Gutenberg headers and footers (if present).
  3. Convert text to lowercase and remove punctuation.
  4. Tokenize into words and compute:
    • Total number of words
    • Number of unique words
    • Average word length
    • Total number of sentences (segment the original text, before punctuation is removed)
    • Average sentence length (in words)
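
The steps in Part A can be sketched with the standard library alone. This is a minimal sketch, not a reference solution: the helper names (strip_gutenberg, tokenize, split_sentences, basic_stats) are hypothetical, the START/END marker pattern is an assumption about how the Gutenberg file is wrapped, and the regex sentence splitter is a rough stand-in for a proper sentence tokenizer (it will mis-split abbreviations such as "Mr.").

```python
import re

# Assumed marker format: Project Gutenberg files usually wrap the book
# in "*** START OF ... ***" and "*** END OF ... ***" lines.
START_RE = re.compile(r"\*\*\*\s*START OF.*?\*\*\*", re.I | re.S)
END_RE = re.compile(r"\*\*\*\s*END OF.*?\*\*\*", re.I | re.S)


def strip_gutenberg(raw):
    """Return the text between the START and END markers, if present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    stop = end.start() if end else len(raw)
    return raw[begin:stop].strip()


def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    lowered = text.lower()
    lowered = re.sub(r"['’]", "", lowered)        # "it's" -> "its"
    lowered = re.sub(r"[^\w\s]|_", " ", lowered)  # other punctuation -> space
    return lowered.split()


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def basic_stats(text):
    """Compute the five statistics listed in Part A."""
    sentences = split_sentences(text)  # done before punctuation is removed
    words = tokenize(text)
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "avg_word_length": sum(map(len, words)) / len(words),
        "total_sentences": len(sentences),
        "avg_sentence_length": len(words) / len(sentences),
    }
```

To run it on the book, read the file first, for example open("frankenstein.txt", encoding="utf-8").read(), pass the result through strip_gutenberg, and then call basic_stats on the body.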

Part B – Vocabulary and Frequency
  1. Find and display the top 20 most frequent words (after removing stopwords).
  2. List 15 words that appear only once (hapax legomena).
  3. Extract and display the 10 longest words in the text.
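
One possible shape for the three Part B tasks, using collections.Counter. The function names are hypothetical, and the STOPWORDS set is a tiny illustrative placeholder; a real run would more likely use a full list such as the English stopwords shipped with NLTK. The input is assumed to be a list of lowercase tokens with punctuation already removed.

```python
from collections import Counter

# Illustrative placeholder only; substitute a complete stopword list.
STOPWORDS = {"the", "and", "of", "to", "a", "i", "in", "my", "was", "that"}


def top_words(words, n=20, stopwords=STOPWORDS):
    """Most frequent words after dropping stopwords."""
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


def hapax_legomena(words, n=15):
    """First n words (alphabetically) that occur exactly once."""
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c == 1)[:n]


def longest_words(words, n=10):
    """Longest distinct words; ties are broken alphabetically."""
    return sorted(set(words), key=lambda w: (-len(w), w))[:n]
```

Sorting the hapax legomena alphabetically is an arbitrary choice that makes the output reproducible; sampling 15 at random would satisfy the task equally well.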

Part C – Named Entity and Character Analysis
  1. Use POS tagging or NER to extract proper nouns (names, places).
  2. Count how many times each major character appears (e.g., Victor, Frankenstein, monster, Elizabeth).
  3. Find 5 sentences where both "Victor" and "monster" appear together.
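
A rough sketch of Part C without external libraries. The capitalization heuristic in proper_noun_candidates is only a crude stand-in for real POS tagging or NER (for example with NLTK or spaCy) and will miss names that open a sentence; all function names here are hypothetical. Since most of the novel is first-person narration, the co-occurrence search may well return fewer than five sentences.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def proper_noun_candidates(text):
    """Heuristic: capitalized words that are not sentence-initial."""
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        # Skip the first token, which is capitalized regardless of its type.
        counts.update(t for t in tokens[1:] if t[0].isupper() and t != "I")
    return counts


def mention_counts(text, names):
    """Case-insensitive whole-word counts for each name."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {name: counts[name.lower()] for name in names}


def cooccurring_sentences(text, first, second, limit=5):
    """Up to `limit` sentences containing both terms as whole words."""
    pat_a = re.compile(rf"\b{re.escape(first)}\b", re.I)
    pat_b = re.compile(rf"\b{re.escape(second)}\b", re.I)
    hits = [s for s in split_sentences(text)
            if pat_a.search(s) and pat_b.search(s)]
    return hits[:limit]
```

A call such as mention_counts(text, ["Victor", "Frankenstein", "monster", "Elizabeth"]) covers task 2, and cooccurring_sentences(text, "Victor", "monster") covers task 3.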

Part D – Thematic and Chapter Analysis
  1. Split the novel by chapter (using the keyword "Chapter").
  2. For each chapter:
    • Count the number of words
    • Find top 5 words (excluding stopwords)
  3. Track emotion using a lexicon or sentiment library (e.g., the NRC Emotion Lexicon or TextBlob):
    • Compute emotion/sentiment scores per chapter
    • Plot the sentiment progression across chapters
  4. Implement KWIC (Keyword-in-Context) for terms like:
    • “life”, “death”, “science”, “creation”
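
The chapter split, per-chapter scoring, and KWIC steps above could be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the heading pattern assumes lines of the form "Chapter 1" or "Chapter IV", and the POSITIVE and NEGATIVE sets are toy word lists, not the NRC lexicon. A table of contents will also match the heading pattern, so min_words can be used to discard those near-empty chunks.

```python
import re

# Assumed heading format: "Chapter" followed by an Arabic or Roman numeral.
CHAPTER_RE = re.compile(r"^[ \t]*chapter[ \t]+(\d+|[ivxlc]+)\b.*$", re.I | re.M)

# Toy lexicon for illustration only; replace with NRC, TextBlob, or similar.
POSITIVE = {"joy", "hope", "love", "happy", "delight"}
NEGATIVE = {"misery", "despair", "death", "horror", "fear"}


def split_chapters(text, min_words=0):
    """Return (heading, body) pairs, skipping bodies under min_words."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        if len(body.split()) >= min_words:
            chapters.append((m.group(0).strip(), body))
    return chapters


def sentiment_score(text):
    """(positive hits - negative hits) / token count, in [-1, 1]."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def chapter_report(text, min_words=0):
    """Word count and sentiment score for each chapter."""
    return [{"title": title,
             "words": len(re.findall(r"[a-z]+", body.lower())),
             "sentiment": sentiment_score(body)}
            for title, body in split_chapters(text, min_words)]


def kwic(text, keyword, width=4):
    """Keyword-in-context lines with `width` words on each side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines
```

The list of "sentiment" values from chapter_report, taken in chapter order, is the series to plot for the sentiment progression (for instance as a line chart with matplotlib).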

Part E – Stylometric and Dialogue Analysis
  1. Compute the type-token ratio (unique words / total words).
  2. Compare the ratio across early and later chapters.
  3. Detect passive voice constructions (e.g., “was created”, “had been abandoned”).
  4. Extract all direct speech (text enclosed in quotation marks).
  5. Compare sentiment of spoken dialogue vs. narration.
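
A heuristic sketch for the type-token ratio, passive detection, and dialogue extraction in Part E. The function names are hypothetical. The passive pattern is a regex approximation, not a parser: it only catches a form of "to be" followed by a word ending in -ed or -en, so it misses irregular participles ("was made") and produces false positives ("was often"). The quote pattern assumes each opening quotation mark has a matching close, which multi-paragraph speeches in the novel may violate.

```python
import re

# A form of "to be" (optionally "has/have/had been"), an optional -ly
# adverb, then a word ending in -ed or -en. Approximate by design.
PASSIVE_RE = re.compile(
    r"\b(?:(?:has|have|had)\s+been|am|is|are|was|were|be|been|being)"
    r"\s+(?:\w+ly\s+)?\w+(?:ed|en)\b",
    re.I,
)

# Straight or curly double quotes; assumes quotes are properly paired.
QUOTE_RE = re.compile(r'"([^"]+)"|“([^”]+)”')


def type_token_ratio(words):
    """Unique words divided by total words."""
    return len(set(words)) / len(words) if words else 0.0


def find_passives(text):
    """Return candidate passive constructions as matched strings."""
    return [m.group(0) for m in PASSIVE_RE.finditer(text)]


def split_dialogue(text):
    """Return (list of quoted passages, remaining narration text)."""
    dialogue = [a or b for a, b in QUOTE_RE.findall(text)]
    narration = QUOTE_RE.sub(" ", text)
    return dialogue, narration
```

When comparing early and later chapters, note that the type-token ratio falls as text length grows, so it is safer to compute it on samples of equal size (for example the first 2,000 tokens of each chapter) than on whole chapters of different lengths.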

Part F – Advanced NLP (Optional)
  1. Apply topic modeling (LDA) to extract 5 major topics from the text.
  2. Use TF-IDF and cosine similarity to compare Chapter 1 with Chapter 10.
  3. Generate a summary of the entire book using:
    • Extractive method (TextRank)
    • Abstractive method (Transformer model)
  4. Build a binary classifier to distinguish narration vs. dialogue using linguistic features.
  5. Identify most common POS sequences and compare between narration and dialogue.
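
For task 2 of Part F, the TF-IDF and cosine similarity computation can be written out by hand to show what is being measured. This is a sketch under stated assumptions: the function names are hypothetical, term frequency is the raw count, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is one common variant chosen so that terms shared by every document do not drop to zero weight. In practice a library vectorizer (for example scikit-learn's) would usually replace this code.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

To compare Chapter 1 with Chapter 10, one reasonable choice is to build the vectors from all chapters at once, so the IDF weights reflect the whole book, and then pass the two chapter vectors of interest to cosine_similarity.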

Output Requirements