Cranes at CVR Q22

NLP Analysis on IMDb API Movie Metadata

Problem Statement:

You are given an IMDb ID as input. Use the IMDb236 API to retrieve the movie's metadata, then apply Natural Language Processing (NLP) techniques to analyze the text fields in the response.


Part A – API Setup

Part B – NLP Tasks on Various Text Fields
  1. description:
    • Clean the description using regex (retain only alphabetic characters).
    • Convert to lowercase, tokenize, remove stopwords.
    • Lemmatize the words.
    • Find and display the top 5 most frequent words.
  2. interests (list of keywords):
    • Join into a single string and treat it as a document.
    • Use TF (term frequency) to find dominant thematic words.
    • Compute the overlap between the description and interests tokens (set intersection).
  3. genres (e.g., Action, Crime, Drama):
    • Convert to lowercase and check if genre terms appear in description.
    • Count the matches and list the genres that are not mentioned (e.g., a movie listed as “Drama” whose description never contains the word “drama”).
  4. cast fullName and characters:
    • Extract all character names and actor names.
    • Use named entity pattern matching (NER-style) to classify each name as a real person or a fictional character (if feasible).
    • Count how many character names are mentioned in the description.
  5. writers and directors:
    • Use lemmatization to group similar writer/director role terms (e.g., write, writer, written).
    • Generate a word cloud or frequency bar plot of creator names (first names, surnames).
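
A minimal sketch of tasks 1 to 4 in plain Python may help make the steps concrete. It is an illustration under stated assumptions, not a reference solution: the sample `movie` record is made up (only the field names description, interests, genres, fullName, and characters come from the task list), the stopword list is a tiny hand-written stand-in, and lemmatization is left as a pluggable `lemmatize` callable that defaults to doing nothing. If NLTK and its WordNet data are installed, `WordNetLemmatizer().lemmatize` could be passed in its place.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a fuller list such as
# NLTK's English stopwords corpus would normally be used instead.
STOPWORDS = {"a", "an", "and", "the", "to", "in", "of", "on", "for", "with", "is", "one"}


def preprocess(text, lemmatize=lambda word: word):
    """Keep letters only, lowercase, tokenize, drop stopwords, lemmatize.

    `lemmatize` is a hypothetical hook: identity by default.
    """
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    tokens = letters_only.lower().split()
    return [lemmatize(tok) for tok in tokens if tok not in STOPWORDS]


def top_words(tokens, n=5):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


def term_frequency(tokens):
    """Relative term frequency: count of a word / total number of tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}


def genre_mentions(genres, desc_tokens):
    """Split genres into those found in the description tokens and those missing.

    Only single-word genres can match here; hyphenated or multi-word
    genres would need extra handling.
    """
    vocab = set(desc_tokens)
    found = [g for g in genres if g.lower() in vocab]
    missing = [g for g in genres if g.lower() not in vocab]
    return found, missing


def character_mentions(characters, description):
    """Return the character names that occur as whole words in the description."""
    text = description.lower()
    return [
        name for name in characters
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
    ]


# Hypothetical record for illustration; a real one would come from the API response.
movie = {
    "description": "Mara, a retired detective, returns to the city to solve "
                   "one last crime. The detective trusts nobody in the city.",
    "interests": ["Crime", "Detective Drama", "Mystery"],
    "genres": ["Crime", "Drama"],
    "cast": [
        {"fullName": "Jane Doe", "characters": ["Mara"]},
        {"fullName": "John Roe", "characters": ["Captain Holt"]},
    ],
}

desc_tokens = preprocess(movie["description"])
interest_tokens = preprocess(" ".join(movie["interests"]))
overlap = set(desc_tokens) & set(interest_tokens)
found, missing = genre_mentions(movie["genres"], desc_tokens)
characters = [c for member in movie["cast"] for c in member["characters"]]
mentioned = character_mentions(characters, movie["description"])

print("Top words:", top_words(desc_tokens))
print("Interest TF:", term_frequency(interest_tokens))
print("Overlap:", overlap)
print("Genres found:", found, "| missing:", missing)
print("Characters mentioned:", mentioned)
```

Matching genres against the token set rather than the raw string avoids false positives from substrings (for example, "war" inside "warning"), which is why `genre_mentions` takes the preprocessed tokens instead of the description text.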