What’s your favorite movie or TV show? Wouldn’t it be nice to find more shows that you might like to watch, based on ones you know you like? Tools that address questions like this are often called “recommender systems.” Powerful, scalable recommender systems are behind many modern entertainment and streaming services, such as Netflix and Spotify. While most recommender systems these days involve machine learning, there are also ways to make recommendations that don’t require such complex tools.

In this Blog Post, you’ll use webscraping to answer the following question:

What movie or TV shows share actors with your favorite movie or show?

The idea of this question is that, if TV show Y has many of the same actors as TV show X, and you like X, you might also enjoy Y.

This post has two parts. In the first, larger part, you’ll write a webscraper for finding shared actors on IMDB. In the second, smaller part, you’ll use the results from your scraper to make recommendations.

Don’t forget to check the Specifications for a complete list of what you need to do to obtain full credit. As usual, this Blog Post should be printed as PDF from your PIC16B Blog.

§1. Setup

§1.1. Locate the Starting IMDB Page

Pick your favorite movie or TV show, and locate its IMDB page. For example, my favorite TV show is Star Trek: Deep Space Nine. Its IMDB page is at:

https://www.imdb.com/title/tt0106145/

Save this URL for a moment.

§1.2. Dry-Run Navigation

Now, we’re just going to practice clicking through the navigation steps that our scraper will take.

First, click on the Cast & Crew link. This will take you to a page with URL of the form

<original_url>fullcredits/

Next, scroll until you see the Series Cast section. Click on the portrait of one of the actors. This will take you to a page with a different-looking URL. For example, the URL for Avery Brooks, star of DS9, is

https://www.imdb.com/name/nm0000984/

Finally, scroll down until you see the actor’s Filmography section. Note the titles of a few movies and TV shows in this section.

Our scraper is going to replicate this process. Starting with your favorite movie or TV show, it’s going to look at all the actors in that movie or TV show, and then log all the other movies or TV shows that they worked on.

§1.3 Initialize Your Project

  1. Create a new GitHub repository, and sync it with GitHub Desktop. This repository will house your scraper. You should commit and push each time you make significant changes to your code.
  2. Open a terminal in the location of your repository on your laptop, and type:
    conda activate PIC16B
    scrapy startproject IMDB_scraper
    cd IMDB_scraper
    

    This will create quite a lot of files, but you don’t really need to touch most of them.

§1.4 Tweak Settings

For now, add the following line to the file settings.py:

CLOSESPIDER_PAGECOUNT = 20

This line just prevents your scraper from downloading too much data while you’re still testing things out. You’ll remove this line later.

§2. Write Your Scraper

Create a file inside the spiders directory called imdb_spider.py. Add the following lines to the file:

# to run 
# scrapy crawl imdb_spider -o movies.csv

import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    
    start_urls = ['https://www.imdb.com/title/tt0106145/']

Replace the entry of start_urls with the URL corresponding to your favorite movie or TV show.

Now, implement three parsing methods for the ImdbSpider class.

  • parse(self, response) should assume that you start on a movie page, and then navigate to the Cast & Crew page. Remember that this page has url <movie_url>fullcredits. Once there, the parse_full_credits(self,response) should be called, by specifying this method in the callback argument to a yielded scrapy.Request. The parse() method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
  • parse_full_credits(self, response) should assume that you start on the Cast & Crew page. Its purpose is to yield a scrapy.Request for the page of each actor listed on the page. Crew members are not included. The yielded request should specify the method parse_actor_page(self, response) should be called when the actor’s page is reached. The parse_full_credits() method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
  • parse_actor_page(self, response) should assume that you start on the page of an actor. It should yield a dictionary with two key-value pairs, of the form {"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}. The method should yield one such dictionary for each of the movies or TV shows on which that actor has worked. Note that you will need to determine both the name of the actor and the name of each movie or TV show. This method should be no more than 15 lines of code, excluding comments and docstrings.

Provided that these methods are correctly implemented, you can run the command

scrapy crawl imdb_spider -o results.csv

to create a .csv file with a column for actors and a column for movies or TV shows.

§2.1. Hints

Here are a few hints for the two trickier parsing methods.

parse_full_credits()

On the Cast & Crew page, the list comprehension

[a.attrib["href"] for a in response.css("td.primary_photo a")]

will create a list of relative paths, one for each actor. This command mimics the process of clicking on the headshots on this page.

parse_actor_page()

This is likely to be the most challenging method to implement. I’ll leave the challenge of figuring out the actor name to you. As far as the titles of movies and TV shows, here are CSS selectors that I found to be very helpful. You might be able to use them in your own solution.

element.css("::attr(id)")
element.css("div.filmo-row")
element.css("a::text")

Note that these are not in the order in which you should call them. Experimentation in the scrapy shell is strongly recommended. Don’t forget that you can also use the Developer Tools on your browser to inspect individual HTML elements.

§3. Make Your Recommendations

Once your spider is fully written, comment out the line

CLOSESPIDER_PAGECOUNT = 20

in the settings.py file. Then, the command

scrapy crawl imdb_spider -o results.csv

will run your spider and save a CSV file called results.csv, with columns for actor names and the movies and TV shows on which they worked.

Once you’re happy with the operation of your spider, compute a sorted list with the top movies and TV shows that share actors with your favorite movie or TV show. For example, here’s the list I obtained for Star Trek: Deep Space Nine. Of course, most shows will “share” the most actors with themselves.

movie number of shared actors
0 Star Trek: Deep Space Nine 538
1 ER 116
2 Star Trek: Voyager 111
3 NYPD Blue 111
4 Star Trek: The Next Generation 108
5 L.A. Law 104
6 JAG 99
7 Murder, She Wrote 85
8 Hunter 82
9 The Practice 78

Feel free to be creative. You can show a pandas data frame, a chart using matplotlib or plotly, or any other sensible display of the results.

§5. Blog Post

In your blog post, you should describe how your scraper works, as well as the results of your analysis. When describing your scraper, I recommend dividing it up into the three distinct parsing methods, and discussing them one-by-one. For example:

In this blog post, I’m going to make a super cool web scraper… Here’s a link to my project repository… Here’s how we set up the project…

<implementation of parse()>

This method works by…

<implementation of parse_full_credits()>

To write this method, I…

In addition to describing your scraper, your Blog Post should include a table or visualization of numbers of shared actors like the one shown above, as well as a link to your GitHub repository.

Remember that this post is still a tutorial, in which you guide your reader through the process of setting up and running the scraper. Don’t forget to tell them how to create the project and run the scraper!

§4. Specifications

Coding Problem

  1. Each of the three parsing methods appear logically and correctly implemented.
  2. parse() is implemented in no more than 5 lines.
  3. parse_full_credits() is implemented in no more than 5 lines.
  4. parse_actor_page() is implemented in no more than 15 lines.
  5. A table or list of results is shown.
  6. The code is housed in a GitHub repository.

Style and Documentation

  1. Each of the three parse methods has a short docstring describing its assumptions (e.g. what kind of page it is meant to parse) and its effect, including navigation and data outputs.
  2. Each of the three parse methods has helpful comments for understanding how each chunk of code operates.

Writing

  1. The blog post is written in tutorial format, in engaging and clear English. Grammar and spelling errors are acceptable within reason.
  2. The blog post explains clearly how to set up the project, run the scraper, and access the results.
  3. The blog post explains how each of the three parse methods works.
  4. The blog post includes a link to the GitHub repository for the scraper project.