What You Need to Know About Web Scraping Rotten Tomatoes

Rotten Tomatoes is one of the most popular movie review sites, but it's not easy to scrape. The site has a RESTful interface that requires authentication tokens, and you may also need to work through proxies. That's where we come in: follow our how-to guide, and you'll be on your way to scraping!

Whether you want to use the data for your portfolio projects or just for curiosity's sake, it can be helpful. In this tutorial, you'll learn how to scrape Rotten Tomatoes by starting with the basic information page of the review section and then working your way down to the individual movie pages.

What is web scraping?

Web scraping or web harvesting is a technique of extracting information from the web by programmatically accessing it using a computer. Various implementations are possible in multiple languages, but we'll use Python here.

How can you use web scraping for businesses?

Consider gathering data on the competition or your products if you run a small business or blog. Web scraping can be done manually, but it's much more efficient if done automatically. For example, if you want to find out what a competitor's page looks like and how many products they offer - scrape their page!

First, let's install the necessary Python libraries. Make sure you have pip installed, and then use it to fetch the required libraries, for example with pip install beautifulsoup4 (urllib ships with Python and needs no installation).

Note: If your pip installation is outdated, upgrade it first with python -m pip install --upgrade pip.

Let's Start Scraping Rotten Tomatoes

To begin, download and install wget as well as Python, M2Crypto and BeautifulSoup4. We'll be using BeautifulSoup4 to parse the Rotten Tomatoes HTML into a structure we can search. Once you have them installed, follow the steps below:

Step 1: Get a Ticket Number

Get a ticket number by visiting http://www.rottentomatoes.com/contact/form and filling out the entire form. We'll use this ticket number as our session token.

Step 2: Create a Rotten Tomatoes Account

Create an account here if you've never logged in to Rotten Tomatoes. It will give you a new account number we need to use to access the site's RESTful interface.

Here are the descriptions of the different method calls:

POST: log in with the authentication token

GET: get content as JSON

POST: query data by parameters, e.g., movies

POST: query data by ID

GET: get data from a resource dereference in an id parameter

GET: get data from a resource dereference using a hash parameter, e.g. a movie

GET: get data from a resource and if it's a video, also get metadata

Step 3: Log in to Rotten Tomatoes

Now you're ready to log in with your new account number. Go to https://www.rottentomatoes.com/login/?url=/reader/login and paste the ticket number for your session token in the login_url parameter. Then click log in, and you'll be logged in to Rotten Tomatoes! You can browse around the Rotten Tomatoes site, but any attempts to manipulate data could trigger an error from RT's security measures (which is what we want!).

Step 4: Scrape data for your favorite movies using libraries:

In this step, we will scrape a list of films and their reviews from the Rotten Tomatoes database.

For this tutorial, we'll be using BeautifulSoup4 to parse the HTML string of a movie's information page. We'll need three things:

The movie title is in the following format: http://www.rottentomatoes.
The movie release date (in YYYY-MM-DD format).
The Rotten Tomatoes score.
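Release dates rarely arrive in YYYY-MM-DD form, so they usually need normalizing. The sketch below shows one way to do that with the standard library. The input format ("Aug 1, 2014") and the helper name to_iso_date are assumptions made for illustration, not something the site is known to return.

```python
from datetime import datetime


def to_iso_date(text, input_format="%b %d, %Y"):
    """Convert a date string such as 'Aug 1, 2014' to 'YYYY-MM-DD'.

    The default input_format is an assumption; adjust it to whatever
    the scraped page actually contains.
    """
    parsed = datetime.strptime(text.strip(), input_format)
    return parsed.date().isoformat()


print(to_iso_date("Aug 1, 2014"))
```

If a page uses a different layout, pass a matching format string as the second argument.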

First, let's scrape all the movie titles. For this purpose, we can use the urlopen() function from urllib.request (urllib itself has no get() method) and pass the URL to be scraped as a parameter.
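As a rough sketch, fetching a page with the standard library could look like this. It uses urllib.request.urlopen(); the User-Agent string and the helper names build_request and fetch_html are illustrative assumptions. Nothing is downloaded until fetch_html() is actually called.

```python
from urllib.request import Request, urlopen

# Assumed, illustrative User-Agent; some sites reject the urllib default.
USER_AGENT = "Mozilla/5.0 (compatible; tutorial-scraper)"


def build_request(url):
    """Create a GET request object that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its body as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as fetch_html("https://www.rottentomatoes.com/") would then return the page's HTML as a string, provided the site allows the request.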

We'll be using four parameters for this snippet, but there are only three unique parameters needed:

RottenTomatoesTicketNumber: You obtained a ticket number from the ticket request form in Step 1.

Step 5: The Rotten Tomatoes API version

Change it appropriately if you want to experiment with newer APIs or other languages!

We start by importing urllib and BeautifulSoup4 libraries. Then we construct the URL of the API call to be made and store it in rtURL . We also build a dictionary which contains the arguments that will pass along with the method call.

Next, we make a POST request using urllib.
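The two steps above (building rtURL plus an argument dictionary, then preparing a POST request) might be sketched as follows. The endpoint and the argument names are placeholders, since the real ones are not given here, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: the real API URL is not given in this tutorial.
rtURL = "https://example.com/api/movies"

# Hypothetical argument names, chosen only for illustration.
args = {"q": "The Room", "apiVersion": "2"}


def build_post_request(url, arguments):
    """Encode the arguments as a form body and wrap them in a POST request."""
    body = urlencode(arguments).encode("utf-8")
    return Request(url, data=body, method="POST")


request = build_post_request(rtURL, args)
print(request.get_method(), request.full_url)
print(request.data)
```

Passing the resulting request object to urllib.request.urlopen() would send it.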

After reading the HTML, we parse it by passing it to the BeautifulSoup constructor (BeautifulSoup4 has no parse() method), keep only the content block, and get the title by finding all tags with an id attribute value of 'title'.
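Here is a minimal sketch of that parsing step, assuming the beautifulsoup4 package is installed. The sample markup is invented for the example and is not the real Rotten Tomatoes page structure.

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a downloaded page.
SAMPLE_HTML = """
<div id="content">
  <h1 id="title">The Room</h1>
  <p class="info">Some other text</p>
</div>
"""


def find_titles(html):
    """Return the text of every tag whose id attribute equals 'title'."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(id="title")]


print(find_titles(SAMPLE_HTML))
```

The "html.parser" backend is built into Python, so no extra parser needs to be installed.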

We then display the HTML using print("<pre>%s</pre>" % HTML) so that you can verify that we're getting what is expected.

Next, we loop through the dictionary parameters and create a query with each of them. Then we extract their values into a variable called movie_info. Once we have the movie information, we can get the review data.

Here is an example of the output:

<table class="table table-bordered"> <tbody> <tr> <td class="col-md-12">Title</td> <td class="col-md-12">Release Year</td> <td class="col-md-12"><a href="/movies/4915/title.html" target="_blank">The Room</a></td> </tr> ...

First, we check that the title link has an href value. If it does, we grab it with the tag's get() method, and then use BeautifulSoup4's find_all() method to retrieve the direct links for the title, release year and Rotten Tomatoes rating. If all of them are found in the HTML, we read their values from the parsed tags and assign the result to a variable called first_mv.
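The link check can be sketched like this, using the sample row shown earlier as input and assuming beautifulsoup4 is installed. The helper name extract_links is made up for the example; first_mv follows the variable name used in the text.

```python
from bs4 import BeautifulSoup

# The sample output row from above, closed off so it is valid HTML.
SAMPLE_TABLE = (
    '<table class="table table-bordered"><tbody><tr>'
    '<td class="col-md-12">Title</td>'
    '<td class="col-md-12">Release Year</td>'
    '<td class="col-md-12"><a href="/movies/4915/title.html" '
    'target="_blank">The Room</a></td>'
    "</tr></tbody></table>"
)


def extract_links(html):
    """Return (link text, href) pairs for every anchor that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a"):
        href = anchor.get("href")  # None when the attribute is missing
        if href:
            links.append((anchor.get_text(strip=True), href))
    return links


first_mv = extract_links(SAMPLE_TABLE)[0]
print(first_mv)
```

Using get("href") rather than indexing avoids a KeyError on anchors that carry no href attribute.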

Step 6: Scrape Rotten Tomatoes Access Token:

Now that we have all the movies collected in a variable, let's access more interesting data about them.

To do this, we'll need a get_model() helper to access a resource with a specific ID on Rotten Tomatoes. Note that get_model() is not a BeautifulSoup4 method, so the script has to define it itself. It takes four parameters (the first two being mandatory):

The resource being accessed: http://www.rottentomatoes.com/m/title/tt?t=

The ID of the resource being accessed:

The Rotten Tomatoes API version: 2

The session token that login() returned.

An important thing to note is this will only work if the resource exists in the RT database, but even so, it's good to have around in case you want to make some changes on your own later.

Store the response of get_model() in the variable response. For the movie resource, the display_name value will be 'title', the release_date value will be '2014-12-15', and the review_count will be 50.

Finally, we'll display some of this information on our terminal.
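As an illustration, printing those fields could look like the sketch below. The response dictionary is hard-coded with the example values mentioned above rather than fetched from the site, and the describe helper is a made-up name.

```python
# Example values from the text above, hard-coded instead of fetched.
response = {
    "display_name": "title",
    "release_date": "2014-12-15",
    "review_count": 50,
}


def describe(movie):
    """Format the three fields on one line for the terminal."""
    return "{display_name} | released {release_date} | {review_count} reviews".format(
        **movie
    )


print(describe(response))
```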

Step 7: Get a list of 10 movie plot summaries:

Now let's extract just a few movie plot summaries with another function called get_plots(). To get the plot summaries, we'll loop through the movie_info dictionary to determine the ID of each movie and fetch just a few of them with a get_hits() helper. get_hits() is not a BeautifulSoup4 method, so the script has to define it itself. It accepts four parameters:

The resource being accessed: http://www.rottentomatoes.

Once again, let's display some of this information on our terminal.
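A rough sketch of the looping part follows. The shape of movie_info (IDs mapped to dictionaries with a "plot" key) and the placeholder entries are assumptions made purely for illustration; real data would come from the earlier scraping steps.

```python
from itertools import islice

# Hypothetical structure with placeholder entries, not real scraped data.
movie_info = {
    "movie-%d" % n: {"title": "Movie %d" % n, "plot": "Plot summary %d" % n}
    for n in range(1, 13)
}


def first_plots(movies, limit=10):
    """Return (movie id, plot) pairs for the first `limit` movies."""
    return [
        (movie_id, details["plot"])
        for movie_id, details in islice(movies.items(), limit)
    ]


for movie_id, plot in first_plots(movie_info):
    print(movie_id, "-", plot)
```

islice() stops after ten entries without copying the whole dictionary first.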

Step 8: Store the scraped data

The parsed information can now be stored in an appropriate data format. In this tutorial, we keep it in a CSV file.
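One way to write such a file is Python's csv module, sketched below. The rows are the two example movies used in this tutorial, and the output goes to an in-memory buffer so the result can be inspected directly; the helper name write_movies is an assumption.

```python
import csv
import io

# The two example rows used in this tutorial.
ROWS = [
    (
        "Guardians of the Galaxy (2014)",
        "2014-08-01",
        "http://www.rottentomatoes.com/m/guardians_of_the_galaxy_2014/",
    ),
    (
        "Captain America: The Winter Soldier (2014)",
        "2014-04-03",
        "http://www.rottentomatoes.com/m/captain-america_the_winter_soldier_2014/",
    ),
]


def write_movies(rows, handle):
    """Write a header line followed by one CSV line per movie."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Title", "Release Date", "URL"])
    writer.writerows(rows)


buffer = io.StringIO()
write_movies(ROWS, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("movies.csv", "w", newline="", encoding="utf-8"), where movies.csv is just an example file name.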

Here is the content of the file:

Title,Release Date,URL
Guardians of the Galaxy (2014),2014-08-01,"http://www.rottentomatoes.com/m/guardians_of_the_galaxy_2014/"
Captain America: The Winter Soldier (2014),2014-04-03,"http://www.rottentomatoes.com/m/captain-america_the_winter_soldier_2014/"

Step 9: Download the Rotten Tomatoes CSV file.

Parsing the file into a dictionary should be straightforward, but let's note it down now so we remember it later.
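For reference, a minimal sketch of that parsing step with csv.DictReader is shown here. It reads from an in-memory string instead of a real file, and keying the dictionary by title is an assumption made for the example.

```python
import csv
import io

# In-memory stand-in for the CSV file produced in Step 8.
CSV_TEXT = (
    "Title,Release Date,URL\n"
    "Guardians of the Galaxy (2014),2014-08-01,"
    "http://www.rottentomatoes.com/m/guardians_of_the_galaxy_2014/\n"
)


def load_movies(handle):
    """Return a dictionary that maps each title to its full CSV row."""
    return {row["Title"]: row for row in csv.DictReader(handle)}


movies = load_movies(io.StringIO(CSV_TEXT))
print(movies["Guardians of the Galaxy (2014)"]["Release Date"])
```

DictReader takes the column names from the header line, so no manual string splitting is required.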

The parsing could be improved by using regexes to build this string, or by defining a few constants dedicated to strings and reducing all instances to just those we need.

Here are the things I could have changed:

Step 10: Clean up after ourselves:

The process is almost complete, but a few steps remain before we send the modified data for analysis.

The last thing we need to do is to close the connection with RT.

We can call a library method such as file.close() or socket.close(), depending on the connection type and the library used. The connection should be closed once all of the user's requests and pending jobs have completed. We only want to keep the connection to RT open briefly, to avoid putting undue load on its servers or compromising their security.
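In Python, a with block is a common way to make sure a resource is closed even when an error occurs. The sketch below applies it to both a network response and a file; the function names and the temporary file are illustrative assumptions, and read_page() is defined but not called, so nothing is downloaded.

```python
import os
import tempfile
from urllib.request import urlopen


def read_page(url, timeout=10):
    """Read a page; the response is closed when the with block exits."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_text(path, text):
    """Write text to a file and report whether the file ended up closed."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return handle.closed


# Temporary location used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "movies.txt")
print(save_text(path, "The Room\n"))
```

With this pattern there is no need to call close() by hand, which also keeps connections to the site as short as possible.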

With these steps, you should have compiled a list of movies you are interested in, or at least curious about, along with the plot summaries extracted from Rotten Tomatoes.

Conclusion

Now that the script is complete, let's look at what we learned.

A movie title can be found from the first two words of an ID, and a movie release date can be seen from the year-month-date format and stored in YYYY-MM-DD format. Finally, the Rotten Tomatoes API offers a detailed description of every review using multiple tags (for example: 'genre', 'comedy', etc.).

You have now learned how to scrape data from a website in Python using urllib and BeautifulSoup4, without needing the Requests library.

The script used the urllib library to grab information from the internet. Once this data was obtained, it was parsed with BeautifulSoup4 and stored in a dictionary for later use or analysis.

Finally, we used the get_model() method to access Rotten Tomatoes data with a specific ID and displayed some of this information on our terminal.

We hope you enjoyed following this tutorial and learned something valuable from it.