Data Scraping — A Fun, Yet Winding Road

Mark Dowicz
May 7, 2021

As a data scientist, data cleaning, EDA, and modeling are only part of the battle. The process of finding and collecting representative data is itself crucial to the data science workflow, and in some cases it can be an excruciating process. Luckily, we have access to one of the largest databases in the world: the web.

Being able to scrape data directly from websites is an effective, powerful, and reproducible skill that every data scientist should have under their belt. A recent project I have been working on revolves around a combination of movie data scraped from the IMDb website and data gathered from the TMDb API. This post will focus on scraping data directly from the IMDb website and putting it into a clean DataFrame that we can reference later (for the purposes of this blog I will only be collecting the title and href for each movie).

The first thing we will need to do is find an IMDb URL that has a list of movies we want to collect data from. I found a solid list of 6,656 movies in descending order by IMDb rating at this link here. You’ll notice that each page only contains 50 movies, and the next list of movies (51–100) is found on the following page. You’ll also notice that the URL for the first 50 movies looks different from the other pages. This is a small yet important roadblock. Let’s compare the two URLs.
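The original screenshot of the two URLs hasn’t survived, but the comparison can be sketched like this (the query parameters on base_url are placeholders; the part that matters is the suffix on page_url):

```python
# Hypothetical base_url -- the real IMDb search URL has its own query parameters.
base_url = "https://www.imdb.com/search/title/?title_type=feature&sort=user_rating,desc"

# The second page (movies 51-100) tacks '&start=51&ref_=adv_nxt' onto the end.
page_url = base_url + "&start=51&ref_=adv_nxt"
```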

The page_url is identical to the base_url except for the end, where ‘&start=51&ref_=adv_nxt’ is appended. If you scroll down to the bottom of the page in your browser and click next, that ‘51’ turns into ‘101’, then ‘151’, and so on. This is good! We can use these increments of 50 to loop through every single page except the first. So, let’s scrape movies 1–50 first. We will start by importing the libraries necessary to scrape the data and build a DataFrame.
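The imports for this walkthrough (assuming requests, beautifulsoup4, and pandas are installed) would look something like:

```python
import requests                 # to request each page's HTML
from bs4 import BeautifulSoup   # to parse the HTML we get back
import pandas as pd             # to build the final DataFrame
```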

From here, we can submit a request to access the URL’s data using the requests library (sorry for the cut-off, but yes, this is the same base_url as above).
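A minimal sketch of that request, wrapped in a function so it only fires when called (base_url is the hypothetical URL from earlier):

```python
import requests

def fetch_page(url):
    """Request a page and return the response object."""
    response = requests.get(url)
    # response.status_code == 200 means the request went through
    return response

# response = fetch_page(base_url)
# print(response)  # <Response [200]> on success
```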

The “Response [200]” means our request went through and we can now access all the information from that webpage. We do this by creating a BeautifulSoup object like so.
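Creating the soup object looks roughly like this; here a small inline HTML snippet stands in for response.text so the sketch runs on its own (in the real script you would pass response.text instead):

```python
from bs4 import BeautifulSoup

# Stand-in for response.text, mimicking the structure of one IMDb list entry.
sample_html = """
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
  </h3>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```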

This soup object is an iterable object that we can loop through to find the information we need from the specific page we requested. Simply printing out the soup object looks incredibly messy and intimidating. There’s too much information for us to parse out and find the info (movie titles and hrefs) we are looking for. What we can do is use the Inspect option in our browser directly on the webpage itself to locate the information we want. Here is what that looks like.

The bottom image shows us exactly where the information is located amongst the muck of other HTML info that we don’t necessarily care about (for right now anyway). Let’s break down what we are looking at.

We notice all these drop-down arrows and tags wrapped in <>. These tags are used as containers for information, and each tag has a class which indicates what is contained in that specific tag. Our information (the title) is located within the <h3> tag with class = ‘lister-item-header’. This is exactly what we are looking for. Every other movie on this page follows the same format, where the movie title is located in the <h3> tag with class = ‘lister-item-header’. Let’s use this information to loop through the entire page and collect all 50 movie titles. We can do that with a simple for loop and an empty list to append each movie title into.
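That loop might look like the sketch below. A two-movie HTML snippet stands in for the real page so the example is self-contained; in the actual script, soup is built from response.text:

```python
from bs4 import BeautifulSoup

# Stand-in for one page's HTML (the real page has 50 of these <h3> tags).
sample_html = """
<h3 class="lister-item-header"><a href="/title/tt0111161/">The Shawshank Redemption</a></h3>
<h3 class="lister-item-header"><a href="/title/tt0068646/">The Godfather</a></h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

titles = []
for h3 in soup.find_all("h3", {"class": "lister-item-header"}):
    titles.append(h3.find("a").text)

print(titles)  # ['The Shawshank Redemption', 'The Godfather']
```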

The BeautifulSoup library allows us to specify the tags that we are looking for, along with the specific key value pair (class = ‘lister-item-header’) we are interested in. This is a very simple, yet powerful way to loop through html documents.

One thing I have yet to mention is the .find(‘a’) in our loop. The ‘a’ element represents an anchor element. If you notice on the actual webpage, the titles are all hyperlinked. This ‘a’ element contains two pieces of information: its text and its ‘href’ attribute. You’ll notice the ‘a’ doesn’t have a ‘class’ attribute, but rather an ‘href’ attribute. This is a unique value that is specific to the movie it is referencing. We want that information too. So let’s grab it.
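Extending the same loop to grab the href attribute could look like this (again with a small stand-in snippet in place of the real page):

```python
from bs4 import BeautifulSoup

sample_html = """
<h3 class="lister-item-header"><a href="/title/tt0111161/">The Shawshank Redemption</a></h3>
<h3 class="lister-item-header"><a href="/title/tt0068646/">The Godfather</a></h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

titles, hrefs = [], []
for h3 in soup.find_all("h3", {"class": "lister-item-header"}):
    a = h3.find("a")
    titles.append(a.text)
    hrefs.append(a["href"])  # the href attribute, e.g. '/title/tt0111161/'

print(hrefs)  # ['/title/tt0111161/', '/title/tt0068646/']
```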

Cool, now we have all 50 movies and their corresponding hrefs. But 50 movies isn’t very many, let’s get all 6,656.

To do that we need a list of numbers ranging from 51 to 6651 in steps of 50 (51, 101, 151, etc.). From there we can plug each number into our page_url using an f-string to get the list of URLs we need, and then loop through every webpage and collect the movie title and corresponding href. Let’s take a look.
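Here is a sketch of that step. The URL building runs as-is; the scraping loop is commented out so the example doesn’t fire thousands of network requests (base_url is still the hypothetical URL from earlier):

```python
# Build the start values 51, 101, ..., 6651 and plug each into the page URL.
base_url = "https://www.imdb.com/search/title/?title_type=feature&sort=user_rating,desc"

page_urls = [f"{base_url}&start={n}&ref_=adv_nxt" for n in range(51, 6652, 50)]

print(len(page_urls))  # 133 pages beyond the first

# The scraping loop itself (commented out so the sketch runs offline):
# for url in page_urls:
#     soup = BeautifulSoup(requests.get(url).text, "html.parser")
#     for h3 in soup.find_all("h3", {"class": "lister-item-header"}):
#         a = h3.find("a")
#         titles.append(a.text)
#         hrefs.append(a["href"])
```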

Awesome! Both of our lists now contain 6,656 movies and corresponding hrefs. From here we can create a pandas DataFrame.
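Building the DataFrame from the two lists is one line; the short stand-in lists below make the sketch self-contained (in the real script they hold all 6,656 entries):

```python
import pandas as pd

# Stand-in lists; the real titles and hrefs come from the scraping loop.
titles = ["The Shawshank Redemption", "The Godfather"]
hrefs = ["/title/tt0111161/", "/title/tt0068646/"]

df = pd.DataFrame({"title": titles, "href": hrefs})
print(df.shape)  # (2, 2)
```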
