In this article, you'll learn how to scrape imdb.com for information about movies of different genres, using Python with Beautiful Soup and requests.
If you are new to web scraping, here is a brief introduction. Data science, data analysis, machine learning, and similar fields need large amounts of data, and the web is one of the best sources for it. Visiting many websites and manually copying the relevant information into a specific format is not practical, however. Web scraping is a technique that automates the process of collecting data from web pages and storing it in a suitable format, which makes it possible to extract a large amount of information from different websites in a short time.
Why scrape IMDB?
Suppose you want to build a recommendation engine that suggests movies according to your taste. For this you'll need data sets of movies from different genres, including each movie's name, rating, release year, Metascore, genre, short description, certificate, votes, and so on. All of these details are available in one place: IMDB (the Internet Movie Database), a website owned by Amazon and one of the best platforms for finding information about films, television shows, web series, and more.
Steps involved:
- Install and import the required modules.
- Get the URLs of the different genres.
- Request and parse the page behind each genre URL using requests and Beautiful Soup.
- Extract information about the movie title, genre, year of release, rating, certificate, Metascore, votes, etc.
- Convert all the information into a pandas data frame and save it as a CSV file.
Installing and importing modules
Install the modules using pip:
$ pip install requests lxml bs4 pandas
- requests: A Python module for sending HTTP requests to a particular URL.
- lxml: An XML and HTML parser that Beautiful Soup can use to parse the content of the web page.
- bs4: Beautiful Soup, a tool for extracting information from parsed web pages.
- pandas: A Python library for working with tabular data as rows and columns.
import requests
from bs4 import BeautifulSoup
import pandas as pd

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
Along with the imports, we also define a HEADERS dictionary. Its User-Agent entry lets the server identify the client's operating system, application, version, etc.
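If the same headers are needed for every request, one option is to attach them to a requests.Session once instead of passing them on each call. The sketch below is only an illustration of that idea and is not part of the original script: DEMO_HEADERS is a shortened, hypothetical copy of the dictionary above, and the example URL is simplified. The request is prepared but never sent, so the merged headers can be inspected without any network access.

```python
import requests

# Shortened, hypothetical copy of the HEADERS dictionary, for illustration only.
DEMO_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# A Session sends its headers with every request made through it.
session = requests.Session()
session.headers.update(DEMO_HEADERS)

# Build the request without sending it, to see which headers would go out.
request = requests.Request("GET", "https://www.imdb.com/search/title/?genres=Adventure")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
```

With a session set up like this, session.get(url) could replace requests.get(url, headers=HEADERS) in the rest of the script.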
Getting URLs of different pages
The first step is to get the URLs for the different movie genres, such as Animation, Adventure, Drama, Comedy, and Horror.
genres = [
    "Adventure", "Animation", "Biography", "Comedy", "Crime",
    "Drama", "Family", "Fantasy", "Film-Noir", "History",
    "Horror", "Music", "Musical", "Mystery", "Romance",
    "Sci-Fi", "Sport", "Thriller", "War", "Western",
]

url_dict = {}
for genre in genres:
    # The {} placeholder after "genres=" is filled in with the genre name
    url = "https://www.imdb.com/search/title/?genres={}&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16"
    formatted_url = url.format(genre)
    url_dict[genre] = formatted_url

print(url_dict)
The code substitutes each genre from the genres list into the URL, giving one URL per genre. Each genre and its corresponding URL are then stored in a dictionary.
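The long URL carries several pf_rd_* and ref_ parameters that look like referral tracking rather than search filters. Assuming IMDB does not need them to return the same results (an assumption that has not been verified here), the URLs could be built from the search parameters alone with urllib.parse.urlencode. The helper name build_genre_url and the names BASE_URL and demo_urls are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_genre_url(genre):
    """Build a search URL for one genre (hypothetical helper)."""
    params = {
        "genres": genre,
        "sort": "user_rating,desc",
        "title_type": "feature",
        "num_votes": "25000,",
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return BASE_URL + "?" + urlencode(params, safe=",")


demo_genres = ["Adventure", "Animation", "Sci-Fi"]
demo_urls = {genre: build_genre_url(genre) for genre in demo_genres}

print(demo_urls["Sci-Fi"])
```

Building the query from a dictionary keeps each filter visible on its own line, which makes it easier to change the sort order or the vote threshold later.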
Parsing movie information
Now let's parse the movie information from IMDB:
url = "https://www.imdb.com/search/title/?genres=Adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16"

# Send a request to the specified URL
resp = requests.get(url, headers=HEADERS)

# Convert the response to a Beautiful Soup object
content = BeautifulSoup(resp.content, 'lxml')

# Iterate through the list of movies
for movie in content.select('.lister-item-content'):
    try:
        # Create a Python dictionary with the details of one movie
        data = {
            "title": movie.select('.lister-item-header')[0].get_text().strip(),
            "year": movie.select('.lister-item-year')[0].get_text().strip(),
            "certificate": movie.select('.certificate')[0].get_text().strip(),
            "time": movie.select('.runtime')[0].get_text().strip(),
            "genre": movie.select('.genre')[0].get_text().strip(),
            "rating": movie.select('.ratings-imdb-rating')[0].get_text().strip(),
            "metascore": movie.select('.ratings-metascore')[0].get_text().strip(),
            "simple_desc": movie.select('.text-muted')[2].get_text().strip(),
            "votes": movie.select('.sort-num_votes-visible')[0].get_text().strip(),
        }
    except IndexError:
        # Skip movies that are missing any of the fields
        continue
    print(data)
The above code sends a request to the specified URL and receives a response. The HTML in this response is then parsed into a Beautiful Soup object, using lxml as the parser.
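The request can fail, for example when the server answers with an error status, and in that case there is nothing useful to parse. A minimal sketch of a more defensive fetch step follows. The helper fetch_soup is hypothetical, and its get parameter exists only so that the function can be tried with a stand-in instead of a real network call. It uses Python's built-in "html.parser" to keep the sketch free of extra dependencies; 'lxml' as used in the article would work the same way here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers, get=requests.get, timeout=10):
    """Download a page and parse it, raising on HTTP errors (hypothetical helper)."""
    # A timeout stops the script from waiting forever on a stalled connection.
    resp = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    resp.raise_for_status()
    return BeautifulSoup(resp.content, "html.parser")
```

Calling fetch_soup(url, HEADERS) would then stand in for the requests.get and BeautifulSoup lines above.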
The CSS selectors are copied from the page that contains the information we need. For example, the movie titles are inside elements with the ".lister-item-header" class. You can inspect the elements of the web page in your browser to find the CSS selectors. The selectors in the code are probably still the same, but check them if the script returns nothing, since the markup of a page can change.
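To see how the selectors behave without touching the live site, they can be tried on a small piece of HTML. The markup below is hypothetical and heavily simplified; it only mimics the class names used in the code above. The helper text_or_none is also hypothetical. It uses select_one, which returns None when nothing matches, instead of indexing into the list returned by select.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that only mimics the class names used above.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">Inception</a>
    <span class="lister-item-year">(2010)</span></h3>
  <span class="runtime">148 min</span>
  <span class="genre">Action, Adventure, Sci-Fi</span>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movie = soup.select_one(".lister-item-content")

print(text_or_none(movie, ".lister-item-year"))  # (2010)
print(text_or_none(movie, ".runtime"))           # 148 min
print(text_or_none(movie, ".certificate"))       # None, the sample has no certificate
```

The article's code skips a movie entirely when any field is missing. A helper like this would instead keep the movie and record the missing field as None; which behavior is preferable depends on how the data will be used.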
The output will look like this:
{'title': '1.\nThe Lord of the Rings: The Return of the King\n(2003)', 'year': '(2003)', 'certificate': 'U', 'time': '201 min', 'genre': 'Action, Adventure, Drama', 'rating': '8.9', 'metascore': '94 \n Metascore', 'simple_desc': "Gandalf and Aragorn lead the World of Men against Sauron's army to draw his gaze from Frodo and Sam as they approach Mount Doom with the One Ring.", 'votes': 'Votes:\n1,751,318\n| Gross:\n$377.85M'}
{'title': '2.\nInception\n(2010)', 'year': '(2010)', 'certificate': 'UA', 'time': '148 min', 'genre': 'Action, Adventure, Sci-Fi', 'rating': '8.8', 'metascore': '74 \n Metascore', 'simple_desc': 'A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster.', 'votes': 'Votes:\n2,230,857\n| Gross:\n$292.58M'}
{'title': '3.\nThe Lord of the Rings: The Fellowship of the Ring\n(2001)', 'year': '(2001)', 'certificate': 'U', 'time': '178 min', 'genre': 'Action, Adventure, Drama', 'rating': '8.8', 'metascore': '92 \n Metascore', 'simple_desc': 'A meek Hobbit from the Shire and eight companions set out on a journey to destroy the powerful One Ring and save Middle-earth from the Dark Lord Sauron.', 'votes': 'Votes:\n1,772,911\n| Gross:\n$315.54M'}
{'title': '4.\nIl buono, il brutto, il cattivo\n(1966)', 'year': '(1966)', 'certificate': 'A', 'time': '161 min', 'genre': 'Adventure, Western', 'rating': '8.8', 'metascore': '90 \n Metascore', 'simple_desc': 'A bounty hunting scam joins two men in an uneasy alliance against a third in a race to find a fortune in gold buried in a remote cemetery.', 'votes': 'Votes:\n733,132\n| Gross:\n$6.10M'}
............
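The raw values still contain rank numbers, newlines, and labels such as "Votes:" and "Metascore", so they usually need some cleaning before analysis. The sketch below shows one possible way to do that with regular expressions. The helper clean_movie is hypothetical, and the patterns assume the text always has the shape shown in the output above. The sample record is copied from the first movie, with the description left out for brevity.

```python
import re


def clean_movie(raw):
    """Return a copy of one scraped record with numeric fields parsed (hypothetical helper)."""
    # The title text holds the rank, the title, and the year on separate lines.
    parts = [part.strip() for part in raw["title"].split("\n") if part.strip()]
    title = parts[1] if len(parts) > 1 else raw["title"].strip()

    def to_int(text, pattern):
        # Return the first captured group as an integer, or None if absent.
        match = re.search(pattern, text)
        return int(match.group(1).replace(",", "")) if match else None

    return {
        **raw,
        "title": title,
        "year": to_int(raw["year"], r"(\d{4})"),
        "time": to_int(raw["time"], r"(\d+)"),
        "rating": float(raw["rating"]),
        "metascore": to_int(raw["metascore"], r"(\d+)"),
        "votes": to_int(raw["votes"], r"Votes:\s*([\d,]+)"),
    }


raw_movie = {
    "title": "1.\nThe Lord of the Rings: The Return of the King\n(2003)",
    "year": "(2003)",
    "certificate": "U",
    "time": "201 min",
    "genre": "Action, Adventure, Drama",
    "rating": "8.9",
    "metascore": "94 \n Metascore",
    "votes": "Votes:\n1,751,318\n| Gross:\n$377.85M",
}

cleaned = clean_movie(raw_movie)
print(cleaned["title"], cleaned["year"], cleaned["votes"])
```

Fields that are not listed in the return value, such as the certificate and genre, are passed through unchanged.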
Creating a scraping function
Now let's create a function that does the same as above but can be reused for different URLs.
import time


def get_movies(url, interval, file_name):
    # Pause before sending the request so that repeated calls
    # do not hit the server in quick succession
    time.sleep(interval)
    resp = requests.get(url, headers=HEADERS)
    content = BeautifulSoup(resp.content, 'lxml')
    movie_list = []
    for movie in content.select('.lister-item-content'):
        try:
            data = {
                "title": movie.select('.lister-item-header')[0].get_text().strip(),
                "year": movie.select('.lister-item-year')[0].get_text().strip(),
                "certificate": movie.select('.certificate')[0].get_text().strip(),
                "time": movie.select('.runtime')[0].get_text().strip(),
                "genre": movie.select('.genre')[0].get_text().strip(),
                "rating": movie.select('.ratings-imdb-rating')[0].get_text().strip(),
                "metascore": movie.select('.ratings-metascore')[0].get_text().strip(),
                "simple_desc": movie.select('.text-muted')[2].get_text().strip(),
                "votes": movie.select('.sort-num_votes-visible')[0].get_text().strip(),
            }
        except IndexError:
            # Skip movies that are missing any of the fields
            continue
        movie_list.append(data)
    # Build a data frame from the collected dictionaries and save it
    dataframe = pd.DataFrame(movie_list)
    dataframe.to_csv(file_name)


url = "https://www.imdb.com/search/title/?genres=Adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16"

# Calling the function
get_movies(url, 0, 'Adventure_movies.csv')
This function creates a Python dictionary for each movie with the information parsed from the web page, collects the dictionaries in a list, and then builds a pandas data frame from that list and saves it as a CSV file. The function takes three parameters: the URL of the web page, an interval in seconds that the function sleeps for to slow the scraping down, and the name of the CSV file to save.
In the above case, a CSV file is saved in the working directory as 'Adventure_movies.csv'.
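The step from a list of dictionaries to a CSV file can be tried on its own without scraping anything. The sketch below uses two shortened sample records and writes to an in-memory buffer rather than a file so that the result can be printed; the names sample_movies, csv_text, and restored are hypothetical. It also passes index=False, which the function above does not do, to leave the data frame's row numbers out of the CSV.

```python
import io

import pandas as pd

# Two shortened sample records, taken from the output shown earlier.
sample_movies = [
    {"title": "Inception", "year": "(2010)",
     "genre": "Action, Adventure, Sci-Fi", "rating": "8.8"},
    {"title": "Il buono, il brutto, il cattivo", "year": "(1966)",
     "genre": "Adventure, Western", "rating": "8.8"},
]

# Each dictionary becomes a row; the keys become the column names.
dataframe = pd.DataFrame(sample_movies)

# Write to an in-memory buffer instead of a file so the CSV can be inspected.
buffer = io.StringIO()
dataframe.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same rows and columns.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.shape)
```

Values that contain commas, such as the genre list, are quoted automatically by to_csv, so they survive the round trip as a single column.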
Scraping movies of different genres
The get_movies() function we wrote above can parse the details from the IMDB page of any genre URL and save them as a CSV file. By calling it once per genre, we can scrape all the genres and save each one as a separate CSV file. Let's see how this can be done.
for genre, url in url_dict.items():
    get_movies(url, 1, genre + '.csv')
    print("Saved:", genre + '.csv')
You will see the CSV files for the different genres being saved in your working directory one by one.