Beautiful Soup is a Python library that is very handy for projects like screen scraping. Here’s a brief tutorial on how to scrape a list of the top 250 movies from IMDB.com and write them to a local text file:
1) Download Beautiful Soup
Downloading Beautiful Soup is very easy. I’m currently using version 3 and so I simply downloaded the tarball and copied BeautifulSoup.py to my Python project folder.
2) Copy IMDB Top 250 Movies Web Page Locally
Since my Python application is not sending an HTTP User-Agent header, any requests that it sends to IMDB.com are rejected. I’ll probably fix this at some point, but for now the easiest solution was to save a copy of the Top 250 Movies web page to my local hard drive as imdb250.htm.
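A rough sketch of what sending a User-Agent header might look like is below. The URL and the agent string are placeholders of mine, and I haven't verified that IMDB.com accepts requests built this way. The import falls back to urllib2 so the same code loads on Python 2 and Python 3.

```python
# Sketch only: build a request that carries a User-Agent header.
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

# Hypothetical values: substitute the real page URL and an agent string of your choice.
PAGE_URL = "http://www.example.com/top250"
USER_AGENT = "Mozilla/5.0 (compatible; imdb-scraper-sketch)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object with the User-Agent header set."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body. This is the only part that touches the network."""
    return urlopen(build_request(url)).read()
```

Calling fetch(PAGE_URL) would then return the raw HTML, which could be handed to BeautifulSoup in place of the local copy.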
3) Write Python Code Using Beautiful Soup (imdb.py)
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

try:
    # If the output file can be opened, there is nothing to do.
    with open('imdb.txt') as f: pass
    print "File (imdb.txt) already exists."
except IOError as e:
    print "Generating new file (imdb.txt)."
    try:
        # Python 2's urllib.urlopen also reads local file paths.
        text = urlopen('imdb250.htm').read()
        soup = BeautifulSoup(text)
        f = open("imdb.txt", "w")
        # find() returns the first table on the page.
        table = soup.find('table')
        links = table.findAll('a')
        for item in links:
            f.write(item.string + '\n')
        f.close()
    except IOError as e:
        print "Target file (imdb250.htm) could not be found."
4) Run Program
$ python imdb.py
First, I check whether my output file, “imdb.txt”, exists. If it already exists, then I don’t need to do anything. If it doesn’t, then I open the local copy of the IMDB web page, “imdb250.htm”, in read mode.
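The existence check can also be written without opening the file, for example with os.path.exists. This is just a minimal sketch of that alternative; the helper name needs_rebuild is mine and not part of imdb.py.

```python
import os


def needs_rebuild(output_path):
    """Return True when the output file is missing and has to be generated."""
    return not os.path.exists(output_path)


if __name__ == "__main__":
    if needs_rebuild("imdb.txt"):
        print("Generating new file (imdb.txt).")
    else:
        print("File (imdb.txt) already exists.")
```

One difference worth noting: the open-based check also fails when the file exists but can't be read, while os.path.exists only looks at whether the path is there.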
Next I instantiate the BeautifulSoup class with the HTML from that web page. Then I use BeautifulSoup to find the first HTML table in the page and all of the <a> tags inside it (which I know are links to the movie pages).
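To make the table-then-links step concrete without needing Beautiful Soup installed, here is a rough standard-library sketch of the same idea using HTMLParser. It is not what imdb.py uses and not how Beautiful Soup works internally; the class name and the sample markup are made up for illustration, and the real page layout may differ.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FirstTableLinks(HTMLParser):
    """Collect the text of every <a> inside the first <table> of a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0        # how many <table> tags we are currently inside
        self.done = False     # set once the first table has been closed
        self.in_link = False  # True while we are between <a> and </a>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "a":
            self.in_link = False
        elif tag == "table" and self.depth > 0:
            self.depth -= 1
            self.done = (self.depth == 0)

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


# Made-up sample markup standing in for the saved web page.
SAMPLE = """
<a href="/home">Home</a>
<table>
  <tr><td><a href="/title/1">First Movie</a></td></tr>
  <tr><td><a href="/title/2">Second Movie</a></td></tr>
</table>
<table><tr><td><a href="/other">Other</a></td></tr></table>
"""

parser = FirstTableLinks()
parser.feed(SAMPLE)
print(parser.titles)
```

The link outside the table and the link in the second table are both skipped, which mirrors what soup.find('table') followed by table.findAll('a') selects.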
Then I open my output file, “imdb.txt”, in write mode and write the string value of each link (the movie title) to that text file. Then I close the file and we’re done.
Whenever I need to re-run this, I just make another local copy of the IMDB web page and then delete the “imdb.txt” file.
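The write step can be sketched a little more defensively with a with-block, which closes the file automatically. The helper name write_titles is hypothetical. The None check is there because Beautiful Soup's .string can be None, for example when an <a> tag wraps more than one child node, and concatenating None with '\n' would raise a TypeError.

```python
import io


def write_titles(titles, output_path):
    """Write one title per line and return how many were written."""
    written = 0
    # The with-block closes the file even if a write fails part-way.
    with io.open(output_path, "w", encoding="utf-8") as out:
        for title in titles:
            if title is None:
                # Skip links that have no single string value.
                continue
            out.write(u"%s\n" % title)
            written += 1
    return written
```

In imdb.py this would be called with something like [item.string for item in links] and "imdb.txt".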
Here is an example of the “imdb.txt” file created by this program.