Screen Scraping with Beautiful Soup

Beautiful Soup is a Python library that is very handy for projects like screen scraping. Here’s a brief tutorial on how to scrape the list of the top 250 movies from IMDB and write the titles to a local text file:

1) Download Beautiful Soup

Downloading Beautiful Soup is very easy. I’m currently using version 3, so I simply downloaded the tarball and copied it to my Python project folder.

2) Copy IMDB Top 250 Movies Web Page Locally

Since my Python application is not sending an HTTP User-Agent header, any requests that it sends to IMDB are rejected. I’ll probably fix this at some point, but for now the easiest solution was to save a copy of the Top 250 Movies web page to my local hard drive, e.g. as imdb250.htm.
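For anyone who wants to try the direct route, here is a minimal sketch of attaching a User-Agent header to a request. It is written for Python 3's standard library (urllib.request), not the Python 2 setup used in the rest of this post. The URL, the user-agent string, and the build_request helper are assumptions for illustration, and I can't promise the site will accept the request even with the header set.

```python
from urllib.request import Request

# Assumed values for illustration; neither appears in the original post.
IMDB_TOP_250_URL = "http://www.imdb.com/chart/top"
USER_AGENT = "Mozilla/5.0 (compatible; imdb250-scraper)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Building the request does not touch the network.
request = build_request(IMDB_TOP_250_URL)

# To actually fetch the page, hand the request to urllib.request.urlopen:
#     from urllib.request import urlopen
#     html = urlopen(request).read()
```

The fetch itself is left as a comment so the sketch can be run without a network connection.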

3) Write Python Code Using Beautiful Soup

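from urllib import urlopen
from BeautifulSoup import BeautifulSoup

try:
    # If the output file already exists, there is nothing to do.
    with open('imdb.txt') as f: pass
    print "File (imdb.txt) already exists."
except IOError as e:
    print "Generating new file (imdb.txt)."
    try:
        # urlopen() reads a local file when the name has no URL scheme.
        text = urlopen('imdb250.htm').read()

        soup = BeautifulSoup(text)

        f = open("imdb.txt", "w")

        # The first table holds the list; its <a> tags link to the movie pages.
        table = soup.find('table')

        links = table.findAll('a')
        for item in links:
            # item.string is None for links that wrap other tags (e.g. images).
            if item.string:
                f.write(item.string.encode('utf-8') + '\n')

        f.close()
    except IOError as e:
        print "Target file (imdb250.htm) could not be found."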

4) Run Program

$ python

First, I check whether my output file, “imdb.txt”, exists. If it does, I don’t need to do anything. If it doesn’t, I open the local copy of the IMDB web page, “imdb250.htm”, for reading.

Next, I instantiate the BeautifulSoup class with the HTML from that web page. I then use BeautifulSoup to find the first HTML table in the page and all the <a> tags inside it (which I know are links to the movie pages).
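To make the "first table, then every link inside it" logic concrete, here is a rough sketch that does the same thing with Python 3's standard-library html.parser instead of Beautiful Soup. It is not the code from this post, just a dependency-free illustration; the FirstTableLinks class, the extract_titles helper, and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser


class FirstTableLinks(HTMLParser):
    """Collect the text of every <a> tag inside the first <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0      # > 0 while inside the first table
        self.table_done = False   # True once the first table has closed
        self.in_link = False      # True while inside an <a> in that table
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if self.table_done:
            return
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if self.table_done:
            return
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
            if self.table_depth == 0:
                # Everything after the first table is ignored.
                self.table_done = True
                self.in_link = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def extract_titles(html):
    """Return the non-empty link texts found in the first table of html."""
    parser = FirstTableLinks()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles if t.strip()]


# Hypothetical sample page: one link outside any table, then two tables.
sample = """
<a href="/home">Home</a>
<table>
  <tr><td><a href="/title/1">The Shawshank Redemption</a></td></tr>
  <tr><td><a href="/title/2">The Godfather</a></td></tr>
</table>
<table><tr><td><a href="/other">Other</a></td></tr></table>
"""

print(extract_titles(sample))
# prints ['The Shawshank Redemption', 'The Godfather']
```

Links outside the first table are skipped, which mirrors calling find('table') and then findAll('a') on the result.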

Then I open my output file, “imdb.txt”, in write mode and write the string value of each link, i.e. the title of each movie, to that text file. Then I close the file and we’re done.
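The check-then-write steps can also be wrapped in one small function. This is only a sketch under my own assumptions: the write_titles name and its True/False return value are hypothetical, and it uses io.open with an explicit UTF-8 encoding so that titles with accented characters are written intact.

```python
import io
import os


def write_titles(path, titles):
    """Write one title per line unless the output file already exists.

    Returns True if the file was written, False if it was left alone.
    """
    if os.path.exists(path):
        return False
    # The with-block closes the file even if a write fails part-way through.
    with io.open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + u"\n")
    return True
```

Calling write_titles("imdb.txt", titles) a second time leaves the existing file untouched, which matches the skip-if-present behaviour described above.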

Whenever I need to re-run this, I just make another local copy of the IMDB web page and delete the “imdb.txt” file.

Here is an example of the “imdb.txt” file created by this program.

