Using Solvent to extract data from structured pages

solvent_logo.png

There is a lot of structured data in web pages. While this data is usually backed by structured storage of some sort, a lot of the semantics are lost by the time the page is rendered in the web browser. Simile Solvent lets you capture data from web pages as RDF, a common data representation, so that data consumers can explore the data on their own terms. But even if you have no interest in RDF (or the Semantic Web) you can still use these tools to generate something you’re more familiar with (like a spreadsheet). Of course, if you are interested in RDF (or learning more about RDF) this can be a great way to get yourself some data to play with. More on that in a later post.

Solvent is not for every task. It is best for smaller datasets and reasonably straightforward structures. For anyone with serious data-scraping needs, it’s probably best to write your own non-browser-based solution, especially if you’d like to automate the data-scraping, handle error conditions, and cache data locally.

A few other notes: Most of my experience with Solvent is with extracting data from lists of items that have properties. I would like to experiment more with items that are linked to one another and items from different data sources. Also, if you are generating RDF for consumption, spend some extra time thinking about how you want to structure the data and picking good URIs (possibly ones that already exist) for your data items and properties. I’ve kept mine very simple (http://someproperty as opposed to something more appropriate like http://wingerz.com/baseball#someproperty). But for now, this should be enough to get you started.

0. Get the plugin and install it.

1. Find some data. The data needs to be in a reasonably structured format, preferably generated by a machine or an anal human. Table structures with one row per element are especially good, but anything with a consistent structure will work fine. Bring up the Solvent interface by clicking the icon in the status bar.

icon.png
Solvent icon.

Inspired by Lee, we’ll be looking at some baseball team rosters. In particular, let’s start with the Red Sox.

data.png
Innocent data, waiting to be freed.

2. Capture a unit of content. Click the ‘Capture’ button and move the mouse around on the page. When you click on an element, Solvent places the XPath expression into the text input box and populates the list below with items that satisfy the path. The end goal is to have each item in the list enclose a unit of content. If you expand an item you will see some of the text that has been parsed out of the element captured by the path. Each of these has its own path as well, though these are not displayed.

Note that this step will sometimes require a bit of experimentation. You may have to click on a TD element to get to an enclosing TR. To drop the last part of the path, click the blue arrow, or change the path yourself by editing the text. Be sure to inspect the XPath expression; it errs on the side of being as specific as possible. For example, some tables color even rows one color and odd rows another by specifying different classes for each set of rows. If you happen to click on one of the even rows when capturing, you will find that the XPath expression includes [@class="evenrow"] or something to that effect.

On my first attempt with the baseball data, I clicked ‘Capture’ and then picked a cell with a player’s birthday. This resulted in all of the TDs being selected. What I really want is all of the enclosing TRs, so I edit the expression accordingly.

capture.png
Experiment until you get what you want. This isn’t right yet.
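For example (these paths are invented; yours will depend on the page’s markup), the captured expression might start out as something like

//table[@class="stats"]/tbody/tr[@class="evenrow"]/td[5]

Dropping the last step and the class predicate leaves

//table[@class="stats"]/tbody/tr

which matches every row, even or odd.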

Sometimes you may not be able to capture your Items exactly the way you want to. For example, a table may take two rows to display a unit of content. This case will require more work on your part.

3. Once each Item in the Item list encapsulates what you want, assign some properties and variables. This is how you tell Solvent how to generate RDF from your data. Pick one of the items and expand it. You will see some textual properties. If you select one of them and click the Name drop-down, you will have several options.
Item’s URI: If there is an identifying URI associated with the item, assign it to this option. This URI represents the item and in the generated RDF it will be the subject of statements about this particular item.
Item’s title: Will be assigned as the item’s dc:title.
Item’s description: Will be assigned as the item’s dc:description.
Item’s address: Will be assigned as the item’s address. This also triggers a coordinate lookup, and the resulting coordinates are dropped into the RDF as well.
Custom property: Enter the URI of the predicate that you would like to use to link the item to this text.
Custom variable: Enter a variable name and the value will be assigned to a JavaScript var of this name.
Process text further: Do any sort of text massaging that is necessary. This comes in handy for removing $’s and ,’s from money values and doing regular expression matching.

The above can be combined with one another. For example, after processing text further you probably want to assign it to a property or a variable. In the case where you would like to generate multiple values from processed text, generate one of the values, follow the next step, and rewrite/add additional code as necessary.

There is no way to undo something here once you’ve assigned it. If you make a mistake, just follow the next step and examine the generated code. Changing property/variable names in the generated JavaScript code is quite simple.

In the baseball example, the first thing that shows up under an item is a link to a page for that player, so I’ve decided to use it as the player’s URI. The URI should be unique for every one of your data items. If your data doesn’t supply you with a URI, generate one yourself, either by concatenating some other properties or by keeping a counter for URI generation (like http://item0, …). The player’s name serves as a good title. I’ve assigned the next field to the http://position property. The height comes in a form that is hard to use unless I convert it to inches first, so I used ‘Process text further’ to split the text on ' so that the output is an array containing the numbers I want. Since I want to be able to process this further in the JavaScript code, I assign it to a variable as well. Alternatively, I could have had my ‘Process text further’ code do the conversion to inches and assigned the result to a property.

assign.png
Assigning some stuff.
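As a rough sketch of the height conversion mentioned above (the variable that ‘Process text further’ hands you may be named differently; treat this as pseudocode for the conversion rather than Solvent’s exact interface):

// sketch: turn a height like 6' 2" into inches
var parts = text.split("'");              // e.g. ["6", " 2\""]
var feet = parseInt(parts[0], 10);
var inches = parseInt(parts[1], 10);      // parseInt skips the space and stops at the quote
var heightInInches = feet * 12 + inches;  // assign this to a property or variable afterwards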

4. Generate the JavaScript code by clicking the ‘Generate’ button. This should drop a bunch of code into the code editing window on the left. If you assigned properties, you might be able to click Run (make sure you do not navigate away from the source web page) to generate some RDF (which is placed into the Results tab). Most of the time the code will require some editing to avoid processing header rows.

The basic structure of the JavaScript code is as follows: given the XPath expression from the text input, the code gathers every element in the document that matches it. It then iterates over each element, grabbing the appropriate values (also using XPath) and generating the corresponding RDF statements and JavaScript vars. Note that if you have assigned something to a var but you don’t do anything with it (like add an RDF statement containing it), it will not show up in the Results.
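Roughly speaking, the generated code has the following shape. This is a simplified sketch using the standard DOM XPath API rather than Solvent’s exact output, with made-up field paths; the data.addStatement calls are explained below.

// simplified sketch of the generated code's shape -- not Solvent's literal output
var xpath = "//table/tbody/tr";   // the expression from the text input box
var rows = document.evaluate(xpath, document, null,
                             XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < rows.snapshotLength; i++) {
    var row = rows.snapshotItem(i);
    // per-field paths are relative to the row (these particular paths are invented)
    var link = document.evaluate("./td[1]/a", row, null,
                                 XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
    if (link == null) { continue; }                      // header/spacer rows -- see the notes below
    var uri = link.href;                                 // assigned as the item's URI
    data.addStatement(uri, "http://purl.org/dc/elements/1.1/title",
                      link.textContent, true);           // item's title (dc:title)
    var position = document.evaluate("./td[2]", row, null,
                                     XPathResult.FIRST_ORDERED_NODE_TYPE, null)
                           .singleNodeValue.textContent;
    data.addStatement(uri, "http://position", position, true);   // custom property
}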

RDF statements are added via calls to data.addStatement. This takes four arguments: the subject, predicate, and object of the generated statement, plus a boolean indicating whether the object is a literal (pass true) or another resource referred to by URI (pass false). Things like a player’s team should be added as resource objects because they are entities that can have their own properties.
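For example (with a made-up playerURI and team URI):

// literal object: last argument is true
data.addStatement(playerURI, "http://position", "Pitcher", true);
// resource object: last argument is false -- the team is its own entity
data.addStatement(playerURI, "http://team", "http://teams/redsox", false);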

The log function can be used to send messages to the Console tab. This is helpful for debugging and sanity checking.

If you do not specify anything as the item’s URI in the Name drop-down, Solvent will create a default one for you, but that same URI will be used as the subject of every statement generated. In nearly all cases this is not the correct behavior, and you should generate a unique URI for each item.

If you do not enter a URI as a property when using the Name drop-down (maybe you put in ‘name’ as the property instead of ‘http://example.com/name’), you’ll get an InvocationTargetException. Make sure that anything you use as the subject of an RDF statement is a valid URI as well. It’s quite easy to pick up an invalid or missing URI from header and spacer rows, so check each URI before you use it as the subject of any statements. Usually if the URI is missing it means you can skip the element entirely.
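Putting the last few notes together, a little defensive code inside the generated loop goes a long way (a sketch, consistent with the earlier one):

// inside the loop: skip header and spacer rows that don't yield a usable URI
if (!uri || uri.indexOf("http") != 0) {
    log("skipping row " + i + ": no usable URI");
    continue;
}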

5. Run the code by clicking the ‘Run’ button. Check the Results tab for output. This is RDF. Paste it into a text file and use utilities like grep and wc to do a sanity check on your data (Alternatively, follow the next few steps to create a CSV view of the data to be viewed in a spreadsheet). If your “Top 100 X” list has 93 items, chances are that some of the items weren’t caught by the XPath expression. Try to figure out what’s missing. Use log to generate helpful messages as you go along. When you investigate the cause of this, you may notice that one of the items is missing a surrounding div or something to that effect. Use the web developer extension for Firefox, especially “Display element information” (Ctrl-Shift-F). If only a small amount of the data was not parsed, you may be able to correct it by hand. On the other hand, if there are a lot of inconsistencies it will be very difficult for any program to parse the data.
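Assuming the output is N3 (which is what we’ll feed to Babel in the next step), a single item comes out looking roughly like this, with made-up URIs and values:

<http://example.com/players/1234>
    <http://purl.org/dc/elements/1.1/title> "Some Player" ;
    <http://position> "P" ;
    <http://height> "74" .

A quick grep -c "http://position" roster.n3 (using whatever file name you saved the output under, and a property that appears once per item) gives an item count to compare against the expected total.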

6. If getting RDF isn’t your end goal, use Babel and Exhibit to convert it to something else. In Babel, convert your RDF (pick the n3 option and either upload your file or paste the text into their textarea) to Exhibit JSON and click ‘Upload and preview’. This will generate an Exhibit, which is a faceted browser view for your data (Note that if you have more than several hundred items this might be slow when you run Exhibit in preview mode).

babel.png
Use Babel to preview and convert data.

This will give you a way to browse the data (which can be helpful for debugging), but more importantly, Exhibit lets you copy the displayed data into another format, like tab-delimited text – handy for pasting into Excel or Many Eyes (To do this, click the ‘Copy All’ button above your data and select the appropriate format). Also, you can generate your own Exhibit showing the facets of your choosing in a more aesthetic way.

For the baseball example, I have gone ahead and done step 7 (the following step), which got me data for 1039 players. This is a bit much for Exhibit, so I decided to upload it to Many Eyes. First, I converted it to tab-delimited text using Babel (which took a while), then I pasted it into Excel. I scanned through the data for “holes” and found a few: some of the Japanese players, like Daisuke, don’t have their height and weight listed. That’s not too troubling. One thing that is troubling, however, is the fact that Yoslan Herrera and Juan Miranda aren’t going to be born for another 43 years.

notborn.png
Not born until the future.

Anyhow, I cleaned up the data, saved it as tab-delimited text, and cut and pasted it in as a new Many Eyes dataset. Then I went to work creating a scatterplot, a bar graph, and a histogram. What did I learn? On average, pitchers tend to be younger. DH’s tend to be older and fatter. 2B’s and SS’s tend to be skinnier. The tallest reported player in the league is Jon Rauch, at 6’11”.

7. Scrape multiple pages. Oftentimes content is spread over multiple pages. Start out by creating the code to parse a single page (which is what we just did). Then create a new code tab, click the Insert button above the code window, and select ‘Code to scrape several pages’. At the bottom of the file you’ll find the interesting part. First, remove the line that says scrapePage(document);.

Populate the urls var with all of the URLs you want to scrape. Sometimes this is easy because the page number is encoded as a parameter in the GET request, so you just need to write a loop and not screw up your arithmetic. Other times the URLs you want are in a list of links on the page. If this is the case, pop open another code tab and use ‘Capture’/’Generate’ to generate some code to extract the URLs from the page. Then integrate this into the gatherPagesToScrape function, creating an array of the extracted URLs.
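For instance, if the team id shows up as a GET parameter (the URL pattern here is invented; substitute whatever the site actually uses):

// hypothetical URL pattern -- adjust to the real one from the site
var urls = [];
for (var team = 1; team <= 30; team++) {
    urls.push("http://example.com/roster.aspx?team=" + team);
}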

Integrate the code from scraping a single page by pasting it into scrapePage. piggybank.scrapeURL creates a new web browser window that is used to track all of the URLs that are queued to be opened and scraped. As you copy the code over, try to keep your changes localized to this function; that way it will be much easier to copy the code back out into its own window for debugging if you run into problems.
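The end result is just the step 4 code wrapped in a function, with one subtle but important change: work against the document that gets passed in rather than the window’s own document (again, a sketch):

// sketch: the single-page code moves inside scrapePage and uses the passed-in document
function scrapePage(doc) {
    var rows = doc.evaluate("//table/tbody/tr", doc, null,
                            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0; i < rows.snapshotLength; i++) {
        // ... the same per-row extraction and data.addStatement calls as in step 4 ...
    }
}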

In our example the natural thing to do is to scrape all of the team pages. One more trick was required for this. If you try to start from the page listing the teams, you’ll run into “Permission denied to get property HTMLDocument.documentElement”. This is because the domains of the two pages are different (www.usatoday.com vs. fantasybaseball.usatoday.com) and Firefox’s same-origin policy doesn’t allow a script from one domain to read documents from another. I set up a simple proxy (loosely based on this Yahoo Developer Network proxy) so that I can go through wingerz.com for all of my calls.

As with all large batch jobs, make sure that your code works on a subset of the data before starting your scraper.
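One easy way to do that during a trial run, assuming the urls array from above:

urls = urls.slice(0, 2);   // scrape just a couple of pages until the code looks right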

This technique can be used to scrape multiple levels of pages. For example, a list of items may be broken up across several pages, and clicking on an item may take you to a page containing details about a particular item. In this case you can get all of the URLs, then call piggybank.scrapeURL, passing in a function to handle a page with a list of items. From inside this function you’ll call piggybank.scrapeURL again, passing in a function to handle a page with details about a single item.
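A sketch of that two-level pattern is below. Check the generated template for piggybank.scrapeURL’s exact argument order before copying anything, and note that gatherDetailUrls is a stand-in for code you would generate yourself with ‘Capture’/’Generate’.

// two-level scraping sketch -- argument order for piggybank.scrapeURL may differ
function scrapeListPage(doc) {
    var detailUrls = gatherDetailUrls(doc);   // hypothetical helper built with Capture/Generate
    for (var i = 0; i < detailUrls.length; i++) {
        piggybank.scrapeURL(detailUrls[i], scrapeDetailPage);
    }
}

function scrapeDetailPage(doc) {
    // extract the item's properties here and add statements with data.addStatement
}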