A few months ago I was working with Cathleen Finn (in IBM Corporate Community Relations), and she mentioned the pains she goes through to get lists of employees working in New England states (for the purpose of sending recruitment emails). The process involves making a request to someone in another group and waiting for them to come back with a spreadsheet. Of course, this has to be repeated for every request and can take several days. It’s especially frustrating because we have all of the data (in an LDAP directory) but there isn’t any good way to get at it to solve this problem.
Lee, Elias, and I had mucked around with SquirrelRDF a while back, running it from the command line. Since I didn’t have access to that machine, I looked up our LDAP schema, wrote a mapping file, and wrote a SPARQL query to pull email addresses for everyone working at an office location in Massachusetts. It took a while to run the query, but in the end I got exactly the data I was looking for, no more and no less.
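For the curious, the query looked something along these lines — just a sketch, with hypothetical predicate names (`emp:workLocationState`, `emp:mail`) standing in for our actual LDAP-to-RDF mapping:

```sparql
# Find email addresses for everyone at a Massachusetts office location.
# The emp: predicate names are hypothetical stand-ins for the real mapping.
PREFIX emp: <http://example.com/ldap/schema#>

SELECT ?name ?mail
WHERE {
  ?person emp:workLocationState "MA" .
  ?person emp:name ?name .
  ?person emp:mail ?mail .
}
```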
I played around with SquirrelRDF a bit more intending to set it up as a service to be used in some of our internal Semantic Web demos. I wrote up some of what I did in a recently published developerWorks article.
I’ve put together a short tutorial on Solvent, a very nice web page parsing utility. It is still a little rough around the edges, but I wanted to throw it out there and continue working on it since there isn’t a whole lot of existing documentation.
Yesterday I blogged about creating an Exhibit for a list of the 100 best-selling games of 2006. Exhibit is great for looking at how data items fall into categories, but it’s not as good for visualizing quantities. IBM’s own Many Eyes provides several very nice visualization tools (Swivel allows data upload and visualization as well, but I am not that familiar with it, and it looks like someone beat me to it).
I uploaded my text data and created a few quick visualizations.
Review score vs. sales. As people have already remarked, a well-reviewed game won’t necessarily sell that well. Alas.
Release month. This is a recreation of one of the charts that appeared in the original Next Generation article. Summer is always kind of quiet and things get more exciting towards the holidays.
Categorization treemap. This is one of my favorite data viewers. Each game is a rectangle. The area is the number of sales. You can drag the labels (next to “Treemap organization”) in order to redraw the treemap. Drag “publisher” all the way to the left to see why EA cranks out annual releases of their sports titles. Drag “genre” over to see the portion of sales that are sports titles or games based on licenses. Dragging “systems” over doesn’t give you a great view of the data because the original data wasn’t all that clean and Many Eyes doesn’t seem to handle multi-value properties. I’m not sure why it’s showing a quote about the game by default instead of the title.
My other favorite data viewer (that I was not able to use) is the Stacked Graph viewer, made popular by the Baby Name Voyager.
One last note: I wasn’t allowed to edit the visualizations after I created them, so keep that in mind as you think of titles and tags for them.
A few weeks ago I came across an article about the top-selling games of 2006. There’s some analysis, then a list of the top 100 games spread across 10 web pages (starting, of course, with games ranked 100 to 91). Unfortunately, there isn’t a great way to really take a close look at the data. For example, I really wanted to see some Nintendo-specific analysis.
The data was screaming to be let out, so I scraped it and put it into an Exhibit. It was not a quick and easy process. I am quite certain that the HTML was hand-coded – the quote marks are inconsistent (some quotes start with |, others with nothing at all), and some of the other elements are mixed up. The game platforms are not very well specified, so I may need to go through and clean the data up later; for this reason the portable/home console sections are not 100% accurate.
Anyhow, now I have a perl Data::Dumper file, a tab-delimited text file, and a JSON representation. I’ll probably upload the text file to Many Eyes for kicks.
Just about every computer user is very familiar with and competent in text search. While end users may not be writing custom search queries, they appreciate UIs that allow them to search with more accuracy and precision. Occasionally users want to find something very specific by searching across people’s names, or book titles, or paper abstracts instead of all of the indexed text in a system. Clever keyword searching and luck can only get you part of the way there.
Sleuth, Boca‘s text indexing component, addresses this problem (in the Boca world). We’ve been using it for quite a while. Similar to LARQ, Boca uses Lucene to index string literals when the feature is enabled. We’ve designated a magic predicate for querying the text index with SPARQL and hooked it into Glitter, our wonderfully-named SPARQL engine. So now we can do SPARQL queries with integrated text queries, like “find me people (not airplane components or animal appendages) where the name matches ‘Wing’”:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX boca: <http://boca.adtech.internet.ibm.com/predicates/>
SELECT ?person ?name
WHERE {
  ?person foaf:name ?name .
  ?name boca:textmatch "Wing" .
}
This powerful feature allows SPARQL-aware developers to roll their own APIs. It’s easy to whip up a search across all literals for traditional text search behavior. With a little more work, you can craft more sophisticated searches, like one for authors of a paper that mentions a specific search term in the abstract (say, “march madness”).
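That paper-abstract search might be sketched like this, using the boca:textmatch magic predicate along with hypothetical `ex:abstract` and Dublin Core creator predicates:

```sparql
# Authors of papers whose abstract mentions "march madness".
# The ex: predicate is hypothetical; boca:textmatch is Sleuth's magic predicate.
PREFIX boca: <http://boca.adtech.internet.ibm.com/predicates/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX ex:   <http://example.com/papers#>

SELECT ?paper ?author
WHERE {
  ?paper dc:creator ?author .
  ?paper ex:abstract ?abstract .
  ?abstract boca:textmatch "march madness" .
}
```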
For more details on how to set this up, please see our documentation on Boca text indexing.
I recently purchased my first DSLR camera. It wasn’t an easy decision, and at some point I was looking for sample photos taken by a non-DSLR under a certain condition (wide aperture). I started with the Flickr Camera Finder. There is so much wonderful data on pages like the list of Canon cameras and the individual camera pages. The data can be viewed in several ways, but it all just leaves me wanting more. Sure, for a particular camera I can search for pictures tagged with “food”, but what if I want to specify photos with a wide aperture that were taken on November 23, 2005?
They’re sitting on a gold mine of data, but the only way to get at it is through the web API (the advanced search is not very powerful). It’s possible to get at some of the EXIF data (photo metadata), but only if you have the ID for a photo; there’s no way to search across all of the images. Even if they managed to implement this particular interface, what if I want to search for photos that satisfy these restrictions and were posted by users within three friend-links of me?
If Flickr slaps a SPARQL endpoint on its data, it opens up all sorts of amazing possibilities. Using API keys, they could allow paid access to the data from photo equipment sellers (and free access to web hackers), who would be able to offer their customers the ability to find pictures taken with particular cameras and lenses and the people who own them (possibly restricting this set of people to friends or foafs). Of course, Flickr could put together a proprietary web API and do this now, but then they would have to code up every new API method request themselves rather than letting data subscribers write their own queries. And SPARQL-able data has the additional benefit of being easier to integrate with other sources.
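The wide-aperture-on-a-particular-day query might look like this against a hypothetical Flickr endpoint (the `flickr:` vocabulary here is entirely made up for illustration):

```sparql
# Photos taken at f/2.8 or wider on November 23, 2005.
# The flickr: vocabulary is invented for illustration only.
PREFIX flickr: <http://example.com/flickr/schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

SELECT ?photo
WHERE {
  ?photo flickr:aperture ?f .
  ?photo flickr:dateTaken ?date .
  FILTER (?f <= 2.8 && ?date = "2005-11-23"^^xsd:date)
}
```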
Our group at IBM Cambridge has open-sourced our Semantic Web projects. Check it out: IBM Semantic Layered Research Platform (with documentation and downloads).
Boca, our enterprise-ready RDF store, is the only component that has been officially released; the others are still in various stages of development. Matt’s post covers Boca’s most important features.
Our group has set up PlanetAdtech for Semantic Web-related blog posts (including blogs belonging to colleagues who are not on PlanetRDF). Subscribe to the feed to track our work; we’ll be releasing more components over the next few weeks.
The elections have come and gone (I voted), and the Pundit’s Monitor has left me thinking about how much better it could have been had the content sources been marked up with simple, barebones eRDF or RDFa. Elias was searching for full names (like “Arnold Schwarzenegger“) in the text of blog posts, so he would have missed references to informal names (like “Ahhhnold” or “the Governator“); clearly these would have been the most amusing entries to read. There were also a few false positives lurking in the results for the candidates with more common names. And the world probably would have come crashing down around us had there been two candidates with the same name.
It would have been great to have entries marked up with URIs of people, states, and races. Text analysis can take you pretty far, but it sounds like a lot of work to extract very specific, valuable information that was very clearly in the minds of the bloggers. Starting small by tagging proper nouns with URIs seems like a good way to get the ball rolling for more widespread SW adoption.
Recently I read a blog entry by Adrian Holovaty on how newspaper publishers should focus on providing news in a somewhat structured form instead of plain text blobs so that it can be analyzed in bulk, mashed-up, and repurposed. The entire entry screams Semantic Web, as a few of the commenters pointed out. So here’s another Semantic Web daydream, like the ones put forth by Lee (in law) and Elias (in everyday life).
Modeling data: In the article Holovaty mentions several types of articles that have a specific structure (like wedding announcements, obituaries, etc.). It’s not difficult to imagine the design for a traditional database system for storing this. For example, consider two tables: one containing data about articles (date published, title) and one containing data about people (name, contact information, etc.). To link articles to people (a many-to-many relationship), we need to create a third table with two columns: one to hold an article ID and one to hold a person ID. Every row in this table would represent a link between an article and a person. Of course, articles and people can be linked in several ways; some possibilities include the person as a major character, minor character, editor, writer, or interviewee. We could create additional tables, one per type of relationship, or we could add a third column to our join table to keep track of the type of relationship between the article and the person.
Of course, it gets even more complicated when you realize that each type of article needs its own table, since each has its own set of defining traits. And perhaps you want a table of places (name, street address, latitude, longitude). All of these things need to be linked together.
Our previously mentioned table with three columns (article, person, and relationship) is a shadow of the Semantic Web. This particular table is quite limited because the article and person are identified by an integer id that is only unique to those tables. In RDF, the core Semantic Web standard, data is expressed as a set of subject-predicate-object triples. In this example, (subject, predicate, object) = (article, relationship, person). Globally unique, resolvable URIs are used instead of integers to identify entities (called resources, the R in RDF), and they are also used to identify predicates.
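To make that concrete, a single join-table row — say, article 42 linked to person 17 as an interviewee — becomes one triple with URIs in all three positions. A CONSTRUCT sketch (all URIs hypothetical):

```sparql
# The relational row (article_id=42, relationship='interviewee', person_id=17)
# becomes a single subject-predicate-object triple. URIs are hypothetical.
PREFIX rel: <http://example.com/news/relationships#>

CONSTRUCT {
  <http://example.com/articles/42> rel:interviewee <http://example.com/people/17> .
}
WHERE { }
```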
Now, adding a new relationship between two resources is easy – just pick a predicate to link them up and add the new (s, p, o) triple. Resources can also be linked to literal data like strings and numbers so any data object can be modelled. There’s no more jumping through data modeling hoops. Because it’s so easy to model and create data, there’s going to be a lot more of it and it will be more descriptive.
Analyzing the data with SPARQL: Using a traditional system, you’d have to spend some time designing a data access API which probably would not be as expressive as you would like. It would also be quite brittle; changes to your data schema would need to be bubbled up to the access API. Opening up RDF data via a SPARQL endpoint would give users a powerful tool to analyze news – instead of being limited to restrictive APIs, they are free to explore the data. And because the data is encoded in RDF, following the relationships between different resources is a trivial matter (that doesn’t involve joining three database tables). Assuming the appropriate triples had been encoded, you could write the following queries: “Find all articles from 2006 mentioning Microsoft that quote Sam Palmisano” and “Find recaps of Laker games where they won by three or fewer points.” Neither of these queries is easy to do via text search, but both are quite straightforward in SPARQL.
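The Microsoft/Palmisano query might be sketched like so, assuming a hypothetical `news:` vocabulary for mentions, quotes, and publication dates:

```sparql
# Articles from 2006 that mention Microsoft and quote Sam Palmisano.
# The news: vocabulary is hypothetical.
PREFIX news: <http://example.com/news/schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?article
WHERE {
  ?article news:mentions ?org .
  ?org foaf:name "Microsoft" .
  ?article news:quotes ?person .
  ?person foaf:name "Sam Palmisano" .
  ?article news:datePublished ?date .
  FILTER (?date >= "2006-01-01"^^xsd:date &&
          ?date <= "2006-12-31"^^xsd:date)
}
```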
Getting data into the Semantic Web: One of the problems Holovaty cites is that journalists are resistant to change. Fortunately, research on semantic wikis (like Semantic Mediawiki) should lead to some interesting and intuitive text-based systems for writing prose and entering the relevant RDF triples in a simple manner.
It’s fun to do these thought experiments, and they go a long way towards convincing us that we’re onto something here. A system like this would be relatively easy to maintain and provide a great service for analyzing current events and mashing them up.
You can construct your query on the ‘Query’ tab and set the endpoint and add graphs on the ‘Graphs’ tab.
The code still needs a bit of clean-up and I’d like to tweak a few more things, but I thought I’d throw it out there. If you click on ‘Get Results’ with all of the default settings, you should get some results. Note that it currently only works in Firefox and requires the UniversalBrowserRead privilege to run.