Newspapers and the Semantic Web

Posted in semantic web, technology, web at 11:30 pm by wingerz


Recently I read a blog entry by Adrian Holovaty on how newspaper publishers should focus on providing news in a somewhat structured form instead of plain text blobs so that it can be analyzed in bulk, mashed-up, and repurposed. The entire entry screams Semantic Web, as a few of the commenters pointed out. So here’s another Semantic Web daydream, like the ones put forth by Lee (in law) and Elias (in everyday life).

Modeling data: In the article Holovaty mentions several types of articles that have a specific structure (like wedding announcements, obituaries, etc.). It’s not difficult to imagine the design of a traditional database system for storing this. For example, consider two tables: one containing data about articles (date published, title) and one containing data about people (name, contact information, etc.). To link articles to people (a many-to-many relationship), we need to create a third table with two columns: one to hold an article ID and one to hold a person ID. Every row in this table would represent a link between an article and a person. Of course, articles and people can be linked in several ways; some possibilities include the person as a major character, minor character, editor, writer, or interviewee. We could create additional tables, one per type of relationship, or we could add a third column to our join table to keep track of the relationship between article and person.
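To make the join-table design concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are placeholders I made up for illustration; nothing here comes from a real newspaper schema.

```python
import sqlite3

# Hypothetical schema: every table, column, and row below is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, published TEXT);
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: the third column records how the person relates to the article.
    CREATE TABLE article_person (
        article_id   INTEGER REFERENCES article(id),
        person_id    INTEGER REFERENCES person(id),
        relationship TEXT
    );
""")
conn.execute(
    "INSERT INTO article VALUES (1, 'City council approves budget', '2006-10-20')"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?)", [(1, "Jane Doe"), (2, "John Roe")]
)
conn.executemany(
    "INSERT INTO article_person VALUES (?, ?, ?)",
    [(1, 1, "writer"), (1, 2, "interviewee")],
)


def people_for_article(conn, article_id, relationship):
    """Return names of people linked to an article by the given relationship."""
    rows = conn.execute(
        """
        SELECT person.name
        FROM article
        JOIN article_person ON article_person.article_id = article.id
        JOIN person ON person.id = article_person.person_id
        WHERE article.id = ? AND article_person.relationship = ?
        ORDER BY person.name
        """,
        (article_id, relationship),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `people_for_article(conn, 1, "interviewee")` walks all three tables to answer one simple question, which is the bookkeeping the rest of this entry is trying to get away from.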

Of course, it gets even more complicated when you realize that each type of article needs its own table since it has its own set of defining traits. And perhaps you want a table of places (name, street address, latitude, longitude). All of these things need to be linked together.

Our previously mentioned table with three columns (article, person, and relationship) is a shadow of the Semantic Web. This particular table is quite limited because the article and person are identified by integer IDs that are unique only within their own tables. In RDF, the core Semantic Web standard, data is expressed as a set of subject-predicate-object triples. In this example, (subject, predicate, object) = (article, relationship, person). Globally unique, resolvable URIs are used instead of integers to identify entities (called resources, the R in RDF), and they are also used to identify predicates.

Now, adding a new relationship between two resources is easy: just pick a predicate to link them up and add the new (s, p, o) triple. Resources can also be linked to literal data like strings and numbers, so any data object can be modeled. There’s no more jumping through data modeling hoops. Because it’s so easy to model and create data, there’s going to be a lot more of it and it will be more descriptive.
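As a rough illustration, a set of triples can be mimicked in plain Python with a set of 3-tuples. This is only a sketch: the example.org URIs and predicate names are invented, and a real RDF store would keep URIs and literals as distinct types rather than treating both as ordinary Python values.

```python
# Hypothetical namespaces; none of these URIs come from a real vocabulary.
NEWS = "http://example.org/news/"
VOCAB = "http://example.org/vocab#"

triples = set()


def add(subject, predicate, obj):
    """Record one (subject, predicate, object) statement."""
    triples.add((subject, predicate, obj))


article = NEWS + "article/1"
jane = NEWS + "person/jane-doe"
john = NEWS + "person/john-roe"

# Resource-to-resource link: the predicate names the relationship.
add(article, VOCAB + "writer", jane)

# Resource-to-literal links: strings and numbers hang off a resource directly.
add(article, VOCAB + "title", "City council approves budget")
add(article, VOCAB + "year", 2006)
add(jane, VOCAB + "name", "Jane Doe")

# A brand-new kind of relationship needs no schema change, just another triple.
add(article, VOCAB + "interviewee", john)
```

Compared with the relational version, adding the "interviewee" link required no new table or column, only a new predicate URI.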

Analyzing the data with SPARQL: Using a traditional system, you’d have to spend some time designing a data access API, which probably would not be as expressive as you would like. It would also be quite brittle; changes to your data schema would need to be bubbled up to the access API. Opening up RDF data via a SPARQL endpoint would give users a powerful tool to analyze news: instead of being limited to restrictive APIs, they could freely explore the data. And because the data is encoded in RDF, following the relationships between different resources is a trivial matter (that doesn’t involve joining three database tables). Assuming the appropriate triples had been encoded, you could write the following queries: “Find all articles from 2006 mentioning Microsoft that quote Sam Palmisano” and “Find recaps of Laker games where they won by three or fewer points.” Note that neither of these queries is easy to do via a text search, but both are quite straightforward in SPARQL.
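To give a feel for how such a query works, here is a toy pattern matcher in plain Python that answers the first question over a handful of made-up triples. It is not SPARQL and not how a real endpoint is implemented; it only mimics the core idea of matching a list of triple patterns that share a variable. The URIs, predicate names, and sample data are all assumptions of mine.

```python
# Roughly the same question in SPARQL, assuming the same made-up vocabulary:
#   PREFIX v: <http://example.org/vocab#>
#   SELECT ?article WHERE {
#     ?article v:year 2006 ;
#              v:mentions <http://example.org/org/microsoft> ;
#              v:quotes <http://example.org/person/sam-palmisano> .
#   }

EX = "http://example.org/"
V = EX + "vocab#"

# Illustrative data only; these articles do not exist.
triples = {
    (EX + "article/1", V + "year", 2006),
    (EX + "article/1", V + "mentions", EX + "org/microsoft"),
    (EX + "article/1", V + "quotes", EX + "person/sam-palmisano"),
    (EX + "article/2", V + "year", 2006),
    (EX + "article/2", V + "mentions", EX + "org/microsoft"),
    (EX + "article/3", V + "year", 2005),
    (EX + "article/3", V + "mentions", EX + "org/microsoft"),
    (EX + "article/3", V + "quotes", EX + "person/sam-palmisano"),
}


def match(triples, patterns, binding=None):
    """Yield variable bindings that satisfy every pattern in the list.

    Terms that are strings starting with '?' are variables; anything else
    must equal the corresponding part of the triple exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or reject if it is already bound elsewhere.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(triples, rest, candidate)


query = [
    ("?article", V + "year", 2006),
    ("?article", V + "mentions", EX + "org/microsoft"),
    ("?article", V + "quotes", EX + "person/sam-palmisano"),
]
results = sorted(b["?article"] for b in match(triples, query))
```

Each pattern narrows the candidates for `?article`, so the "join" is just the same variable appearing in several patterns rather than anything spelled out against table names.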

Getting data into the Semantic Web: One of the problems Holovaty cites is that journalists are resistant to change. Fortunately, research on semantic wikis (like Semantic Mediawiki) should lead to some interesting and intuitive text-based systems for writing prose and entering the relevant RDF triples in a simple manner.
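As a sketch of what that text-based entry might look like, the snippet below pulls triples out of inline annotations written in the `[[property::value]]` style that Semantic MediaWiki uses, leaving readable prose behind. The regular expression, the `extract` helper, the property names, and the sample sentence are my own assumptions, and this handles only the simplest form of the syntax.

```python
import re

# Matches [[property::value]] with an optional |label part.
# Assumption: a simplified subset of wiki-style annotation syntax.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract(subject, text):
    """Return (plain prose, list of triples) for one annotated passage."""
    triples = []

    def replace(m):
        prop, value, label = m.group(1).strip(), m.group(2).strip(), m.group(3)
        triples.append((subject, prop, value))
        # Show the label if one was given, otherwise the value itself.
        return label.strip() if label else value

    prose = ANNOTATION.sub(replace, text)
    return prose, triples


# Hypothetical sentence a journalist might type.
text = "[[quotes::Sam Palmisano]] spoke about [[mentions::Microsoft]] on Tuesday."
prose, found = extract("article/1", text)
```

The journalist still writes an ordinary sentence; the annotations add a few keystrokes and the triples fall out as a side effect.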

It’s fun to do these thought experiments, and they go a long way towards convincing us that we’re onto something here. A system like this would be relatively easy to maintain and provide a great service for analyzing current events and mashing them up.

1 Comment »

  1. Elias Torres » Blog Archive » My Semantic Web search engine said,

    October 24, 2006 at 2:04 am

    […] My Semantic Web search engine I’m done reading my nightly dose of papers, a few were dumb, but a couple a bit more interesting when I noticed a recent Google blog post on Custom Search Engine. It’s basically a Rollyo rip-off, except it comes with ads, but I digress. I had to do the obvious and created a Semantic Web search engine. Anyone is allowed to add sites to it (no invitation required). They have a simple UI to add sites either one at a time or bulk uploading. They do support OPML (yay not) but to make it even worse, you must “tag” your URLs with cryptic include/exclude tags specific to my search engine: bleuch!. I have not been the most optimistic SW enthusiast lately, but you look at these hacks and wonder whether this such simplicity of OPML or one-of XML formats is the true sweet spot for data interchange on the web. At least, Wing is sharing his positive thinking on SW so I can have sweet dreams and not think about the horrendous paper I had to read for school tonight.. […]
