Using SPARQL for Good: Querying LDAP with SquirrelRDF

Posted in development, semantic web, technology at 12:34 am by wingerz


A few months ago I was working with Cathleen Finn (in IBM Corporate Community Relations), and she mentioned the pains she had to go through to get lists of employees working in New England states (for the purpose of sending recruitment emails). The process involves making a request to someone in another group and waiting for them to come back with a spreadsheet. The request has to be repeated every time a list is needed, and each one can take several days. It’s especially frustrating because we have all of the data (in an LDAP directory) but there isn’t any good way to get at it to solve this problem.

Lee, Elias, and I had mucked around with SquirrelRDF a while back, running it from the command line. Since I didn’t have access to that machine, I looked up our LDAP schema, wrote a mapping file, and wrote a SPARQL query to pull email addresses for everyone working at an office location in Massachusetts. It took a while to run the query, but in the end I got exactly the data I was looking for, no more and no less.
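For a sense of what a query like that looks like, here is a rough sketch in Python that only assembles the SPARQL text. The `ex:` namespace and the property names (`ex:officeState`, `ex:mail`) are made up for illustration; the real ones depend on the mapping file and the LDAP schema, and actually running the query through SquirrelRDF is not shown.

```python
def build_email_query(state):
    """Return SPARQL text selecting email addresses for one office state.

    The prefix and property names are hypothetical placeholders, not the
    vocabulary from the actual mapping file.
    """
    # Escape backslashes and double quotes so the value is a valid SPARQL literal.
    literal = state.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX ex: <http://example.org/ldap#>\n"
        "SELECT ?email\n"
        "WHERE {\n"
        "  ?person ex:officeState \"%s\" ;\n"
        "          ex:mail ?email .\n"
        "}" % literal
    )


query = build_email_query("MA")
print(query)
```

The two triple patterns share `?person`, so only people whose office state matches contribute an email address to the result.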

I played around with SquirrelRDF a bit more intending to set it up as a service to be used in some of our internal Semantic Web demos. I wrote up some of what I did in a recently published developerWorks article.


Using Solvent to extract data from structured pages

Posted in development, semantic web, technology at 10:06 am by wingerz


I’ve put together a short tutorial on Solvent, a very nice web page parsing utility. It is still a little rough around the edges, but I wanted to throw it out there and continue working on it since there isn’t a whole lot of existing documentation.


Looking at game data through Many Eyes

Posted in development, games, semantic web, technology, web at 1:53 pm by wingerz


Yesterday I blogged about creating an Exhibit for a list of the 100 best-selling games of 2006. Exhibit is great for looking at how data items fall into categories, but it’s not as good for visualizing quantities. IBM’s own Many Eyes provides several very nice visualization tools (Swivel allows data upload and visualization as well, but I am not that familiar with it, and it looks like someone beat me to it).

I uploaded my text data and created a few quick visualizations:

  • Review score vs. sales: As people have already remarked, a well-reviewed game won’t necessarily sell that well. Alas.
  • Release month: This is a recreation of one of the charts that appeared in the original Next Generation article. Summer is always kind of quiet and things get more exciting towards the holidays.
  • Categorization treemap: This is one of my favorite data viewers. Each game is a rectangle. The area is the number of sales. You can drag the labels (next to “Treemap organization”) in order to redraw the treemap. Drag “publisher” all the way to the left to see why EA cranks out annual releases of their sports titles. Drag “genre” over to see the portion of sales that are sports titles or games based on licenses. Dragging “systems” over doesn’t give you a great view of the data because the original data wasn’t all that clean and Many Eyes doesn’t seem to handle multi-value properties. I’m not sure why it’s showing a quote about the game by default instead of the title.

My other favorite data viewer (that I was not able to use) is the Stacked Graph viewer, made popular by the Baby Name Voyager.

One last note: I wasn’t allowed to edit the visualizations after I created them, so keep that in mind as you think of titles and tags for them.

Popular video games of 2006 Exhibit

Posted in development, games, semantic web, technology, web at 12:29 am by wingerz


A few weeks ago I came across an article about the top-selling games of 2006. There’s some analysis, then a list of the top 100 games spread across 10 web pages (starting, of course, with games ranked 100 to 91). Unfortunately, there isn’t a great way to really take a close look at the data. For example, I really wanted to see some Nintendo-specific analysis.

The data was screaming to be let out, so I scraped it and put it into an Exhibit. It was not a quick and easy process. I am quite certain that the HTML was hand-coded – the quotes start with “, ", |, or nothing at all, and some of the other elements are mixed up. The game platforms are not very well specified, so I may need to go through and clean them up later; for this reason the portable/home console sections are not 100% accurate.
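The inconsistent quote openers are the kind of thing a small normalizing step can absorb. A minimal sketch in Python, assuming the stray characters only show up at the ends of each quote; the sample strings are invented, and the exact set of trailing characters on the real pages is a guess:

```python
# Characters seen (or likely) at the edges of the scraped quotes.
EDGE_CHARS = '"|\u201c\u201d \t'


def clean_quote(raw):
    """Strip stray quote marks, pipes, and whitespace from both ends."""
    return raw.strip(EDGE_CHARS)


# Invented samples covering the four openers: curly quote, straight quote,
# pipe, and nothing at all.
samples = [
    "\u201cA must-have.\u201d",
    '"A must-have."',
    "|A must-have.",
    "A must-have.",
]
cleaned = [clean_quote(s) for s in samples]
print(cleaned)
```

Stripping from the ends only leaves any quote marks inside the text alone, which matters for quotes that themselves contain quoted words.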

Anyhow, now I have a Perl Data::Dumper file, a tab-delimited text file, and a JSON representation. I will probably upload the text file to Many Eyes for kicks.
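Going from the tab-delimited file to JSON is mostly mechanical. A rough sketch in Python, assuming the text file has a header row; the column names and the sample row are placeholders, and the top-level "items" list follows the Exhibit data format as I understand it:

```python
import csv
import io
import json


def tsv_to_json(tsv_text):
    """Convert tab-delimited text with a header row into a JSON string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    items = [dict(row) for row in reader]
    # Wrap the records in an "items" list, Exhibit-style.
    return json.dumps({"items": items}, indent=2)


# Placeholder data; the real columns come from the scraped article.
sample = "label\tpublisher\trank\nExample Game\tExample Publisher\t1\n"
print(tsv_to_json(sample))
```

Every value comes out as a string here, so numeric columns such as sales figures would need an explicit conversion before sorting or charting.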


IBM-SLRP Release

Posted in development, semantic web, technology at 4:44 pm by wingerz


Our group at IBM Cambridge has open-sourced our Semantic Web projects. Check it out: IBM Semantic Layered Research Platform (with documentation and downloads).

Boca, our enterprise-ready RDF store, is the only component that has been officially released; others are still works in various stages of development. Matt’s post covers Boca’s most important features.

Our group has set up PlanetAdtech for Semantic Web-related blog posts (including blogs belonging to colleagues who are not on PlanetRDF). Subscribe to the feed to track our work; we’ll be releasing more components over the next few weeks.


Sorry, Bloglines

Posted in development, technology at 12:39 am by wingerz


I’ve fully switched over from Bloglines to Google Reader. And I feel horrible about it. Now in addition to knowing about my email, Google knows what I read online. Fortunately, I’ve switched over to using Yahoo Search so they don’t know everything about my digital self. Bloglines is a nice app, and I feel bad for leaving. In the end, it came down to a few features:

  • Marking read items: Google Reader marks items as you read them. If you open up a feed with many many items and don’t want to get through them in one sitting, this is awesome. In Bloglines, once you open up a feed, all of the unread items get marked as read.
  • Lots of items: Bloglines limits the number of new entries per feed to 200. In practice I run into this rarely, but it’s still a lame restriction.
  • Viewing past entries: Google Reader lets you scroll through already-read entries, while Bloglines makes you pick a time period (like “within the last 24 hours”) for retrieving entries.

I still hate the fact that Google Reader won’t let you assign a tag to a blog when you subscribe to it through Firefox 2’s button in the address bar.

One feature that I like about both is the ability to share entries in your own feed. Google Reader lets you do it with one click (“Share”) while Bloglines opens up another window to let you write some text to go along with a marked item. Google Reader should let you do this too because sometimes it’s not clear whether you are sharing something because you agree with it or because you think it’s somewhat outrageous. In any case, I’ve set up my shared items to display in the sidebar of my blog using SimplePie, a PHP library (which comes with a WordPress plugin) for parsing feeds.

When I exported my OPML file, it included the Bloglines news feed, so I can keep an eye on new features that might bring me back.


Making John Corwin Proud

Posted in development, technology at 12:14 am by wingerz


Here’s some Python I wrote tonight for filtering our entries to generate feeds for Pundit Monitor. I know it’s basic stuff in the FP world, but it makes me happy since I’ve been programming in Java at work for the past few years. Elegance, how I’ve missed you.

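from functools import reduce  # built in on Python 2; this import is needed on Python 3

# Build a predicate that is true when an entry matches every key/value in f.
def by(f):
  return lambda x: reduce (
    lambda a, b: a and b,
    map(lambda k: x[k] == f[k], f.keys())
  )

# Criteria for California House races.
cahouse = {
  "state" : "California",
  "racetype" : "house"
}
entries = filter (by(cahouse), allentries)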
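For anyone who wants to try the idea, here is a self-contained variant with a few made-up entries. It is a sketch rather than the code that runs on Pundit Monitor: it swaps the reduce/map pair for the built-in `all()`, and uses `dict.get` so an entry that lacks a key simply fails to match instead of raising `KeyError`.

```python
def by(criteria):
    """Build a predicate that is true when an entry matches every key in criteria."""
    return lambda entry: all(entry.get(k) == v for k, v in criteria.items())


# Made-up entries, just to exercise the filter.
allentries = [
    {"state": "California", "racetype": "house", "title": "post A"},
    {"state": "California", "racetype": "senate", "title": "post B"},
    {"state": "Ohio", "racetype": "house", "title": "post C"},
]

cahouse = {"state": "California", "racetype": "house"}
entries = list(filter(by(cahouse), allentries))
print([e["title"] for e in entries])
```

One difference from the reduce version: `all()` over an empty criteria dict is true, so `by({})` matches every entry rather than failing on an empty sequence.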


Political blog tracker

Posted in development, technology, web at 4:26 pm by wingerz


Elias was telling me about his final project for a distributed systems class that he’s currently taking – he set up a political blog crawler. Check it out. And digg it!

The crawler is based on Nutch and Hadoop. It finds entries from thousands of blogs about candidates in the upcoming elections. He suckered me into writing some Python to transform the output into nicely-formatted HTML (with some help from Alister). Contrary to Elias’s blog post, I’m not wildly into politics, but was more interested in playing with the data and learning Python. Feeds for states and individual races should hopefully be up by tomorrow morning.


Recently Dugg on the Sidebar

Posted in amusing, blog, development, technology at 6:45 pm by wingerz


Digg is not the best source of news, but it certainly is amusing. Lots of fun YouTube videos and an occasional interesting blog post or article. Comments are usually good for a laugh too. Users submit links and other users digg them up or bury them. For the most part I only monitor the popular stories.

I threw together a quick-and-dirty WordPress plugin to show the last 10 things that I dugg in the sidebar. You can do it by sourcing some JavaScript from digg (it just does document.write()s – no JSON). Rather than rely on that, I curl it every few minutes and source the local copy instead.
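The keep-a-local-copy idea can be sketched in Python as well. This is not the plugin's code: it checks the file's age on demand instead of running on a timer, the five-minute default is an assumption, and `fake_fetch` stands in for the real download so the example runs without touching the network.

```python
import os
import tempfile
import time


def refresh_if_stale(path, fetch, max_age_seconds=300):
    """Rewrite the file at path with fetch() output if it is missing or too old.

    Returns True when the local copy was refreshed.
    """
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        age = None  # no local copy yet
    if age is not None and age < max_age_seconds:
        return False
    content = fetch()
    # Write to a temp file first so readers never see a half-written copy.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(content)
    os.replace(tmp_path, path)
    return True


def fake_fetch():
    # Stand-in for the real download (the post used curl).
    return "document.write('placeholder');"


cache_path = os.path.join(tempfile.mkdtemp(), "dugg.js")
refresh_if_stale(cache_path, fake_fetch)  # first call writes the file
refresh_if_stale(cache_path, fake_fetch)  # second call finds a fresh copy
```

Serving the local copy means the sidebar keeps working when the remote script is slow or unavailable, at the cost of being a few minutes behind.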

A quick scan of the current headlines makes me quite happy.


Blog Reorg

Posted in blog, development, technology at 7:09 pm by wingerz


One of the weird things about programming is that you can spend a good amount of time doing something like refactoring or reorganizing your code and you feel a sense of accomplishment afterwards, even if no one notices.

I’ve heard that it’s good to have some area of focus for your blog. Unfortunately, there are a variety of topics that I enjoy posting about. Rather than set up two blogs, I’ve decided to separate the content into four main areas:

  • Technology: Work, coding, techy stuff.
  • Food: Eating in and dining out, hopefully with a lot of drool-inducing photos.
  • Games: Video games and other diversions, mostly Nintendo fanboyism.
  • Personal: Life events and observations, along with wacky stuff that comes my way.

The few lines of code that I had to write/update dealt with feed generation – I just wanted to be able to pass a category name to the function that creates feed links. I also had to modify one SQL statement so that it would fetch comments on posts in a specified category rather than all comments on all posts. WordPress’s comment feed code was RSS only, so I decided to use Feedburner to expose the feed in other formats.
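The SQL change amounts to joining comments against the post-to-category mapping. A minimal sketch using Python's sqlite3 module with a simplified, hypothetical schema; the table and column names here are invented for illustration and are not WordPress's real ones:

```python
import sqlite3

# Simplified, hypothetical schema; WordPress's real tables differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post_categories (post_id INTEGER, category TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'Blog Reorg'), (2, 'Dinner');
    INSERT INTO post_categories VALUES (1, 'technology'), (2, 'food');
    INSERT INTO comments VALUES (1, 1, 'Nice cleanup'), (2, 2, 'Looks tasty');
""")


def comments_in_category(conn, category):
    """Return comment bodies for posts filed under the given category."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.id, c.body
        FROM comments AS c
        JOIN post_categories AS pc ON pc.post_id = c.post_id
        WHERE pc.category = ?
        ORDER BY c.id
        """,
        (category,),
    )
    return [body for _id, body in rows]


print(comments_in_category(conn, "technology"))
```

Passing the category as a bound parameter (the `?`) keeps the category name out of the SQL text, which matters once the name comes from a URL.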

I’m still playing around with the category hierarchy. I’d also like to make it more obvious that there are four areas, probably by applying a different theme to each one.

In the end, I’m not sure how much of a difference this will make because I’m surrounded by engineers who love to play video games and cook.
