Wednesday, July 11, 2012

Describing Datasets with schema.org

Earlier this year, we received a proposal for a 'Datasets' addition to schema.org, via the Web Schemas group at W3C. Informal conversations with various potential publishers and consumers suggest that this work has great potential, and we would like to invite interested parties to take a detailed look at the proposal and identify any implementation issues or possible improvements. It is a small but useful vocabulary, and the wiki overview also includes a summary of its relationship to related initiatives from the Linked Data community. More details and demos are also available from RPI.

As with all such proposals, it is listed in our Wiki area and we encourage comments and discussion on the mailing list. As schema.org grows into more specialist areas, we are aware that we can't expect everyone to join one big mailing list. This Datasets vocabulary is especially relevant to the community around open government and public-sector data, so we want to take care to solicit comments from potential publishers. As always, comments are welcomed in the public W3C-hosted mailing list and Wiki, via blog discussions, or if you prefer, by direct email to the team.

This topic is particularly exciting due to the huge number of datasets that have been made public in recent years. While each dataset may ultimately be expressed in detailed, domain-specific form (e.g. using specific scientific or statistical schemas), the Datasets proposal focuses on the high-level common characteristics that are shared across thousands of otherwise diverse datasets.

So what are the next steps with this vocabulary? We would like to hear from publishers of such datasets, to confirm what we've been hearing anecdotally: that such an extension to schema.org would be useful, used, and a good fit to the available metadata.

As always with schema.org, the hard work is in building and demonstrating rough consensus around a design. This week's post on the Data.gov site from Chris Musialek is an important step in that direction, and we welcome comments from others that will help us move things forward. From Chris's post:

We've been watching the datasets schema space for a while now, as Data.gov is very interested in adding support for our listing of over 450,000 datasets. We think this will help the major search engines create better relevance rankings of Federal government data, where many searches begin.
We wanted to come out publicly saying that we've reviewed the current datasets schema proposal in draft, and we are comfortable with the current state of things. There is definitely work still left to do, but there seems to be pretty solid agreement on everything but the details, which seem very resolvable. At this point, if the group would solidify on the dataset proposal, then Data.gov would support and use it.

Many thanks to Chris for opening the conversation about this work. If you have feedback on any aspect of the Datasets proposal, do please share your experience...