Wednesday, July 11, 2012

Describing Datasets with schema.org

Earlier this year, we received a proposal for a 'Datasets' addition to schema.org, via the Web Schemas group at W3C. Informal conversations with potential publishers and consumers suggest that this work has great potential, and we would like to invite interested parties to take a detailed look at the proposal and identify any implementation issues or possible improvements. It is a small but useful vocabulary, and the wiki overview also includes a summary of its relationship to related initiatives from the Linked Data community. More details and demos are also available from RPI.

As with all such proposals, it is listed in our Wiki area and we encourage comments and discussion on the public-vocabs@w3.org mailing list. As schema.org grows into more specialist areas, we are aware that we can't expect everyone to join one big mailing list. The Datasets vocabulary is especially relevant to the community around open government and public-sector data, so we want to take care to solicit comments from potential publishers. As always, comments are welcomed on the public W3C-hosted mailing list and Wiki, via blog discussions, or if you prefer, by direct email to the schema.org team.

This topic is particularly exciting because of the huge number of datasets that have been made public in recent years. While each dataset may ultimately be expressed in detailed, domain-specific form (e.g. using specific scientific or statistical schemas), the Datasets proposal focuses on the high-level common characteristics that are shared across thousands of otherwise diverse datasets.
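
To make "high-level common characteristics" a little more concrete, here is a minimal sketch in Python of what such a description might look like. It rests on several assumptions: the property names (name, description, keywords, distribution, contentUrl, encodingFormat) follow the Dataset and DataDownload types as they appear on schema.org and may not match the draft proposal exactly, the JSON serialization is just one convenient way to show the structure, and the helper function, dataset name, and URL are hypothetical placeholders.

```python
import json


def describe_dataset(name, description, keywords, download_url, encoding_format):
    """Build a high-level, schema.org-style description of a dataset.

    Hypothetical helper for illustration only. It records catalog-level
    metadata and says nothing about the dataset's internal fields.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": keywords,
        # One downloadable form of the dataset (assumed property names).
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": encoding_format,
        },
    }


# Placeholder values, not taken from any real catalog.
record = describe_dataset(
    name="Example monthly rainfall totals",
    description="Illustrative dataset of monthly rainfall by station.",
    keywords=["rainfall", "weather", "example"],
    download_url="http://example.org/data/rainfall.csv",
    encoding_format="text/csv",
)

print(json.dumps(record, indent=2))
```

The same handful of properties would apply equally to a statistical table or a scientific data file, which is the sense in which the vocabulary stays above any domain-specific schema.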

So what are the next steps with this vocabulary? We would like to hear from publishers of such datasets, to confirm what we've been hearing anecdotally: that such an extension to schema.org would be useful, would be used, and would be a good fit for the available metadata.

As always with schema.org, the hard work is in building and demonstrating rough consensus around a design. This week's post on the data.gov site from Chris Musialek is an important step in that direction, and we welcome comments from others that will help us move things forward. From Chris's post:

We've been watching the schema.org datasets schema space for a while now, as Data.gov is very interested in adding schema.org support for our listing of over 450,000 datasets. We think this will help the major search engines create better relevance rankings of Federal government data, where many searches begin.
We wanted to come out publicly saying that we've reviewed the current datasets schema proposal in draft, and we are comfortable with the current state of things. There is definitely work still left to do, but there seems to be pretty solid agreement on everything but the details, which seem very resolvable. At this point, if the group would solidify on the dataset proposal, then Data.gov would support and use it.

Many thanks to Chris for opening the conversation about this work. If you have feedback on any aspect of the Datasets proposal, do please share your experience...

16 comments:

  1. I am surprised that DCAT and schema.org have not tackled the inclusion of a data dictionary that would allow a decision or action on a found dataset. As things currently stand, only keywords or concepts appear to be supported.

    In the data publishing world, where structured, mostly tabular data files are being made available, the name, meaning, data type, and units of measure of individual fields in the data set are of utmost importance to end-use. Knowing a concept or uncontrolled keyword may get a user to the data, but extension to solicit the publication of the data fields (aka columns, properties, values) will enable end-users to assess and apply the data based on explicit rather than generalized syntactic and semantic clues.

    The FGDC metadata standard and now the draft ISO 19115 standard support the capture of field information, known variously as feature catalog properties, or entities and attributes. These are the elements that to me are the core of any vocabulary. They are the realization of 'concepts' and may have a number of associations. I'd like to see progress in the data cataloguing arena towards capturing and exploiting these properties.

    Replies
    1. Thanks for the suggestion. For the sake of simplicity, I think it's important to keep the initial schema.org extension focused on high-level metadata about a dataset, rather than delving into the domain model or schema of the dataset. However, further extensions are possible. DCAT does contain a dataDictionary property which might be adapted to point to a fine-grained description of the dataset's vocabulary. The first step would be to define an ontology at that (DCAT's) level, of which a subset could be included in the extension. VoID is an example of an OWL ontology which deals with dataset vocabularies.

  2. Wanted to add to this discussion on Learning Resources type: http://www.w3.org/wiki/WebSchemas/LearningResources

    We aggregate scholarships for students and would like this to be considered within the Learning Resources item type.

    Further details would be amount, deadline, and application requirements.

  3. I work for the National Center for Atmospheric Research in Boulder, Colorado. Earlier this summer, at my recommendation, we began implementing schema.org in order to increase the discoverability of our datasets. The current vocabulary was rather minimal for our needs, but it was a start. I am VERY much looking forward to being a part of the Datasets addition that this blog post describes.

  12. Can somebody help me with the following:
    We sell books and journals. Books have an ISBN number and that is listed on the following page: http://schema.org/Book. Journals have an ISSN. This is not on the page. Where can I request to add this tag?

    Thanks, Jeroen

  13. Can you tell me how to add schema to a Blogger blog? What will my results be?

  14. Do you have plans to make SlideShare an itemprop so that a particular slideshow would show up next to the search results on Google the same way an author picture or video shows up?

