Public Datasets in the Cloud

  • Rosalyn Metz, Wheaton College, metz_rosalyn@wheatoncollege.edu
  • Michael B. Klein, Oregon State University, Michael.Klein@oregonstate.edu

Code4Lib 2010 - Tuesday, February 23, 2010 - 11:20-11:40

When most people think about cloud computing (if they think about it at all), it usually takes one of two forms: Infrastructure Services, such as Amazon EC2 and GoGrid, which provide raw, elastic computing capacity in the form of virtual servers, and Platform Services, such as Google App Engine and Heroku, which provide preconfigured application stacks and specialized deployment tools. Several providers, however, also offer access to large public datasets that would be impractical for most organizations to download and work with locally. From a 67-gigabyte dump of DBpedia’s structured information store to the 180-gigabyte snapshot of astronomical data from the Sloan Digital Sky Survey, and from chemistry and biology to economic and geographic data, these datasets are available instantly and backed by enough pay-as-you-go server capacity to make good use of them. We will present an overview of currently available datasets, explain what it takes to create and use snapshots of the data, and explore how the library community might push some of its own large stores of data and metadata into the cloud.
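
For context, these public datasets are typically published as shared storage snapshots (on AWS, EBS snapshots) that you turn into a volume and attach to your own virtual server rather than download. The sketch below illustrates that workflow in Python with boto3; the region, snapshot ID, instance ID, and device name are placeholders rather than details from the talk, and boto3 itself postdates the presentation, so treat this as an illustrative sketch of the approach, not the speakers' code.

    import boto3  # assumption: a modern boto3 sketch, not the tooling used in 2010

    # Connect to EC2 in the region where the public dataset snapshot lives.
    # The region, snapshot ID, and instance ID below are placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create an EBS volume from the dataset's public snapshot. Nothing is
    # downloaded; the data stays in the cloud next to your compute capacity.
    volume = ec2.create_volume(
        SnapshotId="snap-0123456789abcdef0",  # e.g. a public dataset snapshot ID
        AvailabilityZone="us-east-1a",        # must match the target instance's zone
    )
    vol_id = volume["VolumeId"]

    # Wait until the volume is ready, then attach it to a running instance.
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
    ec2.attach_volume(
        VolumeId=vol_id,
        InstanceId="i-0123456789abcdef0",
        Device="/dev/sdf",
    )

    # On the instance itself, the dataset can then be mounted and used like
    # any local filesystem, e.g.:
    #   sudo mkdir /data && sudo mount /dev/xvdf /data

Once attached, the dataset behaves like an ordinary local disk, which is what makes working with multi-hundred-gigabyte collections practical on pay-as-you-go servers.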

Slides in PowerPoint (1.32 MB)