I'm a data scientist who has been working with big data and predictive analytics since 1988. I'm a faculty member at the University of Chicago, the Director of the Open Commons Consortium, and the Founder and a Partner of Analytic Strategy Partners.

What is a Data Commons and How You Can Build One? (Mol Tri Conf 2018)

I have given several talks over the past several months explaining what a data commons is and how you can build one, including a talk at the Molecular Tri Conference in San Francisco on February 12, 2018 and the CI4CC Conference in Hawaii on April 3, 2018.

Commons is a term used in economics for resources that are held in common (and not owned privately) and that a group or community manages for individual and collective benefit.  A good example is a pasture shared by a village and open to all of the village's cows.  The tragedy of the commons occurs when individuals try to maximize their own share of a resource, which can deplete the resource and harm others who rely on it.  In 1968, Garrett Hardin wrote a thoughtful and provocative article about this in the context of the population problem, and it is still worth reading.

I think of a data commons as a data platform that co-locates data with cloud computing infrastructure and commonly used software services, tools & applications for managing, analyzing and sharing data to create an interoperable resource for the research community.  This is the definition in the article PMID 29033693, which also describes some of the requirements for a data commons.

The NCI Genomic Data Commons (GDC) is a good example of a data commons.  It is a system of systems, which include a data portal, a data submission system, a data harmonization system, and an API so that third parties can build an ecosystem of applications and services over the harmonized data in the GDC.  The GDC contains over 3 PB of harmonized data (that is, data processed by a common set of bioinformatics pipelines) and is used by over 100,000 unique researchers each year.
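To give a feel for what building on the GDC API can look like, here is a minimal sketch in Python that lists a few projects from the public `projects` endpoint. The helper names `projects_url` and `fetch_projects` are my own for illustration and are not part of any GDC client library, and the choice of fields is just an assumption about what a first query might ask for.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GDC_API = "https://api.gdc.cancer.gov"  # public GDC API base URL


def projects_url(size=5, fields=("project_id", "name", "primary_site")):
    """Build a URL that lists projects from the GDC `projects` endpoint.

    `projects_url` is a hypothetical helper, not a GDC-provided function.
    """
    query = urlencode({"size": size, "fields": ",".join(fields), "format": "json"})
    return f"{GDC_API}/projects?{query}"


def fetch_projects(size=5):
    """Fetch project records as a list of dicts (requires network access)."""
    with urlopen(projects_url(size)) as response:
        payload = json.load(response)
    # The JSON response nests the matching records under data -> hits.
    return payload["data"]["hits"]
```

Calling `fetch_projects(3)` should return up to three project records. A third-party application would typically start from a query like this and then move on to the other endpoints for cases and files.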

There is no consensus today on how to build a data commons, and these talks describe our experience developing and operating data commons over the past 10 years.  Our focus today is on how best to interoperate data commons, a topic I'll be addressing in other blog posts.

Sometimes when developing systems, it is helpful to have a point of view.  I like to think of a database as organizing data for a project or department, a data warehouse as organizing data for an organization or virtual organization, and a data commons as organizing data for a discipline or field.

References

Grossman, Robert L., Allison Heath, Mark Murphy, Maria Patterson, and Walt Wells. A Case for Data Commons: Toward Data Science as a Service. Computing in Science & Engineering 18, no. 5 (2016): 10-20. PMID 29033693.

Hardin, Garrett. The Tragedy of the Commons. Science 162, no. 3859 (1968): 1243-1248.

Ostrom, Elinor. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, 2015.

The Genomic Data Commons (GDC) Two Year Anniversary