Summer 2012 Week 3: Big Data and Long Tails: Addressing the Cyber-Infrastructure Challenges for Research on a Budget

July 28, 2012 to August 4, 2012

Canyons Resort, 4000 Canyons Resort Drive, Park City, Utah

Organizing Committee

  • Christine Borgman (University of California Los Angeles)
  • Ian Foster (Argonne National Laboratory/University of Chicago)
  • Bryan Heidorn (University of Arizona)
  • Bill Howe (University of Washington)
  • Carl Kesselman (University of Southern California/Information Sciences Institute)

Scientific Overview

Decade-long big-science projects such as the Human Genome Project, the Large Hadron Collider, the LIGO gravitational wave observatory, and the earth observation system have created datasets of unprecedented size that seem likely to revolutionize entire fields of science. Due to advances in sensors, computation and storage, the cost and effort required to produce datasets of comparable scale can, in principle at least, be significantly reduced. As a result, we are seeing a proliferation of large-scale datasets, assembled in dozens of different fields spanning the physical sciences and engineering, medicine, and the social sciences. The scientific opportunities inherent in this “big data” revolution are enormous. But given finite resources, we now face the challenge of exploiting these opportunities at a per-project budget dramatically lower than that of the big science projects that pioneered advanced cyberinfrastructure and big data methods. Equally challenging is the need to promulgate new big data methods to communities that lack the expertise and resources possessed by big science projects. The “long tail” of science has data challenges even greater than those of “big science.” The many small projects and collaborations in the long tail produce ever larger volumes of data, yet lack the large shared instruments, data repositories, community standards for data structures and metadata, and critical mass of data management expertise.

Fortunately, concurrent with these trends, there have also been significant advances in the commercial computing environment: the acquisition and analysis of extremely large-scale datasets, driven by e-commerce, social networking and the Web, have become commonplace, as have commodity tools and infrastructure for such work. Fifteen years ago, Jim Gray significantly altered the scientific infrastructure landscape by asserting, and then proving, that relational database technology, the workhorse of traditional enterprise systems, had significant unrecognized value to the scientific community. Subsequently, the use of relational databases has become prevalent in research infrastructure, often in lieu of large, special-purpose software development activities. We now appear to be at a similar inflection point: the technologies of search and Internet commerce, such as Hadoop-enabled scalable data servers, large-scale data analytics, software-as-a-service browser-based applications hosted on commodity clouds, and the semantic web, have the potential to significantly alter the way data is captured, analyzed, and shared in scientific investigations.

Much as Jim Gray did, we are at a point where it will be beneficial to assess the impact of these new technologies and to understand the big data and long-tail requirements of a range of scientific communities, with the goal of understanding how these common tools and infrastructure apply to scientific data processes and, in the process, putting big data in the hands of a broader community of scientists. Many potential issues may get in the way. Data volumes are larger, by orders of magnitude in many cases. Budgets are often smaller. Uses are more idiosyncratic. Small research teams will have limited information technology and computer science expertise. These factors all make the long tail problem in many ways more difficult than the issues facing big science projects.

Our goals for this workshop are to characterize, and where possible quantify, the needs of diverse scientific communities for “big data” technologies; to explore existing and new methods for meeting those needs in ways that can scale to large numbers of people (whether working alone, in small teams, or in larger aggregations) and to large, diverse, distributed data; and to identify foundational elements of a big data/long tail ecosystem that may accelerate progress towards meeting those needs. In addition to considering technologies, we will examine structural barriers to the effective use of big data, such as data sharing habits and skills gaps, and means of overcoming those barriers. The output from the workshop will be a position paper that identifies the major challenges and makes recommendations as to how they might be addressed.

The meeting will be organized as a series of “mini-workshops” that will focus on topics spanning specific communities of use, technology approaches, and social, structural and organizational issues. We are recruiting a set of topic experts to lead these “mini-workshops.”

