Across the Universe: Cosmology Data Management Workshop Draws Stellar Crowd


ESnet’s Eli Dart (left), Salman Habib (center) of Argonne National Lab and Joel Brownstein of the University of Utah compare ideas during a workshop break.

ESnet and Internet2 hosted last week’s CrossConnects Workshop on “Improving Data Mobility & Management for International Cosmology,” a two-day meeting that ESnet Director Greg Bell described as the best one yet in the series. More than 50 members of the cosmology and networking research communities turned out for the event, held at Lawrence Berkeley National Laboratory, while another 75 watched the live stream.

The Feb. 10-11 workshop provided a forum for discussing the growing data challenges associated with ever-larger cosmological and observational data sets, which are already reaching the petabyte scale. Speakers noted that network bandwidth is no longer the bottleneck into the major data centers, but storage capacity and performance from the network to storage remain challenges. In addition, network connectivity to telescope facilities is often limited and expensive because the facilities are in remote locations. Science collaborations use a variety of techniques to manage these issues, but improved connectivity to telescope sites would have a significant scientific benefit in many cases.

In his opening keynote talk, Peter Nugent of Berkeley Lab’s Computational Research Division said that astrophysics is transforming from a data-starved to a data-swamped discipline. Today, when searching for supernovae, one object in the database consists of thousands of images, each 32 MB in size. That data needs to be processed and studied quickly so when an object of interest is found, telescopes around the world can begin tracking it in less than 24 hours, which is critical as the supernovae are at their most visible for just a few weeks. Specialized pipelines have been developed to handle this flow of images to and from NERSC.

Salman Habib of Argonne National Laboratory’s High Energy Physics and Mathematics and Computer Science divisions opened the second day of the workshop, which focused on cosmology simulations and workflows. Habib leads DOE’s Computation-Driven Discovery for the Dark Universe project. He pointed out that large-scale simulations are critical for understanding observational data and that the size and scale of simulation datasets far exceed those of observational data. “To be able to observe accurately, we need to create accurate simulations,” he said. Simulations will soon create 100-petabyte sets of raw data, and the limiting factor for handling these will be the amount of available storage, so smaller “snapshots” of the datasets will need to be created. And while one person can run the simulation itself, analyzing the resulting data will involve the whole community.

Reijo Keskitalo of Berkeley Lab’s Computational Cosmology Center described how computational support for the Planck Telescope has relied on HPC to generate the largest and most complete simulation maps of the cosmic microwave background, or CMB. In 2006, the project was the first to run on all 6,000 CPUs of Seaborg, NERSC’s IBM flagship at the time. It took six hours on the machine to produce one map. Now, running on 32,000 CPUs on Edison, the project can generate 10,000 maps in just one hour.

Mike Norman, head of the San Diego Supercomputer Center, cautioned that high performance computing can become distorted by “chasing the almighty FLOP,” a reference to floating-point operations per second, the standard measure of supercomputer speed. “We need to focus on science outcomes, not TOP500 scores.”

Over the course of the workshop, ESnet Director Greg Bell noted that observation and simulation are no longer separate scientific endeavors.

The workshop drew a stellar group of participants. In addition to the leading lights mentioned above, attendees included Larry Smarr, founder of NCSA and current leader of the California Institute for Telecommunications and Information Technology, a $400 million academic research institution jointly run by the University of California, San Diego and UC Irvine; and Ian Foster, who leads the Computation Institute at the University of Chicago and is a senior scientist at Argonne National Lab. Foster is also recognized as one of the inventors of grid computing.

The next step for the workshop organizers is to publish a report and identify areas for further study and collaboration. Looming over them will be the words of Steven T. Myers of the National Radio Astronomy Observatory, who described the data challenges coming with the Square Kilometer Array radio telescope and then concluded: “The future is now. And the data is scary. Be afraid. But resistance is futile.”