ESnet’s Petascale DTN Project Speeds up Data Transfers between Leading HPC Centers


Operations staff monitor the network in the ESnet/NERSC control room. (Photo by Marilyn Chung, Berkeley Lab)

The Department of Energy’s (DOE) Office of Science operates three of the world’s leading supercomputing centers, where massive data sets are routinely imported, analyzed, used to create simulations and exported to other sites. Fortunately, DOE also runs a networking facility, ESnet (short for Energy Sciences Network), the world’s fastest network for science, which is managed by Lawrence Berkeley National Laboratory.

Over the past two years, ESnet engineers have been working with staff at DOE labs to fine-tune the specially configured systems, called data transfer nodes (DTNs), that move data in and out of the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory and the leadership computing facilities at Argonne National Laboratory in Illinois and Oak Ridge National Laboratory in Tennessee. All three of the computing centers and ESnet are DOE Office of Science User Facilities used by thousands of researchers across the country.

The collaboration, named the Petascale DTN project, also includes the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, a leading center funded by the National Science Foundation (NSF). Together, the partners aim to achieve regular disk-to-disk, end-to-end transfer rates of one petabyte per week between the major facilities, which translates to a sustained throughput of about 15 Gbps on real-world science data sets.
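
For context, a quick back-of-the-envelope calculation (shown below as a simple illustrative snippet, not project tooling) makes clear why a petabyte per week works out to a raw rate of roughly 13 Gbps, and why the practical target is set closer to 15 Gbps once real-world overheads are taken into account.

# Back-of-the-envelope check: what sustained rate does one petabyte per week imply?
petabyte_bits = 1e15 * 8           # 1 PB expressed in bits
seconds_per_week = 7 * 24 * 3600   # 604,800 seconds

raw_gbps = petabyte_bits / seconds_per_week / 1e9
print(f"Raw requirement: {raw_gbps:.1f} Gbps")   # prints ~13.2 Gbps

# Real transfers never run flat-out: protocol overhead, many small files,
# checksums and restarts all cut into the average, so the practical
# disk-to-disk target is set somewhat higher, at roughly 15 Gbps.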

Performance data from March 2016 showing transfer rates between facilities. (Image credit: Eli Dart, ESnet)

Research projects in fields such as cosmology and climate science have very large (multi-petabyte) datasets, and scientists typically compute at multiple HPC centers, moving data between facilities to take full advantage of the computing and storage allocations available at different sites.

Since data transfers traverse multiple networks, the slowest link determines the overall speed. Tuning the data transfer nodes and the border router where a center’s internal network connects to ESnet can smooth out virtual speedbumps. Because transfers over the wide area network have high latency between sender and receiver, getting the highest speed requires careful configuration of all the devices along the data path, not just the core network.
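
One way to see why end-to-end tuning matters is the bandwidth-delay product: at continental round-trip times, a sender must keep a large amount of data in flight to keep the path full, which dictates how TCP buffers on the DTNs must be sized. The short Python sketch below uses illustrative numbers (a hypothetical 10 Gbps flow and 80 ms round-trip time, not a measured ESnet path) to show the scale involved.

# Illustrative bandwidth-delay product (BDP) calculation. The BDP is the
# amount of data that must be "in flight" between sender and receiver to
# keep a long, fast path full; TCP buffers smaller than the BDP cap the
# achievable throughput no matter how fast the links are.

link_gbps = 10   # hypothetical per-transfer target rate
rtt_ms = 80      # hypothetical coast-to-coast round-trip time

bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_ms / 1e3)
print(f"BDP: {bdp_bytes / 1e6:.0f} MB of data in flight")   # ~100 MB

# A host with, say, 16 MB TCP buffers on this path would be limited to
# roughly 16 MB / 0.08 s = 1.6 Gbps regardless of link speed -- one reason
# a single mistuned device anywhere on the path drags the whole transfer down.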

In the past few weeks, the project has shown sustained data transfers at well over the target rate of one petabyte per week. The number of sites with this base capability is also expanding, with Brookhaven National Laboratory in New York now testing its transfer capabilities with encouraging results. Future plans include bringing the NSF-funded San Diego Supercomputer Center and other big data sites into the mix.

“This increase in data transfer capability benefits projects across the DOE mission science portfolio,” said Eli Dart, an ESnet network engineer and leader of the project. “HPC facilities are central to many collaborations, and they are becoming more important to more scientists as data rates and volumes increase. The ability to move data in and out of HPC facilities at scale is critical to the success of an ever-growing set of projects.”

When it comes to moving data, there are many factors to consider, including the number of transfer nodes and their speeds, their utilization, the file systems connected to these transfer nodes on both sides, and the network path between them, according to Daniel Pelfrey, a high performance computing network administrator at the Oak Ridge Leadership Computing Facility.

The actual improvements being made range from updating software on the DTNs to changing the configuration of existing DTNs to adding new nodes at the centers.

Performance measurements from November 2017 at the end of the Petascale DTN project. All of the sites met or exceeded project goals. (Image Credit: Eli Dart, ESnet)

“Transfer node operating systems and applications need to be configured to allow for WAN transfer,” Pelfrey said. “The connection is only going to be as fast as the slowest point in the path allows. A heavily utilized server, or a misconfigured server, or a heavily utilized network, or heavily utilized file system can degrade the transfer and make it take much longer.”

At NERSC, the DTN project resulted in adding eight more nodes, tripling the number, in order to achieve enough internal bandwidth to meet the project’s goals. “It’s a fairly complicated thing to do,” said Damian Hazen, head of NERSC’s Storage Systems Group. “It involves adding infrastructure and tuning as we connected our border routers to internal routers to the switches connected to the DTNs. Then we needed to install the software, get rid of some bugs and tune the entire system for optimal performance.”

The work spanned two months and involved NERSC’s Storage Systems, Networking, and Data and Analytics Services groups, as well as ESnet, all working together, Hazen said.

At the Argonne Leadership Computing Facility, the DTNs were already in place, and with minor tuning, transfer speeds were increased to the 15 Gbps target.

“One of our users, Katrin Heitmann, had a ton of cosmology data to move and she saw a tremendous benefit from the project,” said Bill Allcock, who was director of operations at the ALCF during the project. “The project improved the overall end-to-end transfer rates, which is especially important for our users who are either moving their data to a community archive outside the center or are using data archived elsewhere and need to pull it in to compute with it at the ALCF.”

As a result of the Petascale DTN project, the OLCF now has 28 transfer nodes in production on 40-Gigabit Ethernet. The nodes are deployed under a new model—a diskless boot—which makes it easy for OLCF staff to move resources around, reallocating as needed to respond to users’ needs.

“The Petascale DTN project basically helped us increase the ‘horsepower under the hood’ of network services we provide and make them more resilient,” said Jason Anderson, an HPC UNIX/storage systems administrator at OLCF. “For example, we recently moved 12TB of science data from OLCF to NCSA in less than 30 minutes. That’s fast!”

Anderson recalled that a user at the May 2017 OLCF user meeting said that she was very pleased with how quickly and easily she was able to move her data to take advantage of the breadth of the Department of Energy’s computing resources.

“When the initiative started we were in the process of implementing a Science DMZ and upgrading our network,” Pelfrey said. “At the time, we could move a petabyte internally in 6-18 hours, but moving a petabyte externally would have taken just a bit over a week. With our latest upgrades, we have the ability to move a petabyte externally in about 48 hours.”

The fourth site in the project is the NSF-funded NCSA in Illinois, where senior network engineer Matt Kollross said it’s important for NCSA, the only non-DOE participant, to collaborate with the DOE HPC sites to develop common practices and speed up adoption of new technologies.

“The participation in this project helped confirm that the design and investments in network and storage that we made when building Blue Waters five years ago were solid investments and will help in the design of future systems here and at other centers,” Kollross said. “It’s important that real-world benchmarks which test many aspects of an HPC system, such as storage, file systems and networking, be considered in evaluating overall performance of an HPC compute system and help set reasonable expectations for scientists and researchers.”

Origins of the project

The project grew out of a Cross-Connects Workshop on “Improving Data Mobility & Management for International Cosmology,” held at Berkeley Lab in February 2015 and co-sponsored by ESnet and Internet2.

Salman Habib, who leads the Computational Cosmology Group at Argonne National Laboratory, gave a talk at the workshop, noting that large-scale simulations are critical for understanding observational data and that the size and scale of simulation datasets far exceed those of observational data. “To be able to observe accurately, we need to create accurate simulations,” he said.

During the workshop, Habib and other attendees spoke about the need to routinely move these large data sets between computing centers and agreed that it would be important to be able to move at least a petabyte a week. As the Argonne lead for DOE’s High Energy Physics Center for Computational Excellence project, Habib had been working with ESnet and other labs on data transfer issues.

To get the project moving, Katrin Heitmann, who works in cosmology at Argonne, created a data package of small and medium files totaling about 4.4 terabytes. The data would then be used to test the network links between the leadership computing facilities at Argonne and Oak Ridge, NERSC and NCSA.

“The idea was to use the data as a test, to send it over and over and over between the centers,” Habib said. “We wanted to establish a performance baseline, then see if we could improve the performance by eliminating any choke points.”

Habib admitted that moving a petabyte in a week would use only a fraction of ESnet’s total bandwidth, but the goal was to automate the transfers using Globus Online, a primary tool researchers use to rapidly share data and reach remote computing facilities over high-performance networks like ESnet.
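
In practice, such an automated transfer looks something like the following sketch, which uses the Globus Python SDK (globus_sdk). The endpoint IDs, paths and access token are placeholders, and the sketch is meant only to outline the workflow under those assumptions, not to reproduce the project’s actual scripts.

import globus_sdk

# Hypothetical endpoint UUIDs and paths -- placeholders, not real values.
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"

# A transfer client authenticated with an access token obtained through
# the usual Globus Auth flow (the token string here is a placeholder).
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER-ACCESS-TOKEN")
)

# Describe the transfer: checksum-based syncing lets the same data
# package be re-sent over and over while only re-copying what changed.
tdata = globus_sdk.TransferData(
    tc,
    SRC_ENDPOINT,
    DST_ENDPOINT,
    label="Petascale DTN test data package",
    sync_level="checksum",
)
tdata.add_item(
    "/path/to/test_package/",   # source directory (placeholder)
    "/path/to/destination/",    # destination directory (placeholder)
    recursive=True,
)

# Submit the transfer; the Globus service manages retries and integrity checks.
task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])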

“For our research, it’s very important that we have the ability to transfer large amounts of data,” Habib said. “For example, we may run a simulation at one of the large DOE computing centers, but often where we run the simulation is not where we want to do the analysis. Each center has different capabilities and we have various accounts at the centers, so the data gets moved around to take advantage of this. It happens all the time.”

Although the project’s roots are in cosmology, the Petascale DTN project will help all DOE scientists who have a need to transfer data to, from, or between the DOE computing facilities to take advantage of rapidly advancing data analytics techniques. In addition, the increase in data transfer capability at the HPC facilities will improve the performance of data portals, such as the Research Data Archive at the National Center for Atmospheric Research, that use Globus to transfer data from their storage systems.

“As scientists deal with the data deluge and more research disciplines depend on high-performance computing, data movement between computing centers needs to be a no-brainer for scientists so they can take advantage of the compute cycles at all DOE Office of Science user facilities and the extreme heterogeneity of systems in the future,” said ESnet Director Inder Monga.

This work was supported by the HEP Center for Computational Excellence. ESnet is funded by DOE’s Office of Science.

 

ESnet Renews, Upgrades Transatlantic Network Connections


Three years after ESnet first deployed its own transatlantic networking connections, the links are now being upgraded to four 100 gigabits-per-second circuits. These links give researchers at America’s national laboratories and universities ultra-fast access to scientific data from the Large Hadron Collider (LHC) and other research sites in Europe.

The original configuration that went into service in December 2014 consisted of three 100 Gbps links and one 40 Gbps link. Since then, the LHC traffic carried by ESnet alone has grown more than 1,600 percent, from 1.7 petabytes per month in January 2015 to nearly 30 petabytes per month in August 2017.

The four new connections link peering points in New York City and London, Boston and Amsterdam, New York and London, and Washington, D.C. and CERN in Switzerland. The contracts are with three different telecom carriers.

“Our initial approach was to build in redundancy in terms of both infrastructure and vendors and the past three years proved the validity of that idea,” said ESnet Director Inder Monga. “So, we stuck with those design principles while upgrading the fourth link to 100G.”

The new agreements accomplished several overall goals:

  • Increased overall capacity to meet projected demand
  • Reduced the overall cost
  • Increased the diversity of the cable systems providing ESnet circuits, and
  • Maintained as much R&E network community transatlantic cable diversity as possible, including that of the Advanced North Atlantic Collaboration.

Another new component is a collaboration with Indiana University, funded by the National Science Foundation through its Networks for European, American and African Research (NEAAR) award within the International Research Network Connections (IRNC) program. The goal of NEAAR is to make science data from Africa, such as observations from the Square Kilometre Array, and from Europe, such as data from CERN’s Large Hadron Collider, available to a broader research community.

With the upgrade, the total transatlantic capacity for research and education networks is now 800 Gbps, continuing the close collaboration between the seven partners providing transatlantic connectivity under the broader umbrella of the Global Network Architecture Initiative (GNA).

ESnet’s Science DMZ Design Could Help Transfer, Protect Medical Research Data


As medicine becomes more data-intensive, Medical Science DMZ eyed as secure solution

Like other sciences, medical research is generating increasingly large datasets as doctors track health trends, the spread of diseases, genetic causes of illness and the like. Effectively using this data for efforts ranging from stopping the spread of deadly viruses to creating precision medicine treatments for individuals will be greatly accelerated by the secure sharing of the data, while also protecting individual privacy.

In a paper published Friday, Oct. 6, by the Journal of the American Medical Informatics Association, a group of researchers led by Sean Peisert of the Department of Energy’s (DOE) Lawrence Berkeley National Laboratory (Berkeley Lab) wrote that the Science DMZ architecture developed for moving large data sets quickly and securely could be adapted to meet the needs of the medical research community.

“You can’t just take the medical data from one site and drop it straight in to another site because of the policy constraints on that data,” said Eli Dart, a network engineer at the Department of Energy’s Energy Sciences Network (ESnet) who is a co-author of the paper. “But as members of a society, our health could benefit if the medical science community can become more productive in terms of accessing relevant data.”

Read the full story.

Medical Science DMZ
Schematic showing components of the Medical Science DMZ.

ESnet Congratulates the LIGO Visionaries on their 2017 Nobel Prize in Physics


ESnet congratulates Barry Barish and Kip Thorne of Caltech and Rainer Weiss of MIT on receiving the 2017 Nobel Prize in Physics for their vision and leadership of the LIGO Laboratory. Their discovery of gravitational waves, made just two years ago, culminates decades of effort. ESnet is proud to have played a role in supporting this achievement.

LIGO’s Hanford facility in Washington was an early adopter of OSCARS, ESnet’s On-Demand Secure Circuits and Advance Reservation System for guaranteed bandwidth services, beginning in 2005 while the service was still in early development. In fact, the project was one of the very first users of OSCARS.

Last year, ESnet upgraded the Hanford LIGO site’s network connection to Seattle with a dedicated 10 Gbps link, which complemented a shared 10 Gbps link to Boise. The Hanford site consistently moves about 400 megabits of data per second to Caltech in Southern California.

ESnet provides 10 Gbps connectivity to the Hanford LIGO Observatory in southeast Washington, linking the site to Caltech and the international research community.

You can see the real-time data transfer rates and other details of this connection on the MyESnet portal.

Lastly, we are working with Caltech to improve end-to-end bandwidth at the campus as part of the ASCR-funded SENSE (SDN for End-to-end Networked Science at the Exascale) project. By improving scientific workflows and end-site driven intelligent services to increase data throughput, the project will help LIGO use high-throughput data transfer methods.

Again, congratulations to our LIGO colleagues and we look forward to continuing to support your research mission.

Read how Berkeley Lab’s distributed computing experts developed software to help LIGO manage the distribution of data from the experiment.

For a great explanation of the LIGO project, read this NASA Jet Propulsion Laboratory blog.

ESnet, Internet2 Renew Critical Exchange Point Contract with NYSERNet



The Department of Energy’s Energy Sciences Network (ESnet) and Internet2 — two of the nation’s leading research and education networks — today announced the renewal of an agreement to remain anchor tenants at one of the world’s most critical Internet exchange points operated by the New York State Education and Research Network (NYSERNet).

Located at 32 Avenue of the Americas in Manhattan, NYSERNet’s “32AofA” global network exchange is a well-known international hub where the world’s leading research and education networks connect to content, data and telecom providers to seamlessly exchange traffic among their networks.

“One network alone cannot connect every scientist in the lab, every student in a classroom or every researcher in the field. By creating a rich interconnected fabric of networks, we are able to bring together the best ideas, minds and scientific resources no matter where in the world they may be,” said Inder Monga, director of ESnet and the Scientific Networking Division of Lawrence Berkeley National Laboratory. “This is how discovery in the era of big data will take place. We appreciate our continuing partnership with NYSERNet and Internet2 to provide this truly critical network connection.”

Read the full story.

ESnet’s Mariam Kiran Earns DOE Early Career Award


Mariam Kiran

Mariam Kiran, a research scientist in the Energy Sciences Network’s (ESnet’s) Advanced Network Technologies Group, has received a 2017 Early Career Research Program award from the Department of Energy’s (DOE’s) Office of Science. Now in its eighth year, the program supports exceptional researchers during critical stages of their formative work by funding their research for five years.

Kiran will use her award to advance the state of the art in network research. She will employ methods from machine learning and parallel computing to optimize network traffic and path allocation.

“Networking is an interesting field utilizing multiple hardware and software skills. For example, configuring links involves understanding current network topologies as well as anticipating traffic demands and user requirements. ESnet has already been at the forefront of networking research with advanced monitoring tools and network expertise,” Kiran said. “However, as networks grow and become more complex, we have to find new methods to automate some or all of the current network tasks. These include anticipating problems in advance and automating the ‘fixes’ to maintain a healthy network environment.”

Mariam Kiran’s demo on “InDI: Intent-based User-defined Service Deployment over Multi-Domain SDN applications” was one of the more popular sessions in the DOE booth at SC16.

Read the full story.

Patrick Dorn to Lead ESnet’s Network Engineering Group



Patrick Dorn, a network engineer who joined ESnet in 2011, has been named the new leader of ESnet’s Network Engineering Group. He has held the job in an acting capacity since last September.

During his time with ESnet, Dorn spent a year at CERN in Switzerland, working to establish high-speed links between CERN and the U.S. research community.

Before joining ESnet, Dorn was a senior network engineer at the National Center for Supercomputing Applications in Urbana-Champaign, Illinois. At NCSA he held both technical and management roles.

While at NCSA, Dorn served as the SCinet chair for the SC08 conference. SCinet, the high-speed network that provides wired and wireless connectivity for the conference’s thousands of attendees, is entirely volunteer-driven and takes more than a year to plan and then deploy.

SLAC, AIC and Zettar Move Petabyte Datasets at Unprecedented Speed via ESnet


Twice a year, ESnet staff meet with managers and researchers associated with each of the DOE Office of Science program offices to look toward the future of networking requirements and then take the planning steps to keep networking capabilities out in front of those demands.

Network engineers and researchers at DOE national labs take a similar forward-looking approach. Earlier this year, DOE’s SLAC National Accelerator Laboratory (SLAC) teamed up with AIC and Zettar and tapped into ESnet’s 100G backbone network to repeatedly transfer 1-petabyte data sets in 1.4 days over a 5,000-mile portion of ESnet’s production network. Even with the transfer bandwidth capped at 80 Gbps, the milestone demo achieved transfer rates five times faster than other technologies, and the demo data accounted for a third of all ESnet traffic during the tests. Les Cottrell of SLAC presented the results at the ESnet Site Coordinators Committee (ESCC) meeting held at Lawrence Berkeley National Laboratory in May 2017.


The test ran over a 5,000-mile loop that goes from SLAC in Menlo Park, Calif., across the country to Atlanta and then back to SLAC. The data transfers are part of an effort to prepare for the amounts of data expected from experiments at SLAC’s planned Linac Coherent Light Source II (LCLS-II).

“Collaborations like this provide the networking community with an opportunity to use a production network for testing new technologies and seeing how they perform in a real-world scenario,” said ESnet Director Inder Monga. “At the same time, ESnet also gets to learn about leading-edge products as part of our future planning process.”

Read the AIC/Zettar news release.

Read the story in insideHPC.

ESnet’s John Paul Jones Caps 33-Year Career at National Labs


John Jones in his signature blue beret.

Soon after John Paul Jones moved from Idaho to California in 1983, he and his wife visited the Berkeley Hat Company, where he bought a royal blue beret. Since then, during his 33+ years at Lawrence Livermore and Lawrence Berkeley national labs, the flat blue hat has become part of Jones’ persona.

But when he retires from ESnet at the end of June 2017, Jones said he may also think about hanging up that hat. Around the house, he said, he usually wears his blue and gold Golden State Warriors cap.

In 1995, the Department of Energy made the decision to move ESnet and NERSC from Livermore to Lawrence Berkeley National Laboratory. Jones knew people who were part of the ESnet team at Livermore and it piqued his interest when ESnet’s then-manager Jim Leighton called him in to talk about joining the group.

“He unrolled this big network map and showed it to me,” Jones recalled. “I said, ‘What!? Oh yeah – I am definitely in!’”

When ESnet made the move in 1996, Jones joined the group that configured, installed, maintained and did troubleshooting on the routers that powered the national network.

Read more about JP’s life and career.

How Brian Tierney’s “Aha moment” Turned into a 28-year Career at Berkeley Lab, ESnet


Brian Tierney

As he prepares to retire this month after more than 28 years at Lawrence Berkeley National Laboratory, Brian Tierney, head of ESnet’s Advanced Network Technologies Group, still remembers the exact moment when he knew where his career path would lead.

“I met Bill Johnston at San Francisco State and on the very first day of his Computer Graphics class, he told us, ‘Anybody who gets an A in my class gets an internship in my group,’” Tierney recalled. “A light bulb went off and I knew I was going to get an A. I literally thought, ‘That’s what I might do for the next 30 years.’”

He started in Johnston’s Graphics Group as a graduate student assistant in 1988, and a year later became a career staff member.

Among the key projects Tierney has contributed to are perfSONAR, the network performance toolkit, and fasterdata.es.net, a collection of tips and tools for, well, faster data transfers.

An iconic photo of Brian circa 1998

Read more about Brian’s career and retirement plans.