comparing the evolution of galaxies simulated on supercomputers with actual observations of the Milky Way from the Sloan Digital Sky Survey (SDSS) to learn about processes by which our own galaxy might have formed. For example, she is currently comparing the distribution of iron and oxygen in stars at the outer edges, or stellar halo, of simulated galaxies with the observed iron content of stars in the Milky Way.
“We think that the iron content of these stars and how it varies can tell us something about how the halo was formed, and how many small galaxies [the Milky Way] consumed to build itself up in the process,” she says.
The simulations produce hundreds of gigabytes to terabytes of data, which then have to be compared to similar amounts from the SDSS.
Loebman worked with data management experts from the UW computer science department. She became skilled in data science as she finished her PhD in astronomy at the UW. It’s that multi-disciplinary expertise, she says, that helped her land a post-doctoral position at the University of Michigan.
Loebman exemplifies the new kind of scientist that the Moore/Sloan grant aims to help universities train in a more deliberate way. She spent the summer working with senior leaders in a range of scientific fields at the participating universities to help structure the program. They discussed common data-management approaches across disciplines, including which tools can and can’t be shared between, for example, biologists and astronomers.
“It links all these different departments together in the tools we use,” she says.
While Loebman has found a career path that values her multi-disciplinary approach, that is a relatively new development, even in astronomy, which is a leader in embracing data-driven science. It has to be, as the datasets astronomers work with are growing in size and complexity as fast as any:
The Sloan Digital Sky Survey—which began in 2000 and is funded by the Sloan Foundation, among others—produces about 200 gigabytes of raw data each night, adding up to roughly 300 terabytes total over the course of the survey. The Large Synoptic Survey Telescope, expected to begin operating in the early 2020s, would capture that much raw data over the course of a couple of weeks. Billed as the world’s largest digital camera, capable of taking “a color movie of the universe,” the LSST over its 10-year run could gather as much as 75 petabytes of raw data.
Recently, Loebman says, astronomy has started to look beyond scientific journal publications as the only measure of success.
“There is a need to expand what is valued,” she says.
Increasingly, astronomers might be recognized for a blog post that shares a data-management approach or the release of code to GitHub. The field, she says, is “now valuing the development of computational tools that help people analyze their data.”
Lazowska says universities need to take other steps that create attractive career paths for professional data scientists, who also have lucrative opportunities in the private sector. He compares this to the professional engineers at the UW’s Applied Physics Laboratory, who build instruments, undersea vehicles, software, and tools to support oceanographers and other researchers.
“They could get jobs anywhere,” Lazowska says. “We need the same thing in data science… We want to be in a position to get people to apply their data science skills to discovery as opposed to clicking on ads.”
The Moore/Sloan funding addresses this directly, providing money to pay for new positions. Competition for people with these skills is fierce. According to a 2011 McKinsey Global Institute report on big data: “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
Joe Hellerstein, a veteran of IBM Research, Microsoft, and Google, where he helped lead the “Google for Science” effort, will become a data science fellow at the UW beginning next year. He will continue working half-time at Google.
Though he isn’t a chemist by training, and hadn’t had a chemistry course in decades, Hellerstein has been teaching a biochemistry seminar for computer scientists at UW. In preparing the course, he was struck by the similarities between biological molecules—with their thousands of atoms, complex structures, and relationships—and software. Specifically, he says, the Unified Modeling Language, a software language for modeling object-oriented systems, is a potentially useful way to describe biochemical structures and transformation pathways.
That kind of insight could help create