Earn a degree in the field of data science these days and your ticket is punched: Google, Amazon, Facebook, leading-edge academic research, a well-funded startup—they’re all clamoring for people proficient in the tools and techniques needed to sift through today’s endless streams of digital data in search of something valuable.
Social service organizations and local governments are confronting the data deluge, too, often without the capacity to pay the salaries that profit-driven companies can offer these sought-after experts.
Enter the University of Washington’s just-concluded Data Science for Social Good summer internship. The program set interdisciplinary student teams, guided by professional data scientists and subject-matter experts, to work on thorny, real-world urban problems including family homelessness, paratransit bus service, community well-being, and sidewalk mapping for accessible route planning.
During their final presentations last week, four student teams showed off tools they built over the summer that should provide lasting value to the organizations whose data they worked with, and the community at large. In sharing their process, the teams also highlighted the challenges inherent in drawing insight from big data.
One team, working with the Bill & Melinda Gates Foundation and Building Changes, sought to parse data from King, Pierce, and Snohomish counties on family homelessness. The nonprofits and the counties are in the midst of a multi-year initiative aimed at making family homelessness rare, brief, and one-time.
Each county uses a federally mandated system to track family homelessness, but the counties differ in how they enter data, count what constitutes a family, and define an episode of homelessness. This presented the DSSG team with a classic data-wrangling problem as it looked for factors that lead to families successfully moving out of homelessness programs and into permanent housing.
“We spent the bulk of the summer trying to find ways to process the data into an analyzable format,” said Joan Wang, one of the DSSG interns, during her team’s presentation.
They used clustering algorithms to better define and identify individual households within the anonymous county data. They reviewed literature and consulted with county experts to create a uniform definition of a single episode of homelessness. A family might enroll in multiple programs that overlap—such as emergency shelter followed by rapid re-housing—and would show up as multiple entries in the tracking system. By aggregating these events into a single episode, the data better matched the reality of a family’s experience.
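To make the aggregation step concrete, here is a minimal Python sketch of how overlapping program enrollments for one household could be folded into a single episode. It is not the team's code: the tuple layout, the `merge_episodes` function, the `max_gap_days` tolerance, and the sample dates are all assumptions made up for illustration, and the uniform definition the team actually settled on is not described here.

```python
from datetime import date, timedelta


def merge_episodes(enrollments, max_gap_days=0):
    """Collapse one household's program enrollments into episodes.

    enrollments: list of (program, start_date, end_date) tuples.
    max_gap_days: hypothetical tolerance; an enrollment that starts within
        this many days of the previous episode's end joins that episode.
    """
    episodes = []
    # Process enrollments in order of start date.
    for program, start, end in sorted(enrollments, key=lambda e: e[1]):
        gap = timedelta(days=max_gap_days)
        if episodes and start <= episodes[-1]["end"] + gap:
            # Overlaps (or nearly touches) the current episode: extend it.
            current = episodes[-1]
            current["end"] = max(current["end"], end)
            current["programs"].append(program)
        else:
            # Otherwise a new episode begins.
            episodes.append({"start": start, "end": end, "programs": [program]})
    return episodes


# Made-up records for a single anonymous household.
sample = [
    ("emergency shelter", date(2015, 1, 5), date(2015, 2, 20)),
    ("rapid re-housing", date(2015, 2, 10), date(2015, 6, 30)),
    ("emergency shelter", date(2016, 3, 1), date(2016, 3, 15)),
]
episodes = merge_episodes(sample)
for episode in episodes:
    print(episode["start"], episode["end"], episode["programs"])
```

In this invented example, the shelter stay and the overlapping rapid re-housing enrollment become one episode, while the stay a year later counts as a second one.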
In the end, the team fed the processed data into an interactive diagram that illustrates the flow of families through the system, visualizing the individual programs that contributed to successful exits from homelessness. (It’s a Sankey diagram, commonly used to chart the flow of energy through an economy. You can check it out here.)
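For readers curious about what sits behind such a diagram: a Sankey chart is drawn from a list of links, each with a source, a target, and a weight. The Python sketch below shows one plausible way to count those links from per-family program sequences. The `flow_counts` function, the program names, and the sample paths are hypothetical and are not taken from the team's data or code.

```python
from collections import Counter


def flow_counts(paths):
    """Count how many families move from each stage to the next.

    paths: list of lists, each an ordered sequence of stages for one family.
    Returns a Counter keyed by (source, target) pairs; the counts can serve
    as link weights in a Sankey diagram.
    """
    links = Counter()
    for path in paths:
        # Pair each stage with the one that follows it.
        for source, target in zip(path, path[1:]):
            links[(source, target)] += 1
    return links


# Made-up sequences for three families.
paths = [
    ["emergency shelter", "rapid re-housing", "permanent housing"],
    ["emergency shelter", "permanent housing"],
    ["emergency shelter", "rapid re-housing", "unknown exit"],
]
links = flow_counts(paths)
for (source, target), count in sorted(links.items()):
    print(f"{source} -> {target}: {count}")
```

Each counted pair becomes one band in the chart, with its width proportional to the number of families that made that move.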
“Generally, nonprofits don’t have the capacity to do anything more complicated than a regression analysis, so the machine learning and decision trees (which are used by the for-profit sector) were leaps and bounds more advanced than what we’re used to seeing and provided a huge benefit to the counties,” said Anne Martens of the Gates Foundation via e-mail. “The project allowed the counties to look at the data in new ways, which has already influenced their decision-making process.”
Data Science in Transit
Another team delved into data from King County Metro’s Paratransit service, which provides on-demand, door-to-door transportation for people whose disabilities prevent them from using regularly scheduled bus service. Fares for paratransit service, which is mandated by the Americans with Disabilities Act, cannot be more than double the fares on regular buses, but the service can cost 10 times as much to provide. The program draws on the same pool of money that funds regular bus service, so reducing the cost of operating paratransit benefits all transit riders in King County.
The team (pictured at top) sought to help King County Metro better predict