Setem Technologies Gets $1.7M to Decode the “Cocktail Party Problem”

What does Woody Guthrie have in common with next-generation voice-recognition technology that could improve your smartphone or hearing aid? Plenty, if a small Massachusetts company has anything to say about it.

I’m talking about Setem Technologies, an angel-funded startup based outside of Boston. Quite a bit outside—the offices are in an unassuming building off Route 95, a spot that CEO Anthony Cirurgiao laughingly calls “the middle of nowhere, Massachusetts.”

But don’t let the location fool you. After quietly building its technology and its intellectual property footprint, Setem recently added $1.7 million from a family investment office and several angel investors, on top of the roughly $1.4 million it raised last year.

And the startup is now starting to make some noise about what it’s been working on.

Setem specializes in a technology called blind-source signal separation. In plain English, that means being able to pull a desired stream of sound—in this case, individual voices—out of the soup of noises and cross-talk happening all around in the real world. (The startup’s name is inspired by the ancient Egyptian deity associated with hearing.)

Setem says this is “the cocktail party problem,” straining to zero in on one speaker above a din of chatter and clinking dishes.

Humans can do that pretty well, provided their hearing is up to par. Computers and microphones still struggle, though—a problem you may have noticed while trying to dictate voice directions to a smartphone while the car radio is playing.

That’s partly because the technology for separating targeted voices from other speakers and the rest of the real-world noise has remained essentially as it was in the early 2000s, Cirurgiao says.

Companies working on voice recognition hit a wall with what their voice-collection methods could do, and instead turned to computational approaches to better decipher what the speaker was trying to say.

“It’s a problem that people pretty much gave up on,” he says.

Not everyone gave up, though. In Durham, NH, University of New Hampshire professor Kevin Short was working away on mathematical models that could be used to pull speech from background noise with much greater clarity.

That’s where Woody Guthrie comes into play.

Short was part of a team that helped save the only known recording of the folk legend performing live. The concert—which includes songs and several spoken asides during the performance—was recorded on wire, a long-ago abandoned method.

Short’s work helped turn the soupy-sounding recording into a crisp version that was released as an album. The work earned Short and the rest of the team a Grammy.

The techniques used in the Guthrie project were an early version of the technology that has turned into Setem’s product offering (Short is a co-founder). The company is now showing off its software to other businesses, hoping to attract customers in healthcare, mobile devices, and software applications. One big key, Setem says, is that its software-based approach won’t add processors or microphones to the bundle of electronics that smartphone makers already have in their products.

Setem’s rise comes at a time when voice interactions are starting to really matter for big technology companies and others. The introduction of Apple’s Siri voice-controlled virtual assistant has energized the voice and speech sector, driving interest in Boston-area companies—particularly Nuance—and talent, especially from MIT.

Setem works by mathematically identifying all of the various strains of sound that are mixed together in a bit of everyday conversation in a crowded room, or out on the street. The startup says that allows its software to show users a much finer-grained level of detail, akin to using a high-definition camera rather than a Polaroid.

On a computer screen, the difference is pretty dramatic. Typical software today might give a sound engineer a broad set of wave-looking images of different intensity, similar in appearance to heat signatures from an infrared camera.

To zoom in on the desired strand, you’re usually left with blunt-instrument approaches like chopping off parts of the track that don’t contain what you’re looking for. That can clip bits of the voice you’re trying to get, too, Cirurgiao says—and isn’t a guarantee that all of the background noise is gone.

Setem says its software goes way beyond that level of detail, making the output on a screen look more like individual threads of sound that can be easily isolated.

“Kevin found a way to look at signals in this super-resolution that enables you to look at signals at anywhere up to five orders of magnitude better resolution than the best tools today,” Cirurgiao says.

By getting to that fine level of detail, individual sounds can be turned into discrete streams of information that are separated, manipulated, and put back together.

So instead of cutting background noise or other speakers out of the equation, you can effectively turn up the volume on desired sounds (the person you want to hear speaking), and turn down the volume of the rest (a guy skateboarding past you, or tourists arguing on the street corner).
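
To make the "turn up, turn down" idea concrete, here is a minimal Python sketch of just the final remixing step. It is not Setem's method, which the company has not detailed here: it assumes the hard part, the separation, has already happened, and it stands in three synthetic tones for the separated streams. The function name `remix`, the gain values, and the test signals are all hypothetical.

```python
import numpy as np

def remix(streams, gains):
    """Scale each already-separated stream by its gain, then sum into one signal."""
    streams = np.asarray(streams, dtype=float)  # shape: (n_streams, n_samples)
    gains = np.asarray(gains, dtype=float)      # shape: (n_streams,)
    return (gains[:, None] * streams).sum(axis=0)

# Hypothetical stand-ins for separated sources: one second of audio at 8 kHz.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
voice = np.sin(2 * np.pi * 220 * t)             # the speaker you want to hear
skateboard = 0.5 * np.sin(2 * np.pi * 35 * t)   # low rumble rolling past
tourists = 0.5 * np.sin(2 * np.pi * 600 * t)    # chatter on the corner

# Boost the voice, attenuate everything else (assumed gain values).
mix = remix([voice, skateboard, tourists], [1.5, 0.1, 0.1])
```

Because nothing is cut out, the background sources are still present in `mix`, only quieter, which is the contrast the article draws with chopping parts of the track away.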

In fact, Setem uses those exact scenarios to demonstrate its product. They’ve spent some quality time out in the noisiest parts of the Boston area, including the tourist and traffic crush along Boylston Street on a busy day.

You kind of have to hear it to get the real idea, but Cirurgiao wouldn’t let me embed a sample here (the demo versions are older, he said, and the product has improved since then). Still, the demos were pretty impressive: it sounded like you’d turned down the volume on the rest of the city, zeroing in on one person’s voice.

Setem is still in its very early stages. Cirurgiao tempers the usual startup CEO’s optimism with a bit of caution, but insists that Setem is already getting some notice from the industry—not hard to imagine with the importance of voice commands in the future of computing.

“We haven’t trumpeted it, but some of the big guys out there we’ve already been working with,” he says. “We’ve actually had the CTOs of a couple of public companies trek all the way to our office in the middle of nowhere, Massachusetts.”

Author: Curt Woodward

Curt covered technology and innovation in the Boston area for Xconomy. He previously worked in Xconomy’s Seattle bureau and continued some coverage of Seattle-area tech companies, including Amazon and Microsoft. Curt joined Xconomy in February 2011 after nearly nine years with The Associated Press, the world's largest news organization. He worked in three states and covered a wide variety of beats for the AP, including business, law, politics, government, and general mayhem. A native Washingtonian, Curt earned a bachelor's degree in journalism from Western Washington University in Bellingham, WA. As a past president of the state's Capitol Correspondents Association, he led efforts to expand statehouse press credentialing to online news outlets for the first time.