Giving voice to the voiceless: VocaliD’s exciting next chapter

This post was originally published by Milton Posner for Khoury College of Computer Sciences.

The ‘a-ha’ moment was someone else’s conversation.

At an assistive technology conference, Rupal Patel spotted a little girl and a grown man chatting in identical voices that were not their own. Both were speech impaired, and like the hundreds of other attendees, they were speaking through assistive devices that offered only a few generic, computerized voices.

The precise timbre and tone of our voices reflect our bodies, our upbringings, our thinking — in short, our identities. For millions of speech-impaired people worldwide, that expression of identity is gone … unless someone builds it for them.

In the years since that conference, Patel and her collaborators have worked tirelessly to harness whatever residual vocal abilities a person has left, combine them with audio supplied by voice volunteers, and produce a customizable voice that actually sounds like the person using it.

“This was use-inspired research. People couldn’t speak, and as a clinician, I wanted them to have a voice that sounded like them and not like an ATM,” Patel said. “That’s seeing a problem and saying, ‘technology can help us fix this.’”

After seven years in the lab and another eight as the company VocaliD, the operation was recently acquired by AI platform Veritone. The acquisition begins a promising new chapter for a research effort that has maintained the same ethos for almost two decades.

“Professor Patel’s research demonstrates how computer science improves the quality of life for people across a wide spectrum, in this case individuals dealing with speech impairments,” said Khoury College Dean Elizabeth Mynatt. “By making this technology more widely available, the voices of many are now more authentically heard.”

Speaking of which …

When the ‘a-ha’ moment arrived in 2002, Patel was uniquely suited to seize it. She had studied speech and speech disorders since her doctoral days at the University of Toronto in the late ’90s, and was a year away from joining Northeastern University. She was awarded a National Science Foundation grant in 2007 to work on personalized synthetic voices for speech-impaired people, work she quickly began in collaboration with Khoury College student Michael Everett. Here, her interdisciplinary appointment between the Khoury College of Computer Sciences and the Bouvé College of Health Sciences proved essential. The interdisciplinary approach is woven into the fabric of each college, reflected in their academic programs, faculty appointments, and research labs.

“We’ve always had students from both fields interacting together,” Patel noted. “This project doesn’t happen if you’re just in the silo of health science or just in the silo of AI. We were bringing together students from different places to help solve big problems.”

Rupal Patel, joint professor in the Department of Speech-Language Pathology and Audiology at Bouvé College and the Khoury College of Computer Sciences, photographed in Boston on November 5, 2019. Photo by Ruby Wallau/Northeastern University

And the problems were indeed big. While the idea of crafting synthetic voices had existed for generations, it had always required a deep vault of audio recordings to use as building blocks.

“What was different about VocaliD was doing that without lots of audio,” Patel said. “Someone with a speech impairment can’t give you much audio, and it’s not good or clear audio. We figured out how to take whatever the person could still do with their voice so it would sound somewhat like them, and then find a voice donor similar to them in age and accent and other things, who would record a larger dataset. Then we’d blend those voices together.”

This sort of voice mixing was novel and workable, but incredibly resource intensive. The VocaliD team used concatenative synthesis, a process of clipping and splicing together snippets of recorded audio, which required countless lab hours and hundreds of thousands of dollars — all to produce one voice.
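For readers curious what "clipping and pasting" means in practice, the core idea of concatenative synthesis can be sketched in a few lines: store short recorded units (such as per-phoneme clips) from a speaker, then splice them together in sequence for a target utterance. The snippet below is a toy illustration with invented phoneme names and fake sample data, not VocaliD's actual system; real systems select among many candidate units per sound and smooth the joins.

```python
# Toy sketch of concatenative synthesis. A "voice" is a database of short
# recorded units (here, fake per-phoneme sample lists); synthesis looks up
# and concatenates the units for a target phoneme sequence. All names and
# data are invented for illustration.

def synthesize(phonemes, unit_db):
    """Concatenate stored audio units for each phoneme in sequence."""
    samples = []
    for ph in phonemes:
        if ph not in unit_db:
            raise KeyError(f"no recorded unit for phoneme {ph!r}")
        samples.extend(unit_db[ph])  # splice this unit's samples onto the end
    return samples

# A miniature unit database: each phoneme maps to a list of audio samples.
unit_db = {
    "HH": [0.1, 0.2],
    "AY": [0.3, 0.4, 0.5],
}

audio = synthesize(["HH", "AY"], unit_db)  # roughly, the word "hi"
print(len(audio))  # 5 samples total
```

The resource cost Patel describes follows from this design: every sound the voice might ever need must exist as a clean recording in the database, which is exactly what a speech-impaired donor cannot supply on their own.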

“In the last seven or eight years, massive advances in machine learning have revitalized the whole speech synthesis field,” Patel said. “Now you can make a synthetic voice with less data that sounds far more natural, and that’s controllable and configurable. It’s a categorically different, completely new technique.”

Any boost in efficiency was welcome for the team, which had aimed to balance company profitability and accessibility since VocaliD’s founding in 2014. The original mission was purely social; users paid for the voices, albeit with plenty of subsidy from National Science Foundation and National Institutes of Health grants. Then, about two years ago, VocaliD began its purely commercial voice talent enterprise, in which voice actors record a few hours of content, VocaliD builds and licenses the AI voices, and both parties profit. During the recent acquisition talks, Patel ensured that commercial considerations wouldn’t crowd out the social mission.

“We always wanted to make diverse, inclusive voices for people who needed it the most, people who aren’t heard,” Patel said. “That’s the essence of who we are; it’s why we’ve won awards and been a market leader. We’re not just compromising the use purpose for scale. It was really important not to sell the company to just anybody, but to find the right fit.”

Target acquired

The right fit, as it turned out, was Veritone, which offers a truckload of award-winning, AI-powered services across a wide range of industries, from energy to finance to retail.

“What was acquired was the entire company, and the entire ethos of the company,” Patel said. “It was really important, both for me and for them, that we didn’t abandon the social piece. Veritone does other social good projects with AI, so it fits their ethos as well.”

The two companies first collaborated to create a voice model of legendary news anchor Walter Cronkite for use in educational programs. The project helped Veritone understand what VocaliD could do, and later, what VocaliD could provide for Veritone: greater control over the voice creation lifecycle, and thus better efficiency, scale, and voice offerings. Veritone was impressed, and brought the eight-year-old company on board.

“I’m excited to see that VocaliD now has the home, capabilities, and research to reach a much larger platform,” Patel said, “both for the social piece and the commercial piece without which the social mission could never survive.”

VocaliD showcases its tech at the ISAAC Conference 2016 in Toronto.

In her new role as vice president of voice and accessibility, Patel can work to balance the commercial with the social. She’ll be joined on Veritone’s Commercial Enterprise team by the rest of the VocaliD crew, including recent Khoury College graduate Teo Boley. Boley’s role will primarily focus on integrating PARROT STUDiO — the VocaliD AI-generated audio creation platform for which he was lead developer — into Veritone’s suite.

The acquisition hinges on the continuity of both VocaliD’s projects and its principles. Among those principles is one Patel emphasizes in part because so many venture capitalists and angel investors turned her away when she pitched it years ago: universal design.

“If you create something that helps a small group of people with a special need, usually it will have something we can all benefit from,” Patel said. “I think VocaliD is a perfect example of that. If we focus only on problems for the mass majority, we leave behind a bunch of people.”

“I think people don’t give enough importance to the fact that if you can help those in real need, we can change the wider world,” she added. “I think that’s the story and the spirit of Khoury and Bouvé coming together.”