How a New AI Translated Brain Activity to Speech With 97 Percent Accuracy

By Edd Gent

The idea of a machine that can decode your thoughts might sound creepy, but for thousands of people who have lost the ability to speak due to disease or disability, it could be game-changing. Even for the able-bodied, being able to type out an email just by thinking, or to send commands to a digital assistant telepathically, could be hugely useful.

That vision may have come a step closer after researchers at the University of California, San Francisco demonstrated that they could translate brain signals into complete sentences with error rates as low as three percent, which is below the threshold for professional speech transcription.

While we’ve been able to decode parts of speech from brain signals for around a decade, so far most of the solutions have been a long way from consistently translating intelligible sentences. Last year, researchers used a novel approach that achieved some of the best results so far by using brain signals to animate a simulated vocal tract, but only 70 percent of the words were intelligible.

The key to the improved performance achieved by the authors of the new paper in Nature Neuroscience was their realization that translating brain signals to text has strong parallels with machine translation between languages using neural networks, which is now highly accurate for many languages.

While most efforts to decode brain signals have focused on identifying neural activity that corresponds to particular phonemes—the distinct chunks of sound that make up words—the researchers decided to mimic machine translation, where entire sentences are translated at once. This has proven a powerful approach: because certain words are more likely to appear near one another, the system can rely on context to fill in any gaps.

The team used the same encoder-decoder approach commonly used for machine translation, in which one neural network analyzes the input—normally text, but in this case brain signals—to create a representation of the data, and a second neural network translates this into the target language.
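
For readers who want a concrete picture of that encoder-decoder setup, here is a minimal sketch in PyTorch. It is illustrative only: the electrode count, hidden-layer size, and 250-word vocabulary are placeholder assumptions, not the architecture reported in the paper.

```python
# Minimal encoder-decoder sketch (illustrative; sizes and names are assumptions).
import torch
import torch.nn as nn

class BrainToTextSeq2Seq(nn.Module):
    def __init__(self, n_electrodes=256, hidden=400, vocab_size=250):
        super().__init__()
        # Encoder: compresses a window of neural activity into a summary state.
        self.encoder = nn.GRU(input_size=n_electrodes, hidden_size=hidden,
                              batch_first=True)
        # Decoder: emits one word token at a time, conditioned on that state.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(input_size=hidden, hidden_size=hidden,
                              batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, neural, prev_tokens):
        # neural: (batch, time, electrodes); prev_tokens: (batch, words)
        _, state = self.encoder(neural)            # summarize the recording
        dec_in = self.embed(prev_tokens)           # teacher-forced word inputs
        dec_out, _ = self.decoder(dec_in, state)   # condition on the neural summary
        return self.out(dec_out)                   # logits over the vocabulary

model = BrainToTextSeq2Seq()
logits = model(torch.randn(2, 100, 256), torch.randint(0, 250, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 250])
```

In the real system the encoder consumes recorded neural activity rather than text, but the overall training recipe is the familiar sequence-to-sequence one.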

They trained their system on brain activity recorded from four women who had electrodes implanted in their brains to monitor seizures, as they read aloud a set of 50 sentences containing 250 unique words. This allowed the first network to work out which neural activity correlated with which parts of speech.

In testing, the system relied only on the neural signals and achieved error rates below eight percent for two of the four subjects, which matches the kind of accuracy achieved by professional transcribers.
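
Those error rates are word error rates, the same metric used to grade human transcribers. As a hedged aside, the standard Levenshtein-style computation looks like this (not the paper's evaluation code, and the example sentences are invented):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the ladder was used to rescue the cat",
                      "the ladder was used to rescue the cot"))  # 0.125
```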

Inevitably, there are caveats. Firstly, the system was only able to decode 30-50 specific sentences using a limited vocabulary of 250 words. It also requires people to have electrodes implanted in their brains, which is currently only permitted for a limited number of highly specific medical reasons. However, there are a number of signs that this direction holds considerable promise.

One concern was that because the system was being tested on sentences that were included in its training data, it might simply be learning to match specific sentences to specific neural signatures. That would suggest it wasn’t really learning the constituent parts of speech, which would make it harder to generalize to unfamiliar sentences.

But when the researchers added another set of recordings to the training data that were not included in testing, it reduced error rates significantly, suggesting that the system is learning sub-sentence information like words.

They also found that pre-training the system on data from the volunteer who achieved the highest accuracy, before training on data from one of the worst performers, significantly reduced error rates. This suggests that in practical applications much of the training could be done before the system is given to the end user, who would only have to fine-tune it to the quirks of their own brain signals.
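
In software terms this is ordinary transfer learning: train at length on one participant's data, then fine-tune briefly on the new user's. The sketch below only illustrates the shape of that workflow; the dataset names, epoch counts, and learning rates are assumptions rather than the study's settings.

```python
# Hedged sketch of the cross-subject transfer idea: pretrain on one
# participant's recordings, then fine-tune briefly on a new participant.
import torch

def train(model, dataset, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for neural, prev_tokens, targets in dataset:
            logits = model(neural, prev_tokens)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Pretrain at length on the best-performing participant's data (hypothetical names):
# train(model, participant_a_data, epochs=50, lr=1e-3)
# ...then fine-tune briefly, at a lower learning rate, on the new user's data:
# train(model, participant_b_data, epochs=5, lr=1e-4)
```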

The vocabulary of such a system is likely to improve considerably as people build upon this approach—but even a limited palette of 250 words could be incredibly useful to a paraplegic, and could likely be tailored to a specific set of commands for telepathic control of other devices.

Now the ball is back in the court of the scrum of companies racing to develop the first practical neural interfaces.

Synthetic speech generated from brain recordings


Illustrations of electrode placements on the research participants’ neural speech centers, from which activity patterns recorded during speech (colored dots) were translated into a computer simulation of the participant’s vocal tract (model, right) which then could be synthesized to reconstruct the sentence that had been spoken (sound wave & sentence, below). Credit: Chang lab / UCSF Dept. of Neurosurgery

A state-of-the-art brain-machine interface created by UC San Francisco neuroscientists can generate natural-sounding synthetic speech by using brain activity to control a virtual vocal tract—an anatomically detailed computer simulation including the lips, jaw, tongue, and larynx. The study was conducted in research participants with intact speech, but the technology could one day restore the voices of people who have lost the ability to speak due to paralysis and other forms of neurological damage.

Stroke, traumatic brain injury, and neurodegenerative diseases such as Parkinson’s disease, multiple sclerosis, and amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease) often result in an irreversible loss of the ability to speak. Some people with severe speech disabilities learn to spell out their thoughts letter-by-letter using assistive devices that track very small eye or facial muscle movements. However, producing text or synthesized speech with such devices is laborious, error-prone, and painfully slow, typically permitting a maximum of 10 words per minute, compared to the 100-150 words per minute of natural speech.

The new system being developed in the laboratory of Edward Chang, MD—described April 24, 2019 in Nature—demonstrates that it is possible to create a synthesized version of a person’s voice that can be controlled by the activity of their brain’s speech centers. In the future, this approach could not only restore fluent communication to individuals with severe speech disability, the authors say, but could also reproduce some of the musicality of the human voice that conveys the speaker’s emotions and personality.

“For the first time, this study demonstrates that we can generate entire spoken sentences based on an individual’s brain activity,” said Chang, a professor of neurological surgery and member of the UCSF Weill Institute for Neuroscience. “This is an exhilarating proof of principle that with technology that is already within reach, we should be able to build a device that is clinically viable in patients with speech loss.”

Brief animation illustrates how patterns of brain activity from the brain’s speech centers in somatosensory cortex (top left) were first decoded into a computer simulation of a research participant’s vocal tract movements (top right), which were then translated into a synthesized version of the participant’s voice (bottom). Credit: Chang lab / UCSF Dept. of Neurosurgery. Simulated vocal tract animation credit: Speech Graphics

Virtual Vocal Tract Improves Naturalistic Speech Synthesis

The research was led by Gopala Anumanchipalli, Ph.D., a speech scientist, and Josh Chartier, a bioengineering graduate student in the Chang lab. It builds on a recent study in which the pair described for the first time how the human brain’s speech centers choreograph the movements of the lips, jaw, tongue, and other vocal tract components to produce fluent speech.

From that work, Anumanchipalli and Chartier realized that previous attempts to directly decode speech from brain activity might have met with limited success because these brain regions do not directly represent the acoustic properties of speech sounds, but rather the instructions needed to coordinate the movements of the mouth and throat during speech.

“The relationship between the movements of the vocal tract and the speech sounds that are produced is a complicated one,” Anumanchipalli said. “We reasoned that if these speech centers in the brain are encoding movements rather than sounds, we should try to do the same in decoding those signals.”

In their new study, Anumanchipalli and Chartier asked five volunteers being treated at the UCSF Epilepsy Center—patients with intact speech who had electrodes temporarily implanted in their brains to map the source of their seizures in preparation for neurosurgery—to read several hundred sentences aloud while the researchers recorded activity from a brain region known to be involved in language production.

Based on the audio recordings of participants’ voices, the researchers used linguistic principles to reverse engineer the vocal tract movements needed to produce those sounds: pressing the lips together here, tightening vocal cords there, shifting the tip of the tongue to the roof of the mouth, then relaxing it, and so on.

This detailed mapping of sound to anatomy allowed the scientists to create a realistic virtual vocal tract for each participant that could be controlled by their brain activity. The system comprised two “neural network” machine learning algorithms: a decoder that transforms brain activity patterns produced during speech into movements of the virtual vocal tract, and a synthesizer that converts these vocal tract movements into a synthetic approximation of the participant’s voice.
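
To make that two-stage design concrete, here is a minimal, hypothetical sketch of such a pipeline. The electrode count, the 33 articulatory features, and the mel-spectrogram output are illustrative assumptions, not the study's exact specification.

```python
# Illustrative two-stage pipeline in the spirit described above: one network
# maps neural activity to vocal-tract movements, a second maps movements to
# acoustic features. Shapes and layer choices are assumptions.
import torch
import torch.nn as nn

class ArticulationDecoder(nn.Module):
    """Stage 1: brain activity -> vocal tract kinematics (lip/tongue/jaw traces)."""
    def __init__(self, n_electrodes=256, n_articulators=33, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_electrodes, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_articulators)

    def forward(self, neural):               # (batch, time, electrodes)
        h, _ = self.rnn(neural)
        return self.head(h)                  # (batch, time, articulators)

class SpeechSynthesizer(nn.Module):
    """Stage 2: kinematics -> acoustic features (e.g., a mel spectrogram)."""
    def __init__(self, n_articulators=33, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_articulators, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, kinematics):
        h, _ = self.rnn(kinematics)
        return self.head(h)

decoder, synthesizer = ArticulationDecoder(), SpeechSynthesizer()
neural = torch.randn(1, 500, 256)            # fake recording, 500 time steps
audio_features = synthesizer(decoder(neural))
print(audio_features.shape)                  # torch.Size([1, 500, 80])
```

In the actual study the synthesizer's output was rendered into audible speech; the sketch stops at acoustic features to keep it short.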

The synthetic speech produced by these algorithms was significantly better than synthetic speech directly decoded from participants’ brain activity without the inclusion of simulations of the speakers’ vocal tracts, the researchers found. The algorithms produced sentences that were understandable to hundreds of human listeners in crowdsourced transcription tests conducted on the Amazon Mechanical Turk platform.

As is the case with natural speech, the transcribers were more successful when they were given shorter lists of words to choose from, as would be the case with caregivers who are primed to the kinds of phrases or requests patients might utter. The transcribers accurately identified 69 percent of synthesized words from lists of 25 alternatives and transcribed 43 percent of sentences with perfect accuracy. With a more challenging 50 words to choose from, transcribers’ overall accuracy dropped to 47 percent, though they were still able to understand 21 percent of synthesized sentences perfectly.

“We still have a ways to go to perfectly mimic spoken language,” Chartier acknowledged. “We’re quite good at synthesizing slower speech sounds like ‘sh’ and ‘z’ as well as maintaining the rhythms and intonations of speech and the speaker’s gender and identity, but some of the more abrupt sounds like ‘b’s and ‘p’s get a bit fuzzy. Still, the levels of accuracy we produced here would be an amazing improvement in real-time communication compared to what’s currently available.”

Artificial Intelligence, Linguistics, and Neuroscience Fueled Advance

The researchers are currently experimenting with higher-density electrode arrays and more advanced machine learning algorithms that they hope will improve the synthesized speech even further. The next major test for the technology is to determine whether someone who can’t speak could learn to use the system without being able to train it on their own voice and to make it generalize to anything they wish to say.


Image of an example array of intracranial electrodes of the type used to record brain activity in the current study. Credit: UCSF

Preliminary results from one of the team’s research participants suggest that the researchers’ anatomically based system can decode and synthesize novel sentences from participants’ brain activity nearly as well as the sentences the algorithm was trained on. Even when the researchers provided the algorithm with brain activity data recorded while one participant merely mouthed sentences without sound, the system was still able to produce intelligible synthetic versions of the mimed sentences in the speaker’s voice.

The researchers also found that the neural code for vocal movements partially overlapped across participants, and that one research subject’s vocal tract simulation could be adapted to respond to the neural instructions recorded from another participant’s brain. Together, these findings suggest that individuals with speech loss due to neurological impairment may be able to learn to control a speech prosthesis modeled on the voice of someone with intact speech.

“People who can’t move their arms and legs have learned to control robotic limbs with their brains,” Chartier said. “We are hopeful that one day people with speech disabilities will be able to learn to speak again using this brain-controlled artificial vocal tract.”

Added Anumanchipalli, “I’m proud that we’ve been able to bring together expertise from neuroscience, linguistics, and machine learning as part of this major milestone towards helping neurologically disabled patients.”

https://medicalxpress.com/news/2019-04-synthetic-speech-brain.html

Humans couldn’t pronounce ‘f’ and ‘v’ sounds before farming developed

By Alison George

Human speech contains more than 2000 different sounds, from the ubiquitous “m” and “a” to the rare clicks of some southern African languages. But why are certain sounds more common than others? A ground-breaking, five-year investigation shows that diet-related changes in human bite led to new speech sounds that are now found in half the world’s languages.

More than 30 years ago, the linguist Charles Hockett noted that speech sounds called labiodentals, such as “f” and “v”, were more common in the languages of societies that ate softer foods. Now a team of researchers led by Damián Blasi at the University of Zurich, Switzerland, has pinpointed how and why this trend arose.

They found that the upper and lower incisors of ancient human adults were aligned, making it hard to produce labiodentals, which are formed by touching the lower lip to the upper teeth. Later, our jaws changed to an overbite structure, making it easier to produce such sounds.

The team showed that this change in bite correlated with the development of agriculture in the Neolithic period. Food became easier to chew at this point, which led to changes in human jaws and teeth: for instance, because it takes less pressure to chew softer, farmed foods, the jawbone doesn’t have to do as much work and so doesn’t grow to be so large.

Analyses of a language database also confirmed that there was a global change in the sound of world languages after the Neolithic era, with the use of “f” and “v” increasing dramatically in recent millennia. These sounds are still not found in the languages of many hunter-gatherer people today.

This research overturns the prevailing view that all human speech sounds were present when Homo sapiens evolved around 300,000 years ago. “The set of speech sounds we use has not necessarily remained stable since the emergence of our species, but rather the immense diversity of speech sounds that we find today is the product of a complex interplay of factors involving biological change and cultural evolution,” said team member Steven Moran, a linguist at the University of Zurich, at a briefing about this study.

This new approach to studying language evolution is a game changer, says Sean Roberts at the University of Bristol, UK. “For the first time, we can look at patterns in global data and spot new relationships between the way we speak and the way we live,” he says. “It’s an exciting time to be a linguist.”

Journal reference: Science, DOI: 10.1126/science.aav3218

https://www.newscientist.com/article/2196580-humans-couldnt-pronounce-f-and-v-sounds-before-farming-developed/

Foreign Accent Syndrome

“Foreign Accent Syndrome” (FAS) is a rare disorder in which patients start to speak with a foreign or regional accent. This striking condition is often associated with brain damage, such as stroke. Presumably, the lesion affects the neural pathways by which the brain controls the tongue and vocal cords, producing strange-sounding speech.

Yet there may be more to FAS than meets the eye (or ear). According to a new paper in the Journal of Neurology, Neurosurgery and Psychiatry, many or even most cases of FAS are ‘functional’, meaning that the cause of the symptoms lies in psychological processes rather than a brain lesion.

To reach this conclusion, authors Laura McWhirter and colleagues recruited 49 self-described FAS sufferers from two online communities to participate in a study. All were English-speaking. The most common reported foreign accents were Italian (12 cases), Eastern European (11), French (8), and German (7), but more obscure accents were also reported, including Dutch, Nigerian, and Croatian.

Participants submitted a recording of their voice for assessment by speech experts, as well as answering questions about their symptoms, other health conditions, and personal situation. McWhirter et al. classified 35 of the 49 patients (71%) as having ‘probably functional’ FAS, while only 10/49 (20%) were said to probably have a neurological basis, with the rest unclear.

These classifications are somewhat subjective, in that there are no hard-and-fast criteria for functional FAS. None of the ‘functional’ cases reported hard evidence of neurological damage from a brain scan, and even among the ‘neurological’ cases only 50% reported such evidence. The presence of other ‘functional’ symptoms, such as irritable bowel syndrome (IBS), was higher in the ‘functional’ group.

In terms of the characteristics of the foreign accents, patients with a presumed functional origin often presented with speech patterns that were inconsistent or variable: for instance, pronouncing ‘cookie jar’ as ‘tutty dar’ while still correctly producing the ‘j’, ‘k’, ‘g’, and ‘sh’ sounds in other words.

But if FAS is often a psychological disorder, what is the psychology behind it? McWhirter et al. don’t get into this, but it is interesting to note that FAS is a media-friendly condition. In recent years there have been many news stories dedicated to individual FAS cases. To take just three:

American beauty queen with Foreign Accent Syndrome sounds IRISH, AUSTRALIAN and BRITISH
https://www.mirror.co.uk/news/weird-news/you-sound-like-spice-girl-11993052

Scouse mum regains speech after stroke – but is shocked when her accent turns Russian
https://www.liverpoolecho.co.uk/news/liverpool-news/scouse-mum-regains-speech-after-15931862

Traumatic car accident victim has Irish accent after suffering severe brain injury
https://www.irishcentral.com/news/brain-injury-foreign-accent-syndrome

http://blogs.discovermagazine.com/neuroskeptic/2019/03/09/curious-foreign-accent-syndrome/#.XI58R6BKiUn

Neuroscientists Translate Brain Waves Into Recognizable Speech

by George Dvorsky

Using brain-scanning technology, artificial intelligence, and speech synthesizers, scientists have converted brain patterns into intelligible verbal speech—an advance that could eventually give voice to those without.

It’s a shame Stephen Hawking isn’t alive to see this, as he may have gotten a real kick out of it. The new speech system, developed by researchers at the ​Neural Acoustic Processing Lab at Columbia University in New York City, is something the late physicist might have benefited from.

Hawking had amyotrophic lateral sclerosis (ALS), a motor neuron disease that took away his verbal speech, but he continued to communicate using a computer and a speech synthesizer. By using a cheek switch affixed to his glasses, Hawking was able to pre-select words on a computer, which were read out by a voice synthesizer. It was a bit tedious, but it allowed Hawking to produce around a dozen words per minute.

But imagine if Hawking didn’t have to manually select and trigger the words. Indeed, some individuals, whether they have ALS, locked-in syndrome, or are recovering from a stroke, may not have the motor skills required to control a computer, even by just a tweak of the cheek. Ideally, an artificial voice system would capture an individual’s thoughts directly to produce speech, eliminating the need to control a computer.

New research published today in Scientific Reports takes us an important step closer to that goal, but instead of capturing an individual’s internal thoughts to reconstruct speech, it uses the brain patterns produced while listening to speech.

To devise such a speech neuroprosthesis, neuroscientist Nima Mesgarani and his colleagues combined recent advances in deep learning with speech synthesis technologies. Their resulting brain-computer interface, though still rudimentary, captured brain patterns directly from the auditory cortex, which were then decoded by an AI-powered vocoder, or speech synthesizer, to produce intelligible speech. The speech was very robotic sounding, but nearly three in four listeners were able to discern the content. It’s an exciting advance—one that could eventually help people who have lost the capacity for speech.

To be clear, Mesgarani’s neuroprosthetic device isn’t translating an individual’s covert speech—that is, the thoughts in our heads, also called imagined speech—directly into words. Unfortunately, we’re not quite there yet in terms of the science. Instead, the system captured an individual’s distinctive cognitive responses as they listened to recordings of people speaking. A deep neural network was then able to decode, or translate, these patterns, allowing the system to reconstruct speech.

“This study continues a recent trend in applying deep learning techniques to decode neural signals,” Andrew Jackson, a professor of neural interfaces at Newcastle University who wasn’t involved in the new study, told Gizmodo. “In this case, the neural signals are recorded from the brain surface of humans during epilepsy surgery. The participants listen to different words and sentences which are read by actors. Neural networks are trained to learn the relationship between brain signals and sounds, and as a result can then reconstruct intelligible reproductions of the words/sentences based only on the brain signals.”

Epilepsy patients were chosen for the study because they often have to undergo brain surgery. Mesgarani, with the help of Ashesh Dinesh Mehta, a neurosurgeon at Northwell Health Physician Partners Neuroscience Institute and a co-author of the new study, recruited five volunteers for the experiment. The team used invasive electrocorticography (ECoG) to measure neural activity as the patients listened to continuous speech sounds. The patients listened, for example, to speakers reciting digits from zero to nine. Their brain patterns were then fed into the AI-enabled vocoder, resulting in the synthesized speech.
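
For a feel of the underlying reconstruction problem, the sketch below is a deliberately simplified stand-in rather than the team's deep networks and vocoder: it fits a plain ridge regression from synthetic "neural" feature frames to spectrogram frames and scores the held-out reconstruction by correlation. The array shapes and variable names are assumptions.

```python
# Simplified stand-in for spectrogram reconstruction from neural features.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_frames, n_electrodes, n_freq_bins = 2000, 128, 32

neural = rng.standard_normal((n_frames, n_electrodes))        # stand-in high-gamma features
true_spectrogram = neural @ rng.standard_normal((n_electrodes, n_freq_bins))
true_spectrogram += 0.1 * rng.standard_normal(true_spectrogram.shape)  # noise

# Train on the first 80% of frames, reconstruct the held-out 20%.
split = int(0.8 * n_frames)
model = Ridge(alpha=1.0).fit(neural[:split], true_spectrogram[:split])
recon = model.predict(neural[split:])

corr = np.corrcoef(recon.ravel(), true_spectrogram[split:].ravel())[0, 1]
print(f"held-out reconstruction correlation: {corr:.2f}")
```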

The results were very robotic-sounding, but fairly intelligible. In tests, listeners could correctly identify spoken digits around 75 percent of the time. They could even tell if the speaker was male or female. Not bad, and a result that even came as “a surprise” to Mesgarani, as he told Gizmodo in an email.

Recordings of the synthesized speech are available online (the researchers tested various techniques, but the best result came from combining deep neural networks with the vocoder).

The use of a voice synthesizer in this context, as opposed to a system that can match and recite pre-recorded words, was important to Mesgarani. As he explained to Gizmodo, there’s more to speech than just putting the right words together.

“Since the goal of this work is to restore speech communication in those who have lost the ability to talk, we aimed to learn the direct mapping from the brain signal to the speech sound itself,” he told Gizmodo. “It is possible to also decode phonemes [distinct units of sound] or words, however, speech has a lot more information than just the content—such as the speaker [with their distinct voice and style], intonation, emotional tone, and so on. Therefore, our goal in this particular paper has been to recover the sound itself.”

Looking ahead, Mesgarani would like to synthesize more complicated words and sentences, and collect brain signals of people who are simply thinking or imagining the act of speaking.

Jackson was impressed with the new study, but he said it’s still not clear if this approach will apply directly to brain-computer interfaces.

“In the paper, the decoded signals reflect actual words heard by the brain. To be useful, a communication device would have to decode words that are imagined by the user,” Jackson told Gizmodo. “Although there is often some overlap between brain areas involved in hearing, speaking, and imagining speech, we don’t yet know exactly how similar the associated brain signals will be.”

William Tatum, a neurologist at the Mayo Clinic who was also not involved in the new study, said the research is important in that it’s the first to use artificial intelligence to reconstruct speech from the brain waves involved in generating known acoustic stimuli. The significance is notable, “because it advances application of deep learning in the next generation of better designed speech-producing systems,” he told Gizmodo. That said, he felt the sample size of participants was too small, and that the use of data extracted directly from the human brain during surgery is not ideal.

Another limitation of the study is that the neural networks, in order to do more than just reproduce the digits zero through nine, would have to be trained on a large number of brain signals from each participant. The system is patient-specific, as we all produce different brain patterns when we listen to speech.

“It will be interesting in future to see how well decoders trained for one person generalize to other individuals,” said Jackson. “It’s a bit like early speech recognition systems that needed to be individually trained by the user, as opposed to today’s technology, such as Siri and Alexa, that can make sense of anyone’s voice, again using neural networks. Only time will tell whether these technologies could one day do the same for brain signals.”

No doubt, there’s still lots of work to do. But the new paper is an encouraging step toward the achievement of implantable speech neuroprosthetics.

https://gizmodo.com/neuroscientists-translate-brain-waves-into-recognizable-1832155006

https://www.nature.com/articles/s41598-018-37359-z

Computers are now able to predict who will develop psychosis years later based on analysis of their speech patterns.

An automated speech analysis program correctly differentiated between at-risk young people who developed psychosis over a two-and-a-half year period and those who did not. In a proof-of-principle study, researchers at Columbia University Medical Center, New York State Psychiatric Institute, and the IBM T. J. Watson Research Center found that the computerized analysis provided a more accurate classification than clinical ratings. The study, “Automated Analysis of Free Speech Predicts Psychosis Onset in High-Risk Youths,” was recently published in NPJ-Schizophrenia.

About one percent of the population between the ages of 14 and 27 is considered to be at clinical high risk (CHR) for psychosis. CHR individuals have symptoms such as unusual or tangential thinking, perceptual changes, and suspiciousness. About 20% will go on to experience a full-blown psychotic episode. Identifying who falls into that 20% before psychosis occurs has been an elusive goal. Early identification could lead to intervention and support that could delay, mitigate, or even prevent the onset of serious mental illness.

Speech provides a unique window into the mind, giving important clues about what people are thinking and feeling. Participants in the study took part in an open-ended, narrative interview in which they described their subjective experiences. These interviews were transcribed and then analyzed by computer for patterns of speech, including semantics (meaning) and syntax (structure).

The analysis established each patient’s semantic coherence (how well he or she stayed on topic) and syntactic structure, such as phrase length and use of determiner words that link phrases. A clinical psychiatrist may intuitively recognize these signs of disorganized thought in a traditional interview, but a machine can augment what is heard by precisely measuring the variables. The participants were then followed for two and a half years.

The speech features that predicted psychosis onset included breaks in the flow of meaning from one sentence to the next, and speech characterized by shorter phrases with less elaboration. Strikingly, the speech classifier developed in this study to automatically sort these specific, symptom-related features achieved 100% accuracy: the computer analysis correctly distinguished the five individuals who later experienced a psychotic episode from the 29 who did not. These results suggest that this method may be able to identify thought disorder in its earliest, most subtle form, years before the onset of psychosis. Thought disorder is a key component of schizophrenia, but quantifying it has proved difficult.
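
As a rough illustration of what such features might look like in code, here is a simplified sketch that approximates semantic coherence with TF-IDF cosine similarity between consecutive sentences and adds a basic sentence-length measure; the study itself used more sophisticated semantic and syntactic measures.

```python
# Simplified coherence and length features from a transcribed interview.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coherence_and_length(transcript_sentences):
    vecs = TfidfVectorizer().fit_transform(transcript_sentences)
    # Similarity between each sentence and the one that follows it.
    sims = [cosine_similarity(vecs[i], vecs[i + 1])[0, 0]
            for i in range(vecs.shape[0] - 1)]
    lengths = [len(s.split()) for s in transcript_sentences]
    return {"min_coherence": float(min(sims)),
            "mean_coherence": float(np.mean(sims)),
            "mean_sentence_length": float(np.mean(lengths))}

print(coherence_and_length([
    "I went to the store yesterday.",
    "The store was out of bread.",
    "My cousin plays guitar on weekends.",
]))
```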

For the field of schizophrenia research, and for psychiatry more broadly, this opens the possibility that new technology can aid in prognosis and diagnosis of severe mental disorders, and track treatment response. Automated speech analysis is inexpensive, portable, fast, and non-invasive. It has the potential to be a powerful tool that can complement clinical interviews and ratings.

Further research with a second, larger group of at-risk individuals is needed to see if this automated capacity to predict psychosis onset is both robust and reliable. Automated speech analysis used in conjunction with neuroimaging may also be useful in reaching a better understanding of early thought disorder, and the paths to develop treatments for it.

http://medicalxpress.com/news/2015-08-psychosis-automated-speech-analysis.html