Virtual assistants speaking in computer-synthesized voices have become commonplace. In the Amazon online store the virtual assistant “Alexa”, in the Microsoft environment “Cortana”, and in the Apple environment “Siri” help buyers find their way. A Latvian-speaking virtual assistant could be called Sandra, because it was based on the voice of Latvian radio legend Sandra Glāzupa.
Speech synthesis is a technology that converts written text into speech. Speech recognition technology, by contrast, converts speech into text. Several years ago, scientists chose Sandra’s voice as the basis for digitized Latvian speech for two reasons: first, her professionally trained diction and precise intonation; second, the Radio Archives hold a wealth of her recordings.
Simply put, Sandra has become a kind of teacher who teaches a computer to speak correct Latvian. The radio announcer agreed to take part in the project because she is convinced that her current “student” can become a good helper for others. “Everything in this sphere is constantly evolving. For many people who work with and learn Latvian, it may well provide a lot of help and support,” says Glāzupa.
Alongside Sandra, a team of more than 10 PhDs in artificial intelligence based in Riga, Tallinn and Vilnius is training the system. They specialize in areas such as natural language processing, speech recognition and synthesis, linguistics and machine learning. Several years ago, the program mechanically stitched together the syllables and letters spoken by the narrator; now she learns to speak from the text and audio samples she is given.
“Well, that is already artificial intelligence. The machine learns from examples. We have given her an example of how one person speaks, and she learns to imitate that person and speak in the same way. She can pronounce any text, even text the person never said, and it sounds the same,” says Raivis Skadiņš, research and development director of “Tilde”.
Beyond scientific interest, Tilde began developing voice technology 15 years ago to help blind people communicate and use computers. Over the years, it has grown into a sought-after product. Speech recognition is used for media monitoring, analysis of various audio recordings, recording meetings and transcribing interviews.
“Speech synthesis, on the other hand, is most widely used in robocalls, or automated calls, where people are reminded of their obligations, offered something or warned about something,” says Kaspars Kauliņš, Tilde’s business development director.
The most widely heard example in Latvia may be Tet’s smart assistant Anete, which reminds customers of unpaid bills by phone. Simply put, “Anete” is like Sandra’s daughter. Still, voice technology has room to grow.
“Here and there we might want better intonation. Here and there there are some peculiarities in pronunciation. Latvian has a wide and a narrow ‘e’, and she still gets them wrong. And then there are emotions. Currently the synthesizer speaks in a neutral voice, but we would like it to convey excitement, a negative touch or fear. We cannot yet do that with these voices. This is the direction of future development,” says Skadiņš.
However, no matter what heights computer speech reaches in the future, the living owner of the computer voice is convinced that real announcers will not be replaced any time soon. “Although the synthesized voice currently has intonations similar to the ones I use every day, and it really is recognizable by them, a living person at the microphone cannot and will not be replaced,” Glāzupa points out.