I Invented a Way to Speak to an AI While Keeping Your Privacy
Cutting-edge smart assistants, like GPT-4o, could be awesome for voice interaction with an AI, but voice interaction itself has its drawbacks:
- You may cringe at talking to a device in front of others, afraid of looking silly.
- Sometimes you are not supposed to talk at all, as in an office meeting (let alone talk on the phone).
- You don't want others to overhear private information, such as when dictating a phone number in a train car full of people.
I kept thinking about those issues and realized that perhaps the same AI that creates the problem could help with a solution. And I got an idea. I called it "Silent Voice."
With Silent Voice, you would put the phone in front of your mouth and speak your request, but without using your voice, not even a whisper.
How is that possible? Is it a form of lip-reading? No. Is it a way of amplifying any noise coming from your mouth? Nope. What is it, then?
How Silent Voice works
Silent Voice consists of an ultrasound generator and a small speaker that emits short ultrasound pulses. You activate Silent Voice, bring the small ultrasound speaker close to your mouth, and start speaking normally.
Well, not exactly "speaking normally," because that would defeat the purpose. You speak almost normally but without using your voice. You don't even need to whisper: Silent Voice doesn't work with audible sound at all. It uses the ultrasound that enters your mouth and bounces back, reflected and disturbed in multiple, complex ways.
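To make the idea concrete, here is a minimal Python sketch of what the emitted pulse might look like in software. The 40 kHz carrier, the 2 ms pulse length, and the sampling rate are illustrative assumptions of mine, not specifications from the actual design:

```python
import numpy as np

# Illustrative parameters (my assumptions, not the actual design):
FS = 192_000       # sampling rate in Hz, high enough to capture ultrasound
CARRIER = 40_000   # 40 kHz carrier, well above the audible range
PULSE_MS = 2       # one short 2 ms pulse

def make_pulse(fs=FS, carrier=CARRIER, pulse_ms=PULSE_MS):
    """Generate one short ultrasound pulse with a smooth (Hann) envelope."""
    t = np.arange(int(fs * pulse_ms / 1000)) / fs
    envelope = np.hanning(t.size)  # smooth edges avoid audible clicks
    return envelope * np.sin(2 * np.pi * carrier * t)

# The echo the microphone picks up is this pulse delayed, attenuated, and
# distorted by the vocal tract; here we fake a toy "echo" just to illustrate:
pulse = make_pulse()
echo = 0.3 * np.roll(pulse, 50) + 0.01 * np.random.randn(pulse.size)
```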
The critical part of Silent Voice is that the reflected ultrasound, distorted by the vocal tract (the mouth's internal parts, such as the tongue), is picked up by a microphone, digitized, and passed to a Machine Learning classifier. The classifier predicts which "phoneme" corresponds to a given vector (a phoneme is an elemental unit of sound, roughly analogous to a letter).
Once phonemes are predicted, Silent Voice uses standard speech recognition technology to map them to the corresponding letters and words. The recognized text is then delivered to the operating system and on to whatever application you are using, such as WhatsApp. In the end, WhatsApp shows the text you dictated without using your voice.
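Here is a hedged sketch of that inference chain in Python. The feature list, the trained classifier `clf`, and the tiny phoneme-to-word lexicon are all hypothetical stand-ins; a real system would use far richer features and a proper speech-recognition decoder:

```python
import numpy as np

def extract_features(echo: np.ndarray) -> np.ndarray:
    """Toy feature vector: overall intensity plus a few spectral magnitudes.
    (A real system would compute many more, carefully chosen features.)"""
    spectrum = np.abs(np.fft.rfft(echo))
    return np.concatenate([[echo.std()], spectrum[:8]])

def echoes_to_text(echoes, clf, lexicon):
    """Classify each echo into a phoneme, then decode phonemes into a word."""
    phonemes = [clf.predict(extract_features(e).reshape(1, -1))[0]
                for e in echoes]
    key = " ".join(phonemes)
    # A dictionary lookup stands in for real speech-recognition decoding:
    return lexicon.get(key, "<unknown>")

# Hypothetical phoneme-to-word table (real decoding is far more complex):
lexicon = {"HH AH L OW": "hello"}
```

Here `clf` would be a classifier already trained on tagged echoes; how that training works is exactly the subject of the next section.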
TLDR version of "Silent Voice"
Somebody told me that "Silent Voice" can be described by the following equation:
Silent Voice = Ultrasound echo + Machine Learning
That's it.
Easier said than done, of course.
The Machine Learning process
One critical phase is the classification of phonemes from the ultrasound echo picked up by the microphone. In (supervised) Machine Learning, there are several phases, which I'll explain below, adapted to the Silent Voice case (a runnable sketch of the whole pipeline follows the list):
- Raw data is collected, with "tags" indicating to which class each sample belongs. In Silent Voice, each sample contains the echo (digitized signal) of a single ultrasound pulse. The tag is an identification (provided by a human) of which phoneme the user pronounced at that exact moment.
- Features are extracted from the ultrasound samples, so each sample is converted into a vector of signal characteristics, such as its intensity and many more (this part is too technical to describe in detail here). The result of this phase is a matrix called the "dataset," where the columns are the calculated features and the rows are the samples.
- The dataset is partitioned into "train" and "test" parts.
- Using the training partition, a previously chosen classifier is trained (more on this below). Training is computationally intensive, but fortunately, training procedures are highly optimized in standard platforms like Google's Colab.
- The classifier's predictions for the test partition are calculated, and then the predictions' quality is evaluated using standard metrics like "accuracy," "precision," "recall," and many more.
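As promised, here is a runnable sketch of the whole pipeline using scikit-learn. Since I can't share real ultrasound data, the feature matrix and phoneme tags below are random placeholders, there only so the code runs end to end:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Placeholder dataset: one row of features per ultrasound echo,
# one human-provided phoneme tag per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))                    # the "dataset" matrix
y = rng.choice(["AH", "S", "T", "N"], size=1000)  # phoneme tags

# Partition into train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train a previously chosen classifier (a Random Forest here).
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Evaluate predictions on the test partition with standard metrics.
pred = clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro"))
print("recall   :", recall_score(y_test, pred, average="macro"))
```

With random tags, the metrics hover around chance level, of course; with real, tagged echoes, they would tell you whether the classifier actually learned the phonemes.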
The choice of the exact classifier (SVM, Random Forest, Neural Nets, etc.) can have a big impact on the quality of the predictions, so several are tried in practice to see which one works best. This is a highly empirical process.
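That empirical comparison can be as simple as looping over candidate models with cross-validation, as in this sketch (reusing the placeholder X and y from above):

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Try several candidate classifiers on the same dataset and keep the best.
candidates = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Neural Net": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```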
Once the classifier is trained and its performance verified, we can use it to predict, in the case of Silent Voice, the phoneme corresponding to the last ultrasound echo collected by the microphone. This information is pure gold.
What would Silent Voice look like in real life?
Silent Voice is mostly intended to enhance smartphones, which means the user wouldn't see anything different from a regular phone. The "augmented" smartphones would have an ultrasound speaker (which could be an adapted version of the speaker they already have at the bottom) and an ultrasound microphone near the speaker to collect the echo. This is the "embodiment" I presented in the patent application I mention at the bottom of this post. The Machine Learning parts would most likely run as plain software on advanced smartphones, perhaps with a few additional chips.
In principle, it is possible to build a "Silent Voice peripheral" with the ultrasound speaker and mic, plus all the electronic processing needed to obtain the text, which would then be sent to a phone or a computer via Bluetooth. But from a cost point of view, it makes much more sense to integrate Silent Voice into the smartphone you carry with you anyway.
Other Silent Voice use cases
I think the most relevant application of Silent Voice is dictating chat messages or emails, even with people around you, without losing privacy or looking silly or weird when talking to a device.
But there are other scenarios where Silent Voice could be a lifesaver:
- If you get a phone call in an extremely noisy environment, you'll have to yell to make yourself understood… unless you use Silent Voice. With it, you can take the call, and the system will replace the ambient sound with an artificial voice pronouncing the same phonemes you are mouthing (see the sketch after this list). Eventually, it will even be possible to use a voice that imitates your own.
- There are people who, due to an illness or accident, have damaged their throats and cannot produce their voice normally, or at all. The Silent Voice system could allow them to produce speech simply by moving the oral cavity, without needing to correct the throat problem. Although the number of people with voice loss is not very high, their cases matter because voice loss can be a disabling injury.
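As a toy illustration of the noisy-call scenario, the sketch below voices the decoded text with an off-the-shelf offline text-to-speech engine (pyttsx3). This is only a stand-in: the real system would synthesize speech phoneme by phoneme in real time, eventually with a voice imitating yours:

```python
import pyttsx3  # offline text-to-speech engine, used here as a stand-in

def speak_for_me(decoded_text: str) -> None:
    """Replace the noisy microphone feed with an artificial voice
    pronouncing the text decoded from the silently mouthed phonemes."""
    engine = pyttsx3.init()
    engine.say(decoded_text)
    engine.runAndWait()

speak_for_me("I'll call you back in ten minutes.")
```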
Where I got the idea from
Previously, I worked as an AI researcher at a university, and one of my PhD students (Edgar) worked on a way of detecting (even counting) human bodies from outside a room using mostly regular WiFi. I should clarify that we (Edgar and I) are not spies or anything of the sort. The problem was simply interesting, and we made a nice contribution to the area by reporting the findings in an academic paper. I'm not going to bore you with the paper itself: you can read a very digestible account of the techniques involved in the post "Do You Know that Human Bodies Can Be Seen From Outside a Room with WiFi?" published here on Medium.
I'm recalling that research work about detecting human bodies because the technique used in "Silent Voice" is basically the same. The idea is:
(…) human bodies produce characteristic disturbances on electromagnetic signals traversing them, making it possible to analyze the disturbances for different purposes—counting people in our case.
I thought this idea of analyzing disturbances could be applied to detecting the configuration of the mouth, including the jaw, the tongue, and everything else, without explicitly modeling the vocal tract positions.
What if we send an acoustic signal, like an ultrasound pulse, into the mouth, then pick up the ultrasound coming back with a microphone and compare it with patterns corresponding to the different letters we pronounce (more exactly, the phonemes)?
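In its most naive form, "comparing with patterns" is just template matching: store one reference echo per phoneme and pick the closest one by normalized correlation. The sketch below shows that baseline (with made-up templates); the Machine Learning classifier described earlier is what replaces this crude matching with something that can actually generalize:

```python
import numpy as np

def closest_phoneme(echo, templates):
    """Pick the stored phoneme template whose normalized correlation
    with the incoming echo is highest."""
    def ncc(a, b):
        a = (a - a.mean()) / (a.std() + 1e-9)
        b = (b - b.mean()) / (b.std() + 1e-9)
        return float(np.dot(a, b)) / a.size
    return max(templates, key=lambda ph: ncc(echo, templates[ph]))

# Hypothetical usage with two made-up, equal-length templates:
templates = {"AH": np.sin(np.linspace(0, 20, 384)),
             "S": np.random.default_rng(1).normal(size=384)}
print(closest_phoneme(templates["AH"] + 0.1, templates))  # prints "AH"
```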
Is Silent Voice entirely original?
No: Silent Voice is one of what are called "Silent Speech Interfaces." These are intended, well, to convey speech information without sound, but the specific solutions are wildly varied:
- Some use medical ultrasound imaging equipment attached to the chin; the method then analyzes the images.
- "Non-audible murmur" technology, which tries to amplify whispers.
- Electromagnetic and radar analysis of vocal tract activity.
- Surface electromyography and encephalographic sensors.
- Brain implants!
Silent Voice differs by relying entirely on Machine Learning analysis of ultrasound disturbances instead of "analytical" solutions (that is, mathematical models that try to predict the exact tongue position, and so on). Analytical solutions are extremely hard to develop; I avoided them not out of laziness but because I wanted a data-driven, Machine Learning-based solution.
In most situations where you can have lots of data, Data Science trumps mathematical ad-hoc models any day of the week.
Final remarks
I registered Silent Voice as a Provisional Patent Application at the USPTO, which assigned it the number 63/637,554. A provisional application is the most basic form of intellectual property protection for my idea.
Silent Voice is in the concept phase, meaning that it hasn't been implemented as a working prototype. That is because I no longer work in an AI lab; I retired from my full-time job five years ago. So I can't know exactly how well Silent Voice would work.
But I believe Silent Voice can work because it follows a data-driven process that has succeeded in many recent projects, particularly the person-detection one mentioned above.
Perhaps you think that talking to a device, even silently, makes you look silly anyway. But normally, you'd hold the phone (and your hand) in front of your mouth (as in the figure at the top), and this would conceal your "talking." In the end, only real-life usage of Silent Voice will reveal its associated social concerns.
What I intend to do with Silent Voice is hand this tech to somebody who can implement it and make it as impactful as I think it could be.
With Silent Voice, in a few years, we could all have conversations with AI systems without losing our privacy. Yes, voice is a great way to communicate with AI, but not at the expense of our privacy.
— Get my personally curated AI news analysis and tech explainers with my short free newsletter, "The Skeptic AI Enthusiast," at https://rafebrena.substack.com/