A System for Interconverting Spoken and Written Language (SYSTINTERCON) was conceived and described to others in 1995. The concept described here will convert a spoken sound to syntertext on the computer screen. That text will have encoded within it all the information that can be gleaned from the Fast Fourier Transform (FFT) of the sound. By reversing the process, the text can be converted to audible sounds that are an exact reproduction of the original spoken sound, including all nuances that identify the original speaker. The SYSTINTERCON process can be expanded to include the text encoding of a virtually limitless range of sound sources that human and other ears can hear.
The syntertext will be a new global text, as opposed to American Standard Code for Information Interchange (ASCII) text, and will include all phonemes of all spoken human and animal languages, all music, insect sounds, industrial sounds, and common sounds. There will be a one-to-one correspondence between each phoneme letter and its utterance. Developing this syntertext will be an enormous task and will require contributions from all corners of the world. The syntertext will include, as visual features, the emotions of the speaker, be they human or otherwise. Just from the appearance of the page or display, one can tell a threat from a love note. Part of the metadata of this text will be the name, address, and other identifying details of the speaker.
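As an illustration only, a single syntertext character might be represented in memory as in the following Python sketch; every field name here is a hypothetical design choice made for this example, not part of the invention as described.

```python
from dataclasses import dataclass, field

@dataclass
class SyntertextGlyph:
    """One syntertext character: a phoneme letter plus the acoustic
    data needed to reproduce its utterance exactly."""
    phoneme_id: str           # one-to-one code for the utterance
    spectrum: list[float]     # FFT magnitudes captured from the sound
    pitch_hz: float           # fundamental frequency (musical pitch)
    duration_ms: float        # elapsed time of the utterance
    emotion: str = "neutral"  # rendered as a visual feature of the glyph

@dataclass
class SyntertextDocument:
    glyphs: list[SyntertextGlyph] = field(default_factory=list)
    # Speaker metadata travels with the text, as described above.
    speaker: dict = field(default_factory=dict)  # e.g. name, address
```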
Human and animal voices are unique because of the shape of the vocal cavities and the way that shape changes as speech is uttered. The frequency (musical pitch) of the voice’s intonation is also an important part of the data that comprise a voiceprint.
A voiceprint is a sound spectrogram: a graph created by a Fast Fourier Transform (FFT) of the sound sample under study. The graph shows a sound’s frequency on the vertical axis and elapsed (real) time on the horizontal axis. The color shade or gray density of any given point on the two-dimensional graph indicates the acoustic intensity, or energy, of the sound.
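For concreteness, the following is a minimal sketch, assuming NumPy is available, of how such a spectrogram can be computed: the signal is cut into short overlapping windows, each window is passed through an FFT, and the resulting magnitudes form the time-frequency graph described above.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_size=1024, hop=256):
    """Magnitude spectrogram: rows are frequency bins (vertical axis),
    columns are time frames (horizontal axis), values are intensity."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * window
        # FFT of one short window; keep only the positive frequencies.
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.array(frames).T  # shape: (frequency bins, time frames)
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)  # Hz per row
    times = np.arange(spec.shape[1]) * hop / sample_rate       # seconds per column
    return spec, freqs, times
```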
[Figure: voiceprint (sound spectrogram) of the spoken word “articulation”.]

For this invention, the data presented by the Fast Fourier Transform will be maintained as digital numbers rather than presented as an image. This data will be supplied to an algorithm to populate variables in the PostScript formulas that describe the letters of syntertext. PostScript formulas can be used to describe anything presented on a computer screen in all its beautiful variations.
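As one hypothetical illustration of this step (the template, variable names, and mappings below are assumptions made for this example, not the claimed algorithm), a program could reduce each FFT frame to a few numbers and substitute them into a PostScript formula that draws the corresponding syntertext letter:

```python
import numpy as np

# Hypothetical PostScript template for one syntertext letter; the %f slots
# are the variables populated from the FFT data.
PS_GLYPH = "%f setgray /Helvetica %f selectfont %f %f moveto (%s) show\n"

def frame_to_postscript(fft_magnitudes, x, y, letter):
    """Reduce one FFT frame to PostScript parameters (illustrative mapping)."""
    energy = float(np.sum(fft_magnitudes))  # total acoustic energy
    centroid = float(np.sum(np.arange(len(fft_magnitudes)) * fft_magnitudes)
                     / max(energy, 1e-9))   # spectral centroid, a pitch proxy
    gray = max(0.0, 1.0 - energy / 100.0)   # louder sound -> darker glyph
    size = 8.0 + centroid / 10.0            # higher pitch -> larger glyph
    return PS_GLYPH % (gray, size, x, y, letter)
```

In this sketch, louder or higher-pitched sound changes the glyph’s darkness and size, so the appearance of the page itself carries the acoustic information, as described above.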
At this point in this patent description, the PostScript output will be limited to the written letter forms (fonts, colors, sizes, textures, etc.) of the chosen language: French, English, Chinese, Japanese, Swahili, Xhosa, Urdu, and even extinct languages such as Elamite. Spoken Elamite was written in the extinct Elamite cuneiform, a logo-syllabic script. As can be seen from the breadth of languages listed here, SYSTINTERCON will be applicable to all languages that have at least a spoken form. Converting a spoken-only language to syntertext would be a most useful function for SYSTINTERCON.
The following internet news item about Google’s Translatotron indicates that it uses much of the same spectrogram technology as SYSTINTERCON. Perhaps we can work together to realize a global language system.
Google’s Translate
By Victoria Bell For Mailonline
Published: 07:32 EDT, 17 May 2019 | Updated: 07:32 EDT, 17 May 2019
“Google’s Translate can now listen to a language and make it into an audio translation in the original speaker’s voice. The tool is able to convert language without the need for a text-based process. It also preserves the person’s original voice in the audio clip of the new language. Currently, Google Translate’s system uses automatic speech recognition, which transcribes speech that is then translated into the new language. Now it can directly translate speech from one language into speech in another language, without relying on a text representation in either language.
Google has announced a new translate tool which converts one language into another and preserves the speaker’s original voice. The tech giant’s new system works without the need to convert the speech to text first. A first of its kind, the tool is able to do this while retaining the voice of the original speaker and making it sound ‘more realistic’, the tech giant said. Google claims the system, dubbed ‘Translatotron’, will be able to retain the voice of the original speaker after translation while also understanding words better.
It can directly translate speech from one language into speech in another language, without relying on the intermediate text representation in either language, as is required in cascaded systems. ‘Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language,’ Google wrote in a blog post. Currently, Google Translate’s system uses three stages: automatic speech recognition, which transcribes speech as text; machine translation, which translates this text into another language; and text-to-speech synthesis, which uses this text to generate speech.
How Does ‘TRANSLATOTRON’ Work?
The tech giant now says it will use a single model without the need for text. ‘This system avoids dividing the task into separate stages,’ said the blog post by Google AI software engineers Ye Jia and Ron Weiss. According to the company, this will mean faster translation speed and fewer errors. Translatotron is based on a sequence-to-sequence network which takes source spectrograms, a visual representation of the sound waves, as input and generates spectrograms of the translated content in the target language. It also makes use of two other separately trained components: a neural vocoder that converts output spectrograms to waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker’s voice in the synthesized translated speech. During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms. However, no transcripts or other intermediate text representations are used during inference.
The system retains the speaker’s voice by using spectrograms, a visual representation of the soundwaves, as its input. It then generates translated spectrograms, also relying on a neural vocoder and a speaker encoder, meaning the speaker’s vocal characteristics stay the same once translated. Google admitted that the system needs refining through further training of the algorithm. Sound clips published in the post were more ‘realistic’ than a machine voice, but still unmistakably computer-generated.”
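To make the distinction the quoted article draws concrete, the sketch below contrasts the two designs; every function name is a placeholder for illustration and not an actual Google API.

```python
# Placeholder stages; each stands in for a trained model the article describes.
def automatic_speech_recognition(speech): ...
def machine_translation(text): ...
def text_to_speech(text): ...
def compute_spectrogram(speech): ...
def seq2seq_model(spectrogram, speaker_embedding=None): ...
def neural_vocoder(spectrogram): ...

# Cascaded design: three separately trained stages with text in the middle.
def cascaded_translate(source_speech):
    text = automatic_speech_recognition(source_speech)  # speech -> source text
    translated = machine_translation(text)              # source text -> target text
    return text_to_speech(translated)                   # target text -> target speech

# Translatotron-style design: spectrogram in, spectrogram out, no text step.
def direct_translate(source_speech, speaker_embedding=None):
    source_spec = compute_spectrogram(source_speech)
    # One sequence-to-sequence model maps source spectrograms directly to
    # target-language spectrograms; the optional speaker embedding conditions
    # it so the original speaker's voice is preserved.
    target_spec = seq2seq_model(source_spec, speaker_embedding)
    return neural_vocoder(target_spec)                  # spectrogram -> waveform
```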