Harry Le
- Dec 2, 2019

Emerging technologies and the cost of video localization

You’ve just been asked by a client to dub a short video with a tight turnaround time and a limited budget. How do you tell your client that it’s not possible due to the tight deadline and budget? Or do you?

Emerging technologies are being leveraged to simplify the video localization process. They are making the whole process faster, less expensive and more efficient. Advances in AI and machine learning technologies are now making important inroads in transcription and text-to-speech (TTS) while automation in dubbing can help expedite the post-production process as well. Transcription, TTS and dubbing are essential in video localization, yet they add to the project cost because traditionally they have been done manually. Automating these aspects of the process can and will make localizing a video much more viable than ever before.

Transcription

Manually transcribing speech into text has long been considered the best way to do audio transcriptions. It is perceived as more accurate because manual transcribers can slow the playback speed of the audio or video files and type at their own pace; however, it also takes more time and money because human effort is involved. With the advent of AI, machine (automated) transcription has become a faster, more economical and increasingly accurate form of audio transcription.

With automated transcription, the computer listens to and types out what’s being said in the audio or video files using speech-recognition technology. The accuracy is not perfect, but it is often close enough to be acceptable given that the turnaround time is much faster and the cost is much lower than manual transcription. A near-accurate draft also gives human editors and reviewers a solid starting point, so their work goes faster.
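One common way to put a number on "close enough" is word error rate (WER): the count of word substitutions, deletions and insertions needed to turn the machine transcript into a human reference transcript, divided by the number of words in the reference. The article itself does not name a metric, so treat the following as a minimal sketch; the function name and the simple lowercase, whitespace-based tokenization are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    human = "please upload the updated script"
    machine = "please upload the updated scrip"
    print(f"WER: {word_error_rate(human, machine):.2f}")
```

A lower score means less clean-up for the human reviewer. In practice, punctuation and number formatting would usually be normalized before comparing.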

A subfield of computer science and AI, natural language processing (NLP) aims to enable computers to understand, interpret and manipulate human language. Most NLP techniques rely on machine learning to derive meaning from human languages. NLP entails applying algorithms to identify and extract the rules of natural language so that unstructured language data is converted into a form that computers can understand. Once the speech has been converted to text, the computer uses algorithms to extract the meaning of each sentence and collect the essential data from it (known as text mining or text analytics). In general, the more data the computer collects and analyzes, the more accurate it becomes.

Limitations of automated transcriptions

AI and NLP are emerging technologies, and there is still much work to be done before a computer can transcribe human speech with the same accuracy as human transcribers. Automated transcription falls short of full accuracy mainly because spoken language is full of irregularities like pauses, filler words, mispronunciations and nonstandard grammar. This makes it tricky for AI to classify speech and recognize its patterns.

In addition to the complications spoken language poses to computers, other elements in the audio or video can affect the accuracy of the transcription. AI can’t account for all the ambiguities when there are multiple speakers in the audio: people interrupting each other, some people speaking more loudly than others, several people speaking at the same time and so on. Computers also struggle with audio that contains background noise such as traffic sounds. In addition, research has so far focused primarily on American and British accents, so automated transcription of speakers with other accents may be less accurate.

Although automated transcriptions have limitations, they are becoming more and more accepted as a viable alternative to traditional manual transcriptions, and they are now often preferred for projects with a tight turnaround time and budget. Where accuracy matters most, combining manual and automated transcription offers the best balance of accuracy and speed: the machine produces the first draft, and a human editor acts as a quality assurance mechanism to ensure accuracy at the time of delivery.
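The combined workflow can be sketched as a simple triage step. Many speech-recognition services report a confidence score for each segment, but the `Segment` structure, its field names and the 0.85 threshold below are hypothetical choices made for illustration, not any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: float  # where the segment begins in the audio
    text: str             # machine-generated transcript for the segment
    confidence: float     # 0.0 to 1.0, as reported by the recognizer (assumed)


def split_for_review(segments: list[Segment], threshold: float = 0.85):
    """Return (accepted, needs_review) based on recognizer confidence."""
    accepted: list[Segment] = []
    needs_review: list[Segment] = []
    for segment in segments:
        if segment.confidence < threshold:
            needs_review.append(segment)
        else:
            accepted.append(segment)
    return accepted, needs_review


if __name__ == "__main__":
    draft = [
        Segment(0.0, "Welcome to the training video.", 0.97),
        Segment(4.2, "Today we visit the gamba floor.", 0.61),
    ]
    accepted, needs_review = split_for_review(draft)
    for segment in needs_review:
        print(f"Review at {segment.start_seconds:.1f}s: {segment.text}")
```

The human editor then spends time only on the flagged segments, which is where most of the time and cost saving over fully manual transcription would come from.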

Text-to-speech

Instead of using a human voice to record the voiceover (VO) for the video, companies are now exploring the possibility of using TTS. Automated voices are becoming more and more natural-sounding and less robotic.

For certain usages, like interactive voice response or training videos with no presenter on-screen, using TTS is less costly and has a quicker turnaround time. There is no need to hire professional human voice talent, book studio time and hire sound engineering services — these costs are traditionally quite prohibitive and are the main reason why video localization with dubbing is usually substituted with subtitles instead. Moreover, if changes are needed, a rerecording is not instant — we need to rebook the voice talent’s time and may need to book the studio again. If the client wants to make additional changes, extra costs will be incurred.

With TTS, the cost is minimal, and in some instances, it’s free. And it’s instant — with a click of a button, you’ll get the voice recording in minutes. If there are changes to the script and a rerecording is needed, that’s fine. Just upload the updated script and generate the voice recording again.

TTS does present a couple of challenges that require post-engineering work. First, the pronunciation of a word or sound unit may differ between languages. For example, in the Japanese word genba, the first syllable is pronounced with a hard g sound as in getting, not a soft g sound as in gem. To make sure the automated voice pronounces it correctly, a Speech Synthesis Markup Language (SSML) tag can be inserted into the script that is sent to the TTS engine, like a special instruction. More information can be found on the W3C’s SSML 1.1 specification page.
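To make the SSML instruction concrete, here is a minimal sketch that wraps chosen words in the `<phoneme>` element defined by SSML 1.1. The function name `build_ssml`, the sample sentence and the IPA string used for genba are assumptions for illustration, and support for `<phoneme>` and for specific phonetic alphabets varies between TTS engines, so check your provider's documentation.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def build_ssml(script: str, pronunciations: dict[str, str], lang: str = "en-US") -> str:
    """Wrap each listed word in an SSML <phoneme> tag carrying its IPA string."""
    if pronunciations:
        words = "|".join(re.escape(word) for word in pronunciations)
        pattern = re.compile(rf"\b({words})\b")
        parts, last = [], 0
        # Single pass over the script so inserted markup is never re-matched.
        for match in pattern.finditer(script):
            word = match.group(1)
            parts.append(escape(script[last:match.start()]))
            parts.append(
                f"<phoneme alphabet=\"ipa\" ph={quoteattr(pronunciations[word])}>"
                f"{escape(word)}</phoneme>"
            )
            last = match.end()
        parts.append(escape(script[last:]))
        body = "".join(parts)
    else:
        body = escape(script)

    ssml = (
        f'<speak version="1.1" xmlns="{SSML_NAMESPACE}" '
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
    return ssml


if __name__ == "__main__":
    # The IPA transcription below is an approximation chosen for this example.
    print(build_ssml("Managers should visit the genba every week.", {"genba": "ɡemba"}))
```

Because the pronunciation lives in the script rather than in the audio, a corrected word only requires regenerating the recording from the updated markup.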

The second challenge TTS poses involves post-editing adjustment of the voice recording. Sometimes a recorded segment is slightly longer or shorter than the corresponding part of the video and does not sync perfectly. Minor post-editing adjustments are then required to tweak the TTS recording. Even so, the cost and time involved will still be less than with the traditional method.
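The size of that adjustment is simple arithmetic: divide the length of the recorded segment by the length of the video slot it has to fill. The sketch below assumes a hypothetical rule of thumb that a tempo change of up to 10 percent is acceptable; the right limit depends on the voice, the language and the content.

```python
def tempo_factor(audio_seconds: float, slot_seconds: float, max_change: float = 0.10):
    """Return (factor, fits): the playback-speed factor needed to fill the slot,
    and whether that factor stays within the allowed change."""
    if audio_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    # factor > 1 means the recording must be sped up, < 1 means slowed down.
    factor = audio_seconds / slot_seconds
    fits = abs(factor - 1.0) <= max_change
    return factor, fits


if __name__ == "__main__":
    factor, fits = tempo_factor(audio_seconds=6.3, slot_seconds=6.0)
    action = "time-stretch the recording" if fits else "shorten the script and regenerate"
    print(f"Speed factor {factor:.2f}: {action}")
```

When the factor falls outside the allowed range, editing the script and regenerating the TTS recording is usually the cheaper fix, since no rebooking is involved.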

Automated dubbing

When we think of dubbing, we think of movies dubbed in another language: you hear the audio in the target language, each character has a voice similar to the original, and the lip movements are synced with the target language. Dubbing for movies, TV commercials, TV shows and documentaries demands a more rigid production process and much stricter quality control, which makes automation a poor fit, at least for now. However, automation technology in dubbing has advanced so much that it can be used for videos that do not require such strict and exact results, such as short promo, training, eLearning and help videos.

Automated dubbing technology takes the TTS or human voice recording and synchronizes it automatically with the video. Traditionally, this post-production stage requires a sound engineer to manipulate the audio recording to match the corresponding segment in the video, which is a time-consuming and costly part of video localization. It means going through each segment of the recording and making it match the video, which is not always possible. To fix a synchronization issue, you can either edit the video and audio segment by segment or ask the voice talent to match the length of the video during the recording process. Even worse, you may need to redo the whole recording, which means rebooking the voiceover talent and the recording studio, and then having the sound engineer redo the synchronization.

Automation as a solution

As you can tell, the traditional way of dubbing is rigid and time-consuming, which makes it an area that benefits from automation. Proprietary technologies are now emerging that allow you to upload your audio recording and video files to a cloud-based platform, and in a matter of minutes, a synchronized video in another language is available for download. This eliminates the need to manipulate the sound waves by hand, cutting here and there to make the audio match the video. On top of that, these technologies can also mix in background music and sound effects. Automating the synchronization process reduces the hours and dollars the manual method usually requires, resulting in lower costs and shorter turnaround times.

Technological advances are providing an alternative to the labor-intensive work and high costs of manual processes in video localization. Years ago, having a script recorded by a machine meant a robotic-sounding voice reading in a stilted monotone. Now, AI and NLP have made it possible for machine-generated speech to sound far more natural. With these advancements in technology and innovation, dubbing a video is no longer a luxury reserved for companies with deep pockets. Dubbing is now a realistic option for videos that market your product, train your staff across the globe and deliver eLearning courses. These emerging technologies will only improve and become more sophisticated, so it’s time to stop thinking of dubbing a video as a luxury and start thinking of it as a necessity.