

However, it’s worth noting that much of the scientific literature on the topic is generally circumspect, not least because even well-intentioned, objective research in this sphere risks crossing over into racial profiling and the perpetuation of existing stereotypes.

The different ways that Japanese natives (as well as certain other West and East Asian natives) use facial expressions that play against the literal content of their speech already make them a greater challenge for sentiment analysis systems. Though there is currently no empirical consensus on which languages are the most difficult to lip-read in the complete absence of audio, Japanese is a prime contender. Besides the problem of eliciting an accurate transcript in real time, the challenge of interpreting speech from video deepens as you remove helpful context, such as audio, well-lit ‘face-on’ footage, and a language/culture where the phonemes/visemes are relatively distinct.
In 2017, Lip Reading Sentences in the Wild, a collaboration between Oxford University and Google’s AI research division, produced a lip-reading AI capable of correctly inferring 48% of speech in video without sound, where a human lip-reader could only reach 12.4% accuracy on the same material. The model was trained on thousands of hours of BBC TV footage. This work followed on from a separate Oxford/Google initiative from the previous year, entitled LipNet: a neural network architecture that mapped video sequences of variable length to text sequences using Gated Recurrent Units (GRUs), which add gating functionality to the base architecture of a Recurrent Neural Network (RNN). That model achieved a 4.1x improvement in performance over human lip-readers.
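As a rough illustration of that architectural idea, the sketch below (plain PyTorch, not the published LipNet code; the linear front end, feature dimensions and character vocabulary are invented placeholders) shows how a GRU can map a variable-length sequence of video-frame features to per-frame character probabilities, trained with CTC loss so that no frame-level alignment between video and transcript is required.

```python
import torch
import torch.nn as nn

class LipReadingSketch(nn.Module):
    """Toy GRU-based video-to-text model (illustrative only, not LipNet)."""
    def __init__(self, frame_feat_dim=512, hidden=256, vocab_size=28):
        super().__init__()
        # A real system extracts frame features with (spatiotemporal) convolutions;
        # a single linear layer stands in for that front end here.
        self.frontend = nn.Linear(frame_feat_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(hidden * 2, vocab_size)  # 26 letters + space + CTC blank

    def forward(self, frames):                      # frames: (batch, time, frame_feat_dim)
        x = torch.relu(self.frontend(frames))
        x, _ = self.gru(x)                          # (batch, time, 2 * hidden)
        return self.classifier(x).log_softmax(-1)   # per-frame log-probabilities

model = LipReadingSketch()
frames = torch.randn(4, 75, 512)                    # 4 dummy clips of 75 frames each
log_probs = model(frames).transpose(0, 1)           # CTCLoss expects (time, batch, vocab)
targets = torch.randint(1, 28, (4, 20))             # dummy character targets (0 is the blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((4,), 75), torch.full((4,), 20))
print(loss.item())
```

The CTC objective is what allows the transcript to be much shorter than the number of frames without any hand-aligned labels.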
Machine-driven lip-reading has been an active and ongoing area of computer vision and NLP research over the last two decades. Among many other examples and projects, in 2006 automated lip-reading software captured headlines when it was used to interpret what Adolf Hitler was saying in some of the famous silent films taken at his Bavarian retreat, though the application seems to have vanished into obscurity since (twelve years later, Sir Peter Jackson resorted to human lip-readers to restore the conversations in WW1 footage for the restoration project They Shall Not Grow Old).

Above: traditional sequence-to-sequence methods in a character model; below: the addition of viseme character modeling in the Tehran research model.

The Tehran research also incorporates a grapheme-to-phoneme converter. The model was applied without visual context to the LRS3-TED dataset, released by Oxford University in 2018, with the worst word error rate (WER) obtained being a respectable 24.29%. In a test against the 2017 Oxford research Lip Reading Sentences in the Wild (discussed above), the Video-To-Viseme method achieved a word error rate of 62.3%, compared to 69.5% for the Oxford method. The researchers conclude that the use of a higher volume of text information, combined with grapheme-to-phoneme and viseme mapping, promises improvements over the state of the art in automated lip-reading systems, while acknowledging that the methods used may produce even better results when incorporated into more sophisticated current frameworks.
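To make the grapheme-to-phoneme and phoneme-to-viseme idea concrete, here is a toy sketch; the word lexicon and viseme grouping below are simplified illustrations, not the mapping used in the Tehran paper. Several phonemes that look identical on the lips collapse into a single viseme class, which is exactly why additional text data and language modeling help disambiguate.

```python
# Toy lexicon: word -> phoneme sequence (ARPAbet-style); a real system would use
# a trained grapheme-to-phoneme converter rather than a hand-written dictionary.
G2P = {
    "bat": ["B", "AE", "T"],
    "mat": ["M", "AE", "T"],
    "pat": ["P", "AE", "T"],
}

# Many-to-one phoneme -> viseme grouping: bilabial consonants (/b/, /m/, /p/)
# are visually indistinguishable on the lips, so they share one viseme class.
PHONEME_TO_VISEME = {
    "B": "V_bilabial", "M": "V_bilabial", "P": "V_bilabial",
    "AE": "V_open_vowel",
    "T": "V_alveolar", "D": "V_alveolar", "N": "V_alveolar",
}

def word_to_visemes(word):
    """Map a word to its viseme sequence via the toy g2p lexicon."""
    return [PHONEME_TO_VISEME[p] for p in G2P[word]]

for w in ("bat", "mat", "pat"):
    print(w, word_to_visemes(w))
# All three words yield the same viseme sequence, so lip video alone cannot
# separate them; text/language-model context must supply the missing information.
```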

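For reference, the word error rate figures quoted above come from the standard WER metric: the word-level edit distance (substitutions, insertions and deletions) between the hypothesis and reference transcripts, divided by the number of reference words. A minimal implementation might look like this:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # ~0.333
```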