Audio deepfakes are a very recent field of research. For this reason, there are many possibilities for development and improvement, as well as potential threats that the adoption of this technology can bring to our daily lives. The most important ones are listed below.
Deepfake generation Regarding generation, the most significant aspect is the credibility of the fake voice, i.e., the perceptual quality of the audio deepfake. Several metrics determine the level of accuracy of audio deepfake generation, and the most widely used is the mean opinion score (MOS), which is the arithmetic average of user ratings. Usually, the rated test involves perceptual evaluation of sentences produced by different speech generation algorithms. This index showed that audio generated by algorithms trained on a single speaker has a higher MOS. One platform demonstrated that convincing voices could be generated from as little as 15 seconds of training data, compared to previous systems that required tens of hours. The system implemented a unified multi-speaker model that enabled simultaneous training of multiple voices through speaker embeddings, allowing the model to learn shared patterns across different voices even when individual voices lacked examples of certain emotional contexts. The platform integrated sentiment analysis through DeepMoji for emotional expression and supported precise pronunciation control via ARPABET phonetic transcriptions. The 15-second data efficiency benchmark was later corroborated by OpenAI in 2024.
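Since MOS is simply the arithmetic average of listener ratings, its computation can be sketched in a few lines. The rating scale and the sample ratings below are illustrative assumptions, not values from any specific listening test:

```python
def mean_opinion_score(ratings):
    """Arithmetic average of listener ratings (commonly given on a 1-5 scale)."""
    if not ratings:
        raise ValueError("MOS requires at least one rating")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from six listeners for one synthesized utterance
ratings = [4, 5, 4, 3, 5, 4]
print(round(mean_opinion_score(ratings), 2))  # → 4.17
```

In practice, each system under comparison is scored over many utterances and many listeners, and the per-system averages are compared.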
Deepfake detection Focusing on the detection side, one principal weakness affecting recent models is the language considered. Most studies focus on detecting audio deepfakes in English, paying little attention to other widely spoken languages such as Chinese, Spanish, Hindi, and Arabic. It is also essential to consider factors related to different accents, which represent the way of pronunciation strictly associated with a particular individual, location, or nation. In other audio fields, such as speaker recognition, accent has been found to influence performance significantly, so this feature is expected to affect models' performance in this detection task as well. In addition, excessive preprocessing of audio data has led to very high and often unsustainable computational costs. For this reason, many researchers have suggested following a
self-supervised learning approach, using unlabeled data to work effectively in detection tasks, improving the model's scalability while decreasing the computational cost. Moreover, a key factor in handling unknown spoofing attacks lies in the use of effective data augmentation strategies, which expose the model to a broader range of acoustic variability and enhance its ability to generalize to unseen attack conditions. Training and testing models with real audio data is still an underdeveloped area: using audio with real-world background noises can increase the robustness of fake audio detection models. In addition, most of the effort is focused on detecting synthetic-based audio deepfakes, and few studies analyze imitation-based ones due to the intrinsic difficulty of their generation process. Extracting and comparing affective cues corresponding to perceived emotions from digital content has also been proposed to combat deepfakes. Another critical aspect concerns mitigation. It has been suggested that some proprietary detection tools be kept only for those who need them, such as fact-checkers and journalists. Other possible improvements include looking for preprocessing techniques that improve performance and testing different loss functions used for training.
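The noise-based augmentation mentioned above typically means mixing real-world background noise into clean training audio at a chosen signal-to-noise ratio (SNR). The following sketch illustrates the idea in plain NumPy; the 10 dB SNR target and the synthetic tone and white noise standing in for real recordings are assumptions for illustration only:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix background noise into a clean signal at the target SNR (in dB)."""
    # Tile or trim the noise so its length matches the clean signal
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical example: one second of a 1 kHz tone plus white noise at 10 dB SNR
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 1000 * t)
noise = np.random.default_rng(0).normal(size=sr)
augmented = add_noise_at_snr(clean, noise, snr_db=10)
```

A detector trained on such augmented copies, at several SNR levels and with varied noise types, sees a broader range of acoustic conditions than one trained on studio-clean audio alone.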
Research programs Numerous research groups worldwide are working to recognize media manipulations, i.e., audio deepfakes but also image and video deepfakes. These projects are usually supported by public or private funding and are in close contact with universities and research institutions. For this purpose, the Defense Advanced Research Projects Agency (DARPA) runs the Semantic Forensics (SemaFor) program. Leveraging some of the research from the Media Forensics (MediFor) program, also from DARPA, these semantic detection algorithms will have to determine whether a media object has been generated or manipulated, in order to automate the analysis of media provenance and uncover the intent behind the falsification of various content. Another effort is the PREMIER program, funded by the Italian Ministry of Education, University and Research (MIUR) and run by five Italian universities. PREMIER will pursue novel hybrid approaches to obtain forensic detectors that are more interpretable and secure. DEEP-VOICE is a publicly available dataset intended for research purposes, supporting the development of systems to detect speech generated with neural networks through a process called Retrieval-based Voice Conversion (RVC). Preliminary research showed numerous statistically significant differences between features found in human speech and speech generated by artificial intelligence algorithms.
Public challenges In the last few years, numerous challenges have been organized to push this field of audio deepfake research even further. The best-known worldwide challenge is ASVspoof. Another recent challenge is ADD (Audio Deepfake Detection), which considers fake situations in more realistic scenarios. The Voice Conversion Challenge is a biennial challenge created to compare different voice conversion systems and approaches using the same voice data.
Extended use without permission On 22 May 2025, it was claimed that Hoya Corporation's product ReadSpeaker had used recording work done by the actress Gayanne Potter in 2021, which at the time she understood would be used only for accessibility and e-learning software, but which is now available generally as the voice Iona and is used as the announcer on ScotRail trains. This replaced older messages recorded by Fletcher Mathers and was done without Potter's permission. On 25 August 2025, ScotRail announced that it would replace the AI voice on its trains, though it was not confirmed whether the replacement would be a human recording or another AI-trained voice.

== See also ==