University of Florida researchers recently concluded the largest study on audio deepfakes to date, challenging 1,200 humans to distinguish real audio messages from digital fakes.
Participants achieved a 73% accuracy rate but were often fooled by machine-generated details, such as British accents and background noise.
“We found humans weren’t perfect, but they were better than random guessing. They had some intuition behind their responses. That’s why we wanted to go into the deep dive — why are they making those decisions and what are they keying in on,” said co-lead author Kevin Warren, a Ph.D. student in UF’s Department of Computer & Information Science and Engineering.
The study analyzed how well humans classify deepfake samples, why they make their classification decisions, and how their performance compares to that of machine learning detectors, noted the authors of the UF paper, “Better Be Computer or I’m Dumb: A Large-Scale Evaluation of Humans as Audio Deepfake Detectors.”
The results ultimately could help develop more effective training and detection models to curb phone scams, misinformation and political interference.
The study’s lead investigator, UF professor and renowned deepfake expert Patrick Traynor, Ph.D., has been ear-deep in deepfake research for years, particularly as the technology grew more sophisticated and dangerous.
In January, Traynor was one of 12 experts and industry leaders invited to the White House to discuss detection tools and solutions. The meeting was called after audio messages buzzed phones in New Hampshire earlier that month with a fake President Joe Biden voice discouraging voting in the primary election.
Audio deepfakes use artificial intelligence to create recordings that mimic the sound, tones and inflections of specific people. They, like video deepfakes, have become powerful tools for scammers and political disruptors.
“Audio deepfakes are a growing concern not just within the security community, but the broader world,” noted the paper, which was published earlier this year and won a Distinguished Paper award at the Association for Computing Machinery’s Conference on Computer and Communications Security.
Funded by the Office of Naval Research and the National Science Foundation, the study had participants listen to 20 samples each from three commonly used deepfake datasets. Their answers were compared with those of machine-learning deepfake detectors that analyzed the same samples.
When people misidentified audio samples as human voices, it was often because they underestimated technology’s ability to mimic details, such as accents and background noise.
“I do not believe I’ve ever heard a computer-generated voice with a proper English accent,” one participant noted.
Other red flags from participants:
- “People do not say ‘on November twenty-two.’”
- “The pausing was very jerky and unnatural.”
- “The background noise felt like a static computer noise.”
Arguments for human samples included:
- “I clearly hear laughing in the background, so that tells me this is being recorded live.”
- “I could hear breathing and that made it sound human.”
- “The speech is very enthusiastic; emotion is more of a human trait.”
Participants were not schooled in deepfakes before the study. They came in with only their instincts.
“The bias we found was humans, when they are uncertain, want to lean toward audio being real because that is what they are used to hearing,” Warren said. “While the machine learning models want to lean more toward deepfakes because that is what they have heard a lot. On their default settings, they’re looking at them in different ways.”
Globally, deepfake fraud increased by more than 10 times from 2022 to 2023, according to Sumsub, an identity verification service. At least 500,000 video and audio deepfakes were shared on social media in 2023, according to Deep Media, a media intelligence company whose customers include the U.S. Department of Defense.
“Ultimately, we have to ask, ‘What are we trying to get these [deepfake-detection] systems to do in order to help people?’ One of our takeaways is that we imagine some future system that is a trained human and a trained machine, but what we see now is that the features these two parties key in on are different and not necessarily complementary,” Traynor said.
Long-term success hinges on building detection models that recognize that human biases will slowly change.
“We’re really big on getting this out of the lab,” Traynor said. “Ideally, this would help folks in call centers, help folks when the bank calls. It will also help them when they are looking at social media and there is an audio clip of a politician.”
The name of the paper, incidentally, was pulled directly from a participant’s response when that person was sure they had found an audio deepfake: “Better Be Computer, or I am Dumb.”
“Unfortunately, they were wrong,” Traynor said, laughing. “It was a human being.”