Unmasking the Digital Deception: A Comprehensive Survey of Large Vision Models (LVMs) for Deepfake Detection

Ashraf, Ahmed; Khoriba, Ghada; Ghoneim, Amr

doi:10.21608/fcihib.2025.318470.1121

Document Type : Original Article

Authors

¹ Department of Computer Science, Faculty of Computers and Artificial Intelligence, Helwan University, Cairo, Egypt

² Centre for Informatics Science (CIS), School of Information Technology and Computer Science (ITCS), Nile University, Giza, Egypt

³ Department of Computer Science, Faculty of Computers and Artificial Intelligence, Helwan University, Cairo, Egypt

Abstract

Digital videos are among the most prevalent types of multimedia in everyday life and are shared extensively over the Internet on social media channels such as Facebook, Instagram, WhatsApp, and YouTube. Rapid advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have produced sophisticated techniques and tools for multimedia manipulation, which in turn have made falsified digital images and videos easier to create. Consequently, detecting manipulated digital media has become a critical concern, necessitating a thorough examination of current forgery detection methodologies. This survey categorizes these methodologies across three domains: visual, audio, and multimodal audio-visual. It broadly examines deepfake detection strategies, with particular emphasis on recent deep learning techniques, specifically large vision models (LVMs). It includes an in-depth comparative analysis of deep learning approaches, focusing on LVMs, and demonstrates their superior performance relative to earlier techniques; multiple metrics and datasets support this analysis. Additionally, the survey offers new solutions and guides future research in multimodal deepfake detection by exploring new dimensions of video manipulation, such as text overlays and motion dynamics. It also highlights the growing importance of expanding the role of LVMs and underscores the need for comprehensive and diverse datasets to enhance the robustness and validation of detection techniques.

Keywords