A deep learning lip-reading system that analyses lip movement in video frames and infers spoken content from silent footage — no audio required.
Developed as part of my undergraduate research, this system detects and crops lip regions from video using face detection, then passes the sequence of frames through a CNN-LSTM architecture to classify the spoken words or characters. The goal was to explore whether reliable lip reading was achievable with a moderate-sized dataset and standard academic compute.
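To make the pipeline concrete, here is a minimal sketch of the two stages described above: cropping a lip region from each frame via face detection, then running the frame sequence through a CNN-LSTM classifier. It uses OpenCV's bundled Haar-cascade face detector and a lower-third-of-face crop heuristic as stand-ins for the thesis's actual detection and cropping steps, and the layer sizes, `NUM_CLASSES`, and file names are illustrative assumptions, not the thesis configuration.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

# Illustrative constants; the thesis used its own values.
LIP_SIZE = (64, 64)   # lip crops are resized to this
NUM_CLASSES = 10      # vocabulary size (assumption)

def extract_lip_frames(video_path, max_frames=75):
    """Detect the face in each frame and crop the lower third of
    the face box as a rough lip region (a stand-in for the exact
    cropping scheme used in the project)."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops = []
    while len(crops) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue  # skip frames with no detected face
        x, y, w, h = faces[0]
        lip = gray[y + 2 * h // 3 : y + h, x : x + w]  # lower third of face
        crops.append(cv2.resize(lip, LIP_SIZE))
    cap.release()
    # Stack to (T, 1, H, W) and normalise pixel values to [0, 1]
    return torch.tensor(np.stack(crops), dtype=torch.float32).unsqueeze(1) / 255.0

class LipReader(nn.Module):
    """Per-frame CNN encoder followed by an LSTM over time."""
    def __init__(self, num_classes=NUM_CLASSES, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 512), nn.ReLU(),
        )
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (B, T, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # encode each frame
        out, _ = self.lstm(feats)                         # model temporal dynamics
        return self.head(out[:, -1])                      # classify from last step

# Usage (file name hypothetical):
# frames = extract_lip_frames("sample.mp4")         # (T, 1, 64, 64)
# logits = LipReader()(frames.unsqueeze(0))         # add batch dim -> (1, NUM_CLASSES)
```

Classifying from the final LSTM state suits whole-word labels; a character-level variant would instead decode a prediction per time step.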
This project was completed as my undergraduate thesis. Architecture diagrams, evaluation metrics, and a link to the code repository will be added to this page shortly.