Lip Reading CNN — Hibiki Shimizu Works

Full write-up coming soon. Detailed documentation for this project is currently being prepared.

Overview

Developed as part of my undergraduate research, this system uses face detection to locate and crop the lip region in each video frame, then passes the resulting frame sequence through a CNN-LSTM architecture to classify the spoken words or characters. The goal was to explore whether reliable lip reading is achievable with a moderately sized dataset and standard academic compute.
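
As a rough illustration of the cropping step, the sketch below computes a square crop box around a set of lip landmark points. It is not the project's actual code: the function name lip_crop_box, the margin parameter and its default of 0.25 are assumptions made for this example, and the landmarks are assumed to arrive as (x, y) pixel coordinates from whichever face detector is in use.

```python
from typing import Iterable, Tuple


def lip_crop_box(
    landmarks: Iterable[Tuple[float, float]],
    frame_width: int,
    frame_height: int,
    margin: float = 0.25,  # hypothetical default: extra padding as a fraction of lip size
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a padded box around lip landmarks."""
    points = list(landmarks)
    if not points:
        raise ValueError("at least one landmark is required")

    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    center_x = (min(xs) + max(xs)) / 2
    center_y = (min(ys) + max(ys)) / 2

    # Use the larger extent so the box is square, then pad by `margin` on each side.
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + 2 * margin)
    half = side / 2

    # Clamp to the frame; near the border the box may end up non-square.
    left = max(0, int(round(center_x - half)))
    top = max(0, int(round(center_y - half)))
    right = min(frame_width, int(round(center_x + half)))
    bottom = min(frame_height, int(round(center_y + half)))
    return left, top, right, bottom


# Example: four landmark points on a 640x480 frame.
box = lip_crop_box([(100, 200), (140, 200), (120, 190), (120, 210)], 640, 480)
print(box)  # (90, 170, 150, 230)
```

Because OpenCV frames are NumPy arrays indexed as [row, column], the crop itself would be frame[top:bottom, left:right], typically followed by a resize to a fixed input size.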

Key Technologies

Frame extraction: OpenCV for video-to-frame conversion; dlib / MediaPipe for face detection and lip-region cropping
Feature extraction: ResNet-based CNN backbone that produces spatial features for each frame
Sequence modelling: LSTM / BiLSTM to predict characters or words from the frame sequence
Training data: GRID Corpus and LRS2 public datasets
Stack: Python / PyTorch / OpenCV
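
To show how the components listed above fit together, here is a minimal PyTorch sketch of a CNN-LSTM classifier. It is an illustration under assumptions, not the thesis model: the class names ResidualBlock and LipReadingNet, the layer sizes, the grayscale input and the word-level output (averaging the BiLSTM outputs over time) are all choices made for this example. The small hand-written residual backbone stands in for the ResNet-based backbone mentioned above.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Basic two-convolution residual block (stand-in for a ResNet stage)."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()
        # 1x1 projection so the skip path matches the main path's shape.
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = x if self.shortcut is None else self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


class LipReadingNet(nn.Module):
    """Per-frame CNN features followed by a BiLSTM and a linear classifier."""

    def __init__(self, num_classes: int, feature_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2, bias=False),  # assumes grayscale crops
            nn.BatchNorm2d(32),
            nn.ReLU(),
            ResidualBlock(32, 32),
            ResidualBlock(32, feature_dim, stride=2),
            nn.AdaptiveAvgPool2d(1),  # one feature vector per frame
        )
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        batch, time, channels, height, width = frames.shape
        x = frames.reshape(batch * time, channels, height, width)
        features = self.backbone(x).flatten(1)        # (batch * time, feature_dim)
        features = features.reshape(batch, time, -1)  # (batch, time, feature_dim)
        outputs, _ = self.lstm(features)              # (batch, time, 2 * hidden_dim)
        pooled = outputs.mean(dim=1)                  # average over time: one vector per clip
        return self.classifier(pooled)                # (batch, num_classes)


# Example: 2 clips of 5 grayscale 32x64 lip crops, 10 hypothetical word classes.
model = LipReadingNet(num_classes=10)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(2, 5, 1, 32, 64))
print(logits.shape)  # torch.Size([2, 10])
```

Averaging over time gives one prediction per clip, which suits word-level classification. Character-level output would instead keep the per-frame BiLSTM outputs and train with a sequence loss such as CTC, which is not shown here.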

Status

Completed as undergraduate thesis research. Architecture diagrams, evaluation metrics and a link to the code repository will be added to this page shortly.