Final project for CPSC 185.
Fine-tuning a Qwen2.5-Omni-3B model to predict typed text from overhead video (with audio) of typing on a keyboard.
- The Conda environment is given at `environment.yml`. Create the environment with `conda env create -f environment.yml`.
- `0training/collect_gui.py` is a GUI Python program that prompts sentences to type and records video and keystroke data.
- Our keyboard video dataset, with full videos and keystroke timing data, is available at this HuggingFace dataset.
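As a minimal sketch of how the keystroke timing data can be used alongside the video, the snippet below maps keystroke events to the video frame in which they occur. The event tuple layout, timestamps in seconds, and frame rate are assumptions for illustration; the actual dataset schema may differ.

```python
# Sketch: align keystroke timing data with video frames.
# Assumes events are (key, timestamp_seconds) pairs measured from the
# start of the recording, and that the video has a fixed frame rate.

def keystroke_frames(events, fps=30.0):
    """Map (key, timestamp_s) events to the frame index they fall in."""
    return [(key, int(t * fps)) for key, t in events]

events = [("T", 0.10), ("h", 0.25), ("e", 0.42)]
print(keystroke_frames(events, fps=30.0))
# → [('T', 3), ('h', 7), ('e', 12)]
```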
## Sample Data

Text: The source said if approved, the authority would allow a transaction to be carried out.

Video: `788.mp4`
- Install `llamafactory` via `pip install -e ".[torch,metrics]"` in the `1training/LLaMA-Factory` directory.
- Use `1training/0train.ipynb` to generate the augmented dataset and ensure that it's at `1training/LLaMA-Factory/data/keyboard_videos`. Look at the relative paths in `keyboard.json` to understand the directory structure for the `.mp4` and `.wav` files.
- Run training via `1training/train.sh`, which uses the configuration at `1training/train_keyboard.yml`.
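For orientation, entries in `keyboard.json` most likely follow LLaMA-Factory's sharegpt-style multimodal format, with media paths given relative to the data directory. The sketch below is an assumption, not the actual file contents — the prompt text, filenames, and relative paths shown here are illustrative, so check the generated `keyboard.json` itself:

```json
[
  {
    "messages": [
      {"role": "user", "content": "<video><audio>Transcribe the text being typed."},
      {"role": "assistant", "content": "The source said if approved, the authority would allow a transaction to be carried out."}
    ],
    "videos": ["keyboard_videos/788.mp4"],
    "audios": ["keyboard_videos/788.wav"]
  }
]
```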