Dynamic 3D Scene Reconstruction from Classroom Videos

Published in IEEE Signal Processing Society 58th Asilomar Conference on Signals, Systems, and Computers, 2024

This paper presents a system for estimating 3D speaker geometry from raw images of collaborative classroom videos. The system integrates 2D and 3D pose estimation with depth estimation and camera calibration to detect and reconstruct the 3D speaker geometry of a collaborative group of students. On the Human3.6M dataset, the system estimates 3D poses reasonably well without being pre-trained on that dataset. On classroom videos, it outperforms a baseline approach trained on Human3.6M. The resulting 3D speaker geometry is supplied to a new speaker diarization system that performs well in noisy classroom environments.
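
To illustrate how 2D pose keypoints, per-keypoint depth estimates, and camera calibration can be combined into 3D positions, here is a minimal sketch of standard pinhole-camera back-projection. It is not taken from the paper: the function name `backproject_keypoints`, the intrinsic parameters `fx`, `fy`, `cx`, `cy`, and the sample values are assumptions chosen for illustration, and the actual system may combine these components differently.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def backproject_keypoints(
    keypoints_2d: Sequence[Point2D],
    depths: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift 2D pixel keypoints to 3D camera coordinates.

    Hypothetical helper (not from the paper). Assumes an ideal pinhole
    camera with focal lengths (fx, fy) and principal point (cx, cy) in
    pixels, no lens distortion, and one metric depth value per keypoint.
    """
    if len(keypoints_2d) != len(depths):
        raise ValueError("need exactly one depth value per keypoint")

    points_3d: List[Point3D] = []
    for (u, v), z in zip(keypoints_2d, depths):
        # Invert the pinhole projection u = fx * X / Z + cx, v = fy * Y / Z + cy.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return points_3d


# Illustrative values only: two keypoints in a 640x480 image.
example_points = backproject_keypoints(
    keypoints_2d=[(320.0, 240.0), (420.0, 240.0)],
    depths=[2.0, 2.0],
    fx=500.0,
    fy=500.0,
    cx=320.0,
    cy=240.0,
)
```

In this sketch, a keypoint at the principal point maps onto the optical axis, and a keypoint offset horizontally by 100 pixels at the same depth maps to a point shifted along the camera's x-axis. The resulting points are in the camera frame; placing several speakers in a common scene frame would additionally require the camera's extrinsic parameters.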