Publication
Automatically understanding and facilitating effective group collaboration remains a core challenge across social science and computational research. While prior work has focused on fine-grained social cues or coarse behavioral patterns, it is also critical to understand the intermediate structure of dialogue, that is, how sequences of utterances (discussion segments) reflect evolving group knowledge. This paper introduces a novel discussion segmentation framework and taxonomy for modeling collaborative problem-solving (CPS) processes, classifying segments into categories such as “task progress”, “task attempt”, and “grounding”. Using this taxonomy, we collected and annotated over 1,700 multimodal discussion segments from 21 group discussions, held both in person and online. We further propose a baseline model that integrates audio, visual, and textual signals to classify discussion segments, achieving an average F1 score of 69.3%. Notably, this lightweight expert model achieves performance comparable to, and sometimes exceeding, that of proprietary state-of-the-art multimodal large language models. These findings highlight the promise of sequence-level discourse analysis for automated facilitation and human-agent collaboration.
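
The abstract reports an average F1 score across segment categories. The following is a minimal sketch of how such a per-class average could be computed, assuming the average is an unweighted (macro) mean over classes, which the abstract does not specify. The function name `macro_f1` and the toy labels and predictions are hypothetical and are not data from the paper.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over classes seen in gold or pred.

    Assumption: 'average F1' means macro averaging; the source does not say.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example (made-up labels using category names from the taxonomy)
gold = ["task progress", "task attempt", "grounding",
        "grounding", "task progress", "task attempt"]
pred = ["task progress", "grounding", "grounding",
        "grounding", "task attempt", "task attempt"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

Macro averaging weights every category equally regardless of how many segments it contains; a support-weighted average would be an equally plausible reading of "average F1" and can give a different number on imbalanced data.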



