Can we use automated approaches to measure the quality of online political discussion? How to (not) measure interactivity, diversity, rationality, and incivility in online comments to the news

Sjoerd B. Stolwijk, Mark Boukes, Wang Ngai Yeung, Yufang Liao, Simon Münker, Anne C. Kroon, Damian Trilling
Communication Methods and Measures
1–25
September 12, 2025
This article explores the (in)ability of automated tools to measure the deliberative quality of online user comments against the standards set out by Habermas: interactivity, diversity, rationality, and (in)civility. Using a stratified sample of manually coded comments (n = 3,862) responding to news videos on YouTube and Twitter, we examined the performance of rule-based measures (i.e., dictionaries), machine-learning classifiers (conventional and transformer-based), and measurements by generative AI (Llama 3.1, GPT-4o, GPT-4T). We present results for over 50 metrics side by side to judge the opportunity costs of choosing one method over another. The results revealed strong variation across the different groups of models. Overall, our expectation that the more modern methods (transformers and generative AI) would outperform the older, simpler ones was confirmed. However, the absolute differences between these model groups depended strongly on the measured concept, and we observed considerable variance in performance among models within the same group. We provide recommendations for future research that balance ease of use with the performance of automated measurements, along with important cautions to consider.
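
To make two of the compared approach families concrete, the minimal sketch below contrasts a rule-based (dictionary) incivility measure with a conventional machine-learning classifier on invented toy data. The word list, example comments, and labels are hypothetical illustrations, not the authors' materials or code; they merely show the general shape of such measurement pipelines.

```python
# Illustrative sketch (not the paper's code): two approach families applied
# to a toy incivility-detection task. Dictionary, comments, and labels are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# --- Rule-based (dictionary) measure: count matches against a word list ---
INCIVILITY_TERMS = {"idiot", "stupid", "shut up", "moron"}  # hypothetical dictionary

def dictionary_score(comment: str) -> int:
    """Return the number of incivility terms found in a comment."""
    text = comment.lower()
    return sum(term in text for term in INCIVILITY_TERMS)

# --- Conventional machine-learning classifier: TF-IDF features + logistic regression ---
train_comments = [
    "You are an idiot, shut up.",
    "Thanks for sharing, interesting point.",
    "What a stupid take.",
    "I disagree, but I see your argument.",
]
train_labels = [1, 0, 1, 0]  # 1 = uncivil, 0 = civil (toy annotations)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_comments, train_labels)

new_comment = "Only a moron would believe this."
print("dictionary score:", dictionary_score(new_comment))   # rule-based output
print("classifier label:", clf.predict([new_comment])[0])   # learned-model output
```

In practice, the transformer-based and generative-AI measurements compared in the article would replace the classifier step with a fine-tuned language model or a prompted LLM, which is where the performance differences reported here arise.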