Seven inductively derived human themes (Fire Progression, Evacuation, Smoke/Air Quality, Operational Coordination, Community Solidarity, Geospatial Mapping, Public Accountability) are used as the reference standard. LLM theme assignments are semantically mapped to these human categories, and accuracy is computed across the 1,279 community (pers) tweets.
Distribution of the 1,279 community tweets across the seven inductively derived human themes, assigned via keyword-rule coding. Fire Progression dominates (n = 766, 59.9%), reflecting the high volume of status-update tweets. This distribution provides the reference standard for all LLM accuracy calculations below.
Accuracy is computed by semantically mapping each LLM's assigned theme to its closest human inductive theme using a predefined conceptual correspondence table, then comparing the mapped theme against the keyword-coded human label. Claude outperforms ChatGPT overall, though the two are scored on different sets: Claude on all 1,279 tweets and ChatGPT on 880 matched tweets.
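The map-then-compare step described above can be sketched as follows. This is a minimal illustration, not the study's code: the correspondence table is not reproduced in this section, so the AI theme names in `AI_TO_HUMAN`, the function `mapped_accuracy`, and the toy tweets are all hypothetical. The sketch also assumes that an AI theme missing from the table counts as incorrect.

```python
# Hypothetical correspondence table: AI theme label -> closest human theme.
# The real table is not shown in this section; these entries are placeholders.
AI_TO_HUMAN = {
    "Fire Spread Updates": "Fire Progression",
    "Evacuation Orders": "Evacuation",
    "Air Quality Concerns": "Smoke/Air Quality",
}


def mapped_accuracy(records, table):
    """Share of tweets whose mapped AI theme equals the human label.

    records: iterable of (human_theme, ai_theme) pairs, one per tweet.
    AI themes missing from the table are treated as incorrect (an assumption).
    """
    records = list(records)
    if not records:
        raise ValueError("no tweets to score")
    correct = sum(1 for human, ai in records if table.get(ai) == human)
    return correct / len(records)


# Toy data, invented for illustration only.
toy = [
    ("Fire Progression", "Fire Spread Updates"),
    ("Evacuation", "Evacuation Orders"),
    ("Fire Progression", "Air Quality Concerns"),
    ("Smoke/Air Quality", "Air Quality Concerns"),
]
toy_accuracy = mapped_accuracy(toy, AI_TO_HUMAN)
```

The denominator is the number of tweets actually scored, so a system evaluated on 880 tweets and one evaluated on 1,279 tweets are each divided by their own count.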
Each Sankey diagram shows how tweets assigned a given human theme (left) were reclassified by the AI system (right). Flows to the semantically corresponding AI theme indicate correct captures; flows to any other AI theme indicate misclassification or thematic splitting.
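The flows behind such a diagram are counts of (human theme, AI theme) pairs. The sketch below shows one way to tabulate them and flag which flows land on the corresponding AI theme. The function `sankey_flows`, the table entries, and the toy data are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical AI theme -> human theme correspondences (placeholders).
TABLE = {
    "Evacuation Orders": "Evacuation",
    "Shelter Requests": "Community Solidarity",
}


def sankey_flows(records, table):
    """Return (human_theme, ai_theme, count, is_match) tuples, one per flow."""
    counts = Counter(records)
    flows = []
    for (human, ai), n in sorted(counts.items()):
        # A flow is a correct capture when the AI theme maps back to the
        # same human theme it started from.
        flows.append((human, ai, n, table.get(ai) == human))
    return flows


# Toy data, invented for illustration only.
toy = [("Evacuation", "Evacuation Orders")] * 3 + [("Evacuation", "Shelter Requests")]
flows = sankey_flows(toy, TABLE)
```

Each tuple corresponds to one ribbon: the count sets its width and the flag separates correct captures from misclassified or split flows.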
Accuracy broken down by human theme reveals where each LLM succeeds and fails. Both systems excel on Evacuation and Smoke/Air Quality; ChatGPT scores near zero on Geospatial Mapping and Operational Coordination; Claude scores lowest on Fire Progression and Community Solidarity.
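A per-theme breakdown of this kind groups tweets by their human label and computes accuracy within each group. The following is a sketch under stated assumptions: `per_theme_accuracy`, the table entries, and the toy records are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict

# Hypothetical AI theme -> human theme correspondences (placeholders).
TABLE = {
    "Evacuation Orders": "Evacuation",
    "Map Links": "Geospatial Mapping",
}


def per_theme_accuracy(records, table):
    """Accuracy within each human theme.

    records: iterable of (human_theme, ai_theme) pairs, one per tweet.
    Returns {human_theme: fraction of its tweets mapped back correctly}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for human, ai in records:
        totals[human] += 1
        if table.get(ai) == human:
            hits[human] += 1
    return {theme: hits[theme] / totals[theme] for theme in totals}


# Toy data, invented for illustration only.
toy = [
    ("Evacuation", "Evacuation Orders"),
    ("Evacuation", "Evacuation Orders"),
    ("Geospatial Mapping", "Evacuation Orders"),
    ("Geospatial Mapping", "Map Links"),
]
by_theme = per_theme_accuracy(toy, TABLE)
```

Overall accuracy is the tweet-weighted average of these per-theme values, so in the full 1,279-tweet set Fire Progression (59.9% of the reference labels) carries the most weight in the overall figure.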
Heatmaps showing how human themes (rows) map to AI themes (columns) based on actual tweet flows. Toggle between classifiers.
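The matrix underlying such a heatmap is a cross-tabulation of human themes against AI themes. The sketch below builds one and normalizes each row to proportions. Row normalization is an assumption made here for readability (the section does not say whether cells show counts or shares), and `flow_matrix` and the toy data are hypothetical.

```python
from collections import Counter


def flow_matrix(records):
    """Cross-tabulate human themes (rows) against AI themes (columns).

    Returns (row_labels, col_labels, matrix), where each matrix row holds
    the share of that human theme's tweets sent to each AI theme.
    Row normalization is an illustrative choice, not the source's stated method.
    """
    records = list(records)
    counts = Counter(records)
    rows = sorted({human for human, _ in records})
    cols = sorted({ai for _, ai in records})
    matrix = []
    for human in rows:
        row_total = sum(counts[(human, ai)] for ai in cols)
        matrix.append([counts[(human, ai)] / row_total for ai in cols])
    return rows, cols, matrix


# Toy data, invented for illustration only.
toy = [
    ("Evacuation", "Evacuation Orders"),
    ("Evacuation", "Evacuation Orders"),
    ("Evacuation", "Road Closures"),
    ("Fire Progression", "Road Closures"),
]
rows, cols, matrix = flow_matrix(toy)
```

With row normalization, a theme with few tweets is as legible as Fire Progression, at the cost of hiding absolute volumes.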