Human Themes as Ground Truth

Seven inductively derived human themes (Fire Progression, Evacuation, Smoke/Air Quality, Operational Coordination, Community Solidarity, Geospatial Mapping, Public Accountability) serve as the reference standard. Each LLM theme assignment is semantically mapped to one of these human categories, and accuracy is computed over the community (pers) tweets: all 1,279 for Claude and the 880 matched tweets for ChatGPT.

Ground Truth: Human Inductive Theme Distribution

Distribution of the 1,279 community tweets across the seven inductively derived human themes, assigned via keyword-rule coding. Fire Progression dominates (n = 766, 59.9%), reflecting the high volume of status-update tweets. This distribution provides the reference standard for all LLM accuracy calculations below.
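The share reported for each theme is its count divided by the total number of coded tweets. The sketch below shows that arithmetic; the function name and the combined "other" bucket are illustrative assumptions, and only the Fire Progression count (766 of 1,279) is taken from the text above.

```python
from collections import Counter


def theme_distribution(labels):
    """Return {theme: (count, percent)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        theme: (n, round(100 * n / total, 1))
        for theme, n in counts.most_common()
    }


# Placeholder labels: the six remaining themes are lumped together here
# because their individual counts are not given in the text.
labels = ["Fire Progression"] * 766 + ["Other themes (combined)"] * 513
dist = theme_distribution(labels)
print(dist["Fire Progression"])  # (766, 59.9)
```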

Overall LLM Accuracy Against Human Themes

Accuracy is computed by semantically mapping each LLM's assigned theme to its closest human inductive theme using a predefined conceptual correspondence table, then comparing the mapped theme against the keyword-coded human label. Claude (n = 1,279 tweets) outperforms ChatGPT (n = 880 matched tweets) overall, although the two systems are scored on samples of different size rather than on an identical set of tweets.
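The map-then-compare step can be sketched as follows. This is a minimal illustration, not the correspondence table actually used: the LLM theme names in LLM_TO_HUMAN are hypothetical, and treating an unmapped LLM theme as an error is an assumption.

```python
# Hypothetical correspondence table: LLM theme label -> human inductive theme.
LLM_TO_HUMAN = {
    "Wildfire Spread Updates": "Fire Progression",
    "Evacuation Orders": "Evacuation",
    "Air Quality Concerns": "Smoke/Air Quality",
}


def mapped_accuracy(pairs, table):
    """Share of tweets whose mapped LLM theme equals the human theme.

    pairs: iterable of (human_theme, llm_theme).
    LLM themes missing from the table are counted as errors (assumption).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no tweets to score")
    correct = sum(1 for human, llm in pairs if table.get(llm) == human)
    return correct / len(pairs)


toy_pairs = [
    ("Fire Progression", "Wildfire Spread Updates"),   # correct
    ("Evacuation", "Evacuation Orders"),               # correct
    ("Smoke/Air Quality", "Wildfire Spread Updates"),  # maps to the wrong theme
    ("Fire Progression", "Uncategorized"),             # not in the table
]
print(mapped_accuracy(toy_pairs, LLM_TO_HUMAN))  # 0.5
```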

Sankey Diagrams: Human Themes → AI Theme Assignments

Each Sankey shows how tweets assigned a given human theme (left) were re-classified by the AI system (right). Flows to the semantically corresponding AI theme indicate correct captures; flows to any other AI theme indicate misclassification or thematic splitting.

ChatGPT (n = 880)

Claude (n = 1,279)

Per-Theme Accuracy: ChatGPT vs Claude

Accuracy broken down by human theme reveals where each LLM succeeds and fails. Both systems excel on Evacuation and Smoke/Air Quality; ChatGPT scores near zero on Geospatial Mapping and Operational Coordination; Claude scores lowest on Fire Progression and Community Solidarity.
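A per-theme breakdown of this kind amounts to grouping tweets by their human theme and computing the share of correct mapped assignments within each group. The sketch below assumes the LLM theme has already been mapped to a human theme; the function name and toy records are illustrative only.

```python
from collections import defaultdict


def per_theme_accuracy(records):
    """records: iterable of (human_theme, mapped_llm_theme).

    Returns {human_theme: share of that theme's tweets assigned correctly}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for human, mapped in records:
        totals[human] += 1
        hits[human] += int(human == mapped)
    return {theme: hits[theme] / totals[theme] for theme in totals}


toy_records = [
    ("Evacuation", "Evacuation"),
    ("Evacuation", "Evacuation"),
    ("Geospatial Mapping", "Fire Progression"),
    ("Geospatial Mapping", "Geospatial Mapping"),
]
print(per_theme_accuracy(toy_records))
# {'Evacuation': 1.0, 'Geospatial Mapping': 0.5}
```

Because overall accuracy weights each theme by its tweet count, a large theme such as Fire Progression (59.9% of the community tweets) influences the overall figure far more than a small one.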

Cross-Tabulation: Human Themes × AI Theme Assignments

Heatmaps showing how human themes (rows) map to AI themes (columns) based on actual tweet flows, with one heatmap per classifier.

ChatGPT

Claude
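The matrix behind such a heatmap can be built by counting (human theme, AI theme) pairs and, optionally, normalising each row so that cells read as the share of a human theme's tweets sent to each AI theme. The following is a sketch under assumptions: the AI theme names are hypothetical, and row normalisation is one possible display choice, not necessarily the one used in the figures.

```python
def crosstab(pairs, row_labels, col_labels):
    """Count (human_theme, ai_theme) pairs into {row_label: [counts per column]}."""
    col_index = {col: j for j, col in enumerate(col_labels)}
    table = {row: [0] * len(col_labels) for row in row_labels}
    for human, ai in pairs:
        table[human][col_index[ai]] += 1
    return table


def row_normalize(table):
    """Convert counts to row shares; an empty row stays all zeros."""
    shares = {}
    for row, counts in table.items():
        total = sum(counts)
        shares[row] = [c / total if total else 0.0 for c in counts]
    return shares


# Toy data with hypothetical AI theme names.
rows = ["Evacuation", "Fire Progression"]
cols = ["Evacuation Orders", "Road Closures", "Wildfire Spread Updates"]
toy_flows = (
    [("Evacuation", "Evacuation Orders")] * 3
    + [("Evacuation", "Road Closures")]
    + [("Fire Progression", "Wildfire Spread Updates")] * 2
)

counts = crosstab(toy_flows, rows, cols)
shares = row_normalize(counts)
print(counts["Evacuation"])  # [3, 1, 0]
print(shares["Evacuation"])  # [0.75, 0.25, 0.0]
```

The same pair counts also give the link widths of a Sankey diagram, with each nonzero cell corresponding to one flow from a human theme to an AI theme.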