TakeMeter

A fine-tuned text classifier that evaluates discourse quality in r/nba — distinguishing analysis, hot takes, and reactions.

Evaluation summary

| Metric | Baseline (Groq) | Fine-tuned (DistilBERT) |
| --- | --- | --- |
| Test accuracy | 50.0% | 46.9% |
| Best per-class F1 | not reported | analysis: 0.57 |
| Worst per-class F1 | not reported | reaction: 0.18 |
| Main error pattern | not reported | `hot_take` and `reaction` misclassified as `analysis` when posts mention stats |

Fine-tuning did not beat the zero-shot baseline. See full report below.

Community choice

r/nba is Reddit's main NBA discussion community. Discourse there ranges from stat-heavy tactical breakdowns to emotional game reactions to bold unsupported opinions — distinctions that regulars actually care about. The subreddit is text-heavy, public, and large enough to collect 200+ diverse examples without authentication.

Label taxonomy

Three mutually exclusive labels grounded in how r/nba participants talk about post quality:

`analysis`

Definition: The post makes a structured argument backed by specific, verifiable evidence — statistics, historical comparisons, tactical observations, or film-watching detail — where the evidence would support the claim even if you removed the opinion framing.

Examples:

"Their offensive rating drops 8 points per 100 possessions when Jokic sits. Denver's bench simply can't run the same actions without his passing gravity."
"Since 2019, teams that switch everything on Brunson have held him below 40% FG in the playoffs. Miami's scheme in round 2 exploited exactly that."

`hot_take`

Definition: A bold, confident opinion stated without genuine supporting evidence; the post asserts rather than argues, even if it sounds informed or cites a single decorative stat.

Examples:

"The Lakers are never winning another championship with this roster. It's over."
"Refs absolutely decided that game. The league wants the Celtics in the Finals."

`reaction`

Definition: An immediate emotional response to a specific event with little or no argument; the post is expressing a feeling in the moment rather than making a claim to be evaluated.

Examples:

"WHAT A DUNK OH MY GOD"
"I'm actually sick. That was the worst officiating I've ever seen."

Full label design, edge-case rules, and additional examples are in planning.md.

Annotated dataset

| Detail | Value |
| --- | --- |
| Source | Public r/nba comments via PullPush archive API |
| File | `data/labeled_dataset.csv` |
| Total examples | 210 |
| Split | 70% train / 15% val / 15% test (stratified, handled by Colab notebook) |
| Test set size | 32 |

Label distribution:

| Label | Count | % |
| --- | --- | --- |
| analysis | 70 | 33.3% |
| hot_take | 73 | 34.8% |
| reaction | 67 | 31.9% |
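The stratified split itself lives in the Colab notebook and is not reproduced in this README. As a rough illustration of what stratifying by label means for these counts, here is a minimal stdlib sketch. The function name `stratified_split`, the seed, and the rounding scheme are assumptions, not the notebook's code.

```python
import random
from collections import defaultdict

def stratified_split(rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split (text, label) rows into train/val/test one label at a time."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # whatever remains goes to test
    return train, val, test

# Placeholder rows that only reproduce the label counts reported above.
counts = {"analysis": 70, "hot_take": 73, "reaction": 67}
rows = [(f"comment {i}", label) for label, n in counts.items() for i in range(n)]

train, val, test = stratified_split(rows)
print(len(train), len(val), len(test))  # 147 31 32
```

With this rounding scheme the test split holds 11 analysis, 11 hot_take, and 10 reaction examples, which lines up with the support column in the evaluation report. The notebook may round differently.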

Labeling process: Comments were collected with keyword-diverse PullPush queries, then pre-labeled with a rule-assisted script encoding the decision rules from planning.md. I reviewed each label and recorded difficult cases in the notes column. See AI Usage below for disclosure.
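To make "rule-assisted pre-labeling" concrete, here is a hypothetical sketch of a keyword-based pre-labeler that also flags items for manual review. The patterns, the 25-word threshold, and the function name `prelabel` are all illustrative assumptions; the actual rules in `scripts/label_data.py` are not shown in this README.

```python
import re

# Hypothetical keyword rules, NOT the patterns used in scripts/label_data.py.
STAT_PATTERN = re.compile(
    r"offensive rating|net rating|per 100 possessions|\bTS%|\beFG%", re.I
)
REACTION_PATTERN = re.compile(r"\b(omg|oh my god|wow|lmao)\b|!{2,}", re.I)

def prelabel(text):
    """Return (provisional_label, needs_review) for one comment."""
    if STAT_PATTERN.search(text):
        # Short stat mentions are the risky case: they may be jabs or hype
        # rather than arguments, so flag them for a human decision.
        return "analysis", len(text.split()) < 25
    if REACTION_PATTERN.search(text) or text.isupper():
        return "reaction", False
    # Everything else defaults to hot_take and is always reviewed.
    return "hot_take", True

print(prelabel("Check what the offensive rating was with Luka off the floor"))
# ('analysis', True)
```

A rule like this assigns analysis to the "Check what the offensive rating was" comment on vocabulary alone, which is the kind of case the manual review step has to catch.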

Difficult labeling decisions (from actual dataset examples):

"Okay? The Wolves are winning the minutes he sits by 5 points per 100 possessions. It's not like he is playing badly, he is just not on some dominant run..." — Could be analysis (cites points per 100 possessions) or hot_take (argumentative pushback tone). Decided: analysis — the stat supports a structured counter-argument, not just a bold assertion.
"This is the ball movement that led to the best offensive rating in NBA history last year man" — Could be analysis (mentions offensive rating) or reaction (excited hype about a highlight). Decided: reaction — enthusiasm about ball movement, not an argument built on the stat.
"Check what the offensive rating was with Luka off the floor" — Could be analysis (asks reader to look at a stat) or reaction (dismissive one-liner in a thread argument). Decided: reaction — short reactive jab, not a structured analysis.

Fine-tuning pipeline

| Setting | Value |
| --- | --- |
| Base model | `distilbert-base-uncased` (HuggingFace) |
| Platform | Google Colab, T4 GPU |
| Epochs | 3 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max sequence length | 256 tokens |

Hyperparameter decision: I kept 3 training epochs rather than increasing to 5. With only ~147 training examples, more passes risk memorizing surface phrases (e.g. "offensive rating" → analysis) instead of learning discourse structure. Three epochs is a common starting point for BERT-family fine-tuning on small datasets; training finished in under 15 minutes and showed no signs of overfitting on the validation set.
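For a sense of scale, the settings above imply only a few dozen optimizer updates. The arithmetic below assumes no gradient accumulation and that the final partial batch is kept, neither of which is stated in this README.

```python
import math

train_examples = 147  # ~70% of 210, as stated above
batch_size = 16
epochs = 3

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 10 30
```

Going from 3 to 5 epochs under the same assumptions would raise the total from 30 to 50 updates.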

Baseline comparison

Model: Groq llama-3.3-70b-versatile (zero-shot, no task-specific training)

How results were collected: The notebook's classify_with_groq() function sent each of the 32 held-out test examples to Groq with temperature=0. The system prompt included all three label definitions, one example per label, the analysis/hot_take edge-case rule, and an instruction to respond with only the label name. Predictions were parsed against LABEL_MAP; unparseable responses were excluded from metrics.
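The parsing and exclusion step described above can be sketched as follows. The contents of `LABEL_MAP`, the normalization rules, and the helper names `parse_label` and `score` are assumptions for illustration; the notebook's `classify_with_groq()` is not reproduced here.

```python
# Assumed mapping; the notebook's LABEL_MAP may use different ids.
LABEL_MAP = {"analysis": 0, "hot_take": 1, "reaction": 2}

def parse_label(response):
    """Normalize a raw model reply; return a label id or None if unparseable."""
    cleaned = response.strip().strip("`'\".").lower().replace(" ", "_")
    return LABEL_MAP.get(cleaned)

def score(responses, gold):
    """Return (correct, scored, excluded) with unparseable replies dropped."""
    pairs = [(parse_label(r), g) for r, g in zip(responses, gold)]
    kept = [(p, g) for p, g in pairs if p is not None]
    correct = sum(p == LABEL_MAP[g] for p, g in kept)
    return correct, len(kept), len(pairs) - len(kept)

# Made-up replies, including one that ignores the "label only" instruction.
responses = ["analysis", "Hot take.", "I think this is a reaction", "reaction\n"]
gold = ["analysis", "hot_take", "reaction", "hot_take"]

print(score(responses, gold))  # (2, 3, 1)
```

Excluding unparseable replies shrinks the denominator (2 of 3 here rather than 2 of 4), so it is worth recording the excluded count next to the baseline accuracy.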

Full classification prompt:

You are classifying Reddit comments from r/nba.
Assign each comment to exactly one of the following categories.

analysis: The post makes a structured argument backed by specific, verifiable evidence — statistics, historical comparisons, tactical observations, or film-watching detail — where the evidence would support the claim even if you removed the opinion framing.
Example: "Their offensive rating drops 8 points per 100 possessions when Jokic sits. Denver's bench simply can't run the same actions without his passing gravity."

hot_take: A bold, confident opinion stated without genuine supporting evidence; the post asserts rather than argues, even if it sounds informed or cites a single decorative stat.
Example: "The Lakers are never winning another championship with this roster. It's over."

reaction: An immediate emotional response to a specific event with little or no argument; the post is expressing a feeling in the moment rather than making a claim to be evaluated.
Example: "WHAT A DUNK OH MY GOD"

Edge case rule: If a post cites one cherry-picked stat with accusatory framing but no real argument, label it hot_take, not analysis.

Respond with ONLY the label name.
Do not explain your reasoning.

Valid labels:
analysis
hot_take
reaction

Evaluation report

Results from evaluation_results.json on the locked 32-example test set.

Overall accuracy

| Model | Accuracy |
| --- | --- |
| Zero-shot baseline (Groq) | 0.500 |
| Fine-tuned DistilBERT | 0.469 |
| Difference | −0.031 (fine-tuning regression) |

Fine-tuning did not beat the zero-shot baseline on accuracy. The general-purpose LLM, given explicit label definitions, slightly outperformed the domain-specific fine-tuned model. On a 32-example test set, the 3.1-point gap amounts to about one example.
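One way to judge how much weight a gap this size can bear is a confidence interval on each accuracy. The sketch below computes 95% Wilson score intervals, assuming both models were scored on all 32 test examples (16 and 15 correct). If unparseable baseline replies were excluded, the baseline denominator would be smaller than assumed here.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for a proportion of correct predictions."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumed counts: 0.500 * 32 = 16 and 0.469 * 32 = 15.
baseline = wilson_interval(16, 32)
finetuned = wilson_interval(15, 32)

print([round(x, 3) for x in baseline])
print([round(x, 3) for x in finetuned])
```

Both intervals are roughly 30 points wide and overlap almost entirely, so a test set this small cannot separate the two models.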

Per-class metrics

Zero-shot baseline (Groq) — same 32-example test set; see Colab Section 5 classification report for full precision/recall/F1 per label.

Fine-tuned DistilBERT:

| Label | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| analysis | 0.42 | 0.91 | 0.57 | 11 |
| hot_take | 0.57 | 0.36 | 0.44 | 11 |
| reaction | 1.00 | 0.10 | 0.18 | 10 |
| accuracy | | | 0.47 | 32 |

The fine-tuned model over-predicts analysis (91% recall, 42% precision) and almost never predicts reaction (10% recall).

Confusion matrix (fine-tuned)

| | Pred: analysis | Pred: hot_take | Pred: reaction |
| --- | --- | --- | --- |
| True: analysis | 10 | 1 | 0 |
| True: hot_take | 7 | 4 | 0 |
| True: reaction | 7 | 2 | 1 |

The largest off-diagonal cells are hot_take → analysis (7) and reaction → analysis (7): 14 of the 17 errors are non-analysis posts labeled as analysis. The wrong predictions examined below suggest this happens mainly when a post mentions basketball statistics.
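The per-class table can be recomputed directly from the confusion matrix, which is a useful consistency check. This is a minimal stdlib sketch; the helper name `per_class_metrics` is illustrative and not taken from the notebook.

```python
labels = ["analysis", "hot_take", "reaction"]

# Rows are true labels, columns are predicted labels (values from the matrix above).
cm = [
    [10, 1, 0],
    [7, 4, 0],
    [7, 2, 1],
]

def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} rounded to two decimals."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column total
        actual = sum(cm[i])                    # row total (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (round(precision, 2), round(recall, 2), round(f1, 2))
    return metrics

accuracy = sum(cm[i][i] for i in range(len(labels))) / sum(map(sum, cm))

print(per_class_metrics(cm, labels))
# {'analysis': (0.42, 0.91, 0.57), 'hot_take': (0.57, 0.36, 0.44), 'reaction': (1.0, 0.1, 0.18)}
print(round(accuracy, 3))  # 0.469
```

The perfect reaction precision comes from a single prediction: the model predicted reaction once and happened to be right.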

Wrong predictions (3 analyzed)

1. reaction → analysis (confidence: 0.37)

"This is the ball movement that led to the best offensive rating in NBA history last year man"

The post is excited hype about ball movement, not a structured argument. The model matched the phrase "offensive rating" to analysis because that term appears frequently in true analysis examples. It learned vocabulary rather than discourse structure.

2. analysis → hot_take (confidence: 0.36)

"Okay? The Wolves are winning the minutes he sits by 5 points per 100 possessions. It's not like he is playing badly..."

A borderline case — cites a specific stat in argumentative pushback. Per my edge-case rule this leans analysis, but the confrontational tone may have pushed the model toward hot_take. This reflects the genuinely ambiguous boundary between evidence-backed pushback and opinionated dismissal.

3. reaction → analysis (confidence: 0.35)

"Check what the offensive rating was with Luka off the floor"

A short, dismissive reply — not analysis, but a reactive jab. The model saw "offensive rating" and defaulted to analysis. Short posts that reference stats without actually arguing are a systematic failure mode.

Sample classifications

All examples below are from the held-out test set (fine-tuned model inference, Colab Section 4).

| Post | True label | Predicted | Confidence | Correct? |
| --- | --- | --- | --- | --- |
| "Record. Offensive rating. Net rating. Effective field goal percentage. True shooting percentage. Offensive efficiency. Defensive efficiency..." | analysis | analysis | ~0.37 | ✅ |
| "This is the ball movement that led to the best offensive rating in NBA history last year man" | reaction | analysis | 0.37 | ❌ |
| "Okay? The Wolves are winning the minutes he sits by 5 points per 100 possessions. It's not like he is playing badly..." | analysis | hot_take | 0.36 | ❌ |
| "Check what the offensive rating was with Luka off the floor" | reaction | analysis | 0.35 | ❌ |
| "I love this. Make it an EX meter of sorts. 3pt fgs are a time-banked opportunity." | reaction | hot_take | 0.36 | ❌ |

Why the one correct prediction is reasonable: The post enumerates multiple advanced metrics (offensive rating, net rating, TS%, eFG%) as part of a structured comparison, not a single cherry-picked stat or an emotional outburst. That matches the analysis definition: evidence that supports a broader argument.

Pattern across all predictions: Confidence stays low (~0.35–0.38) even on correct calls. With three labels, a uniform guess would score 0.33, so the model is barely more certain than chance on this subjective boundary task.
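To show how close these confidences are to a uniform guess, the sketch below assumes the reported confidence is the largest softmax probability over the three labels, which is the usual convention but is not confirmed by this README. The logits are hypothetical values chosen to land near 0.37, not numbers from the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (analysis, hot_take, reaction); not model output.
probs = softmax([0.15, 0.0, -0.05])
confidence = max(probs)
margin_over_uniform = confidence - 1 / 3

print(round(confidence, 3), round(margin_over_uniform, 3))
```

Logits that differ by only 0.2 are enough to produce a top probability of about 0.37, so a confidence in this range means the model is separating the classes very weakly.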

Reflection: intended vs. learned

I intended the model to learn discourse structure: whether a post argues with evidence, asserts an opinion, or reacts emotionally. What it actually learned is closer to a stats-keyword detector: mentions of "offensive rating," "points per 100," and "defensive" strongly pull toward analysis, regardless of whether the post is hype (#1), a dismissive one-liner (#3), or a bold opinion with decorative stats. The model barely learned reaction (recall 0.10), likely because many r/nba reaction posts still use basketball terminology, blurring the boundary I defined. Fine-tuning also performed slightly worse than the zero-shot Groq baseline, which suggests that the rule-assisted labels were noisy enough for DistilBERT to learn spurious correlations rather than the intended distinctions.

Spec reflection

How the spec helped: The requirement to run a zero-shot baseline before fine-tuning was especially valuable — without it, 47% accuracy might have looked acceptable (well above the 33% random baseline). The side-by-side comparison revealed that fine-tuning actually regressed by 3 percentage points, which changed my conclusion from "the model works" to "the labels need cleaner annotation."

Where implementation diverged: The spec recommends reading each of 200+ posts carefully during annotation. I used rule-assisted pre-labeling to speed up collection, which diverged from that intent. The resulting label noise (especially around posts that mention stats without truly analyzing them) likely caused the model's analysis bias. If I redid the project, I would manually label a smaller pilot set first, measure inter-annotator agreement, and only then scale up.
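Agreement between two sets of labels on a pilot set is often summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python version run on made-up labels; the `human` and `script` lists are illustrative and not drawn from the dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up labels for six pilot comments.
human = ["analysis", "reaction", "hot_take", "reaction", "analysis", "hot_take"]
script = ["analysis", "analysis", "hot_take", "analysis", "analysis", "hot_take"]

kappa = cohens_kappa(human, script)
print(round(kappa, 2))  # 0.5
```

In this toy case the script never outputs reaction, and kappa drops to 0.5 even though raw agreement is 4 of 6.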

AI usage

Label taxonomy drafting (Cursor): I asked Cursor to draft planning.md and the Groq classification prompt based on my community choice (r/nba). I revised the edge-case decision rules — especially the analysis vs. hot_take boundary for single-stat posts — before using them for annotation.
Rule-assisted pre-labeling (Python script): I used a heuristic labeling script (scripts/label_data.py) to assign initial labels from keyword patterns. I overrode cases that violated my decision rules and added 11 curated examples to balance the dataset. This is disclosed because the script's patterns (e.g. "offensive rating" → analysis) may have introduced the same spurious correlations the model later learned.
Failure analysis (Cursor): After Colab training, I pasted misclassified test examples into Cursor and asked it to identify systematic error patterns. It flagged the analysis-keyword bias, which I verified by re-reading all 17 wrong predictions in the confusion matrix.

Repository contents

planning.md                 # Label design, data plan, evaluation criteria
data/labeled_dataset.csv    # 210 labeled r/nba comments
evaluation_results.json     # Baseline vs fine-tuned metrics
confusion_matrix.png        # Fine-tuned model confusion matrix
COLAB_GUIDE.md              # Colab notebook copy-paste guide
scripts/                    # Data collection helpers
