priyavisingh/TakeMeter

TakeMeter

A fine-tuned text classifier that evaluates discourse quality in r/nba — distinguishing analysis, hot takes, and reactions.

Evaluation summary

| Metric | Baseline (Groq) | Fine-tuned (DistilBERT) |
|---|---|---|
| Test accuracy | 50.0% | 46.9% |
| Best per-class F1 | | analysis: 0.57 |
| Worst per-class F1 | | reaction: 0.18 |
| Main error pattern | | hot_take and reaction misclassified as analysis when posts mention stats |

Fine-tuning did not beat the zero-shot baseline. See full report below.

Community choice

r/nba is Reddit's main NBA discussion community. Discourse there ranges from stat-heavy tactical breakdowns to emotional game reactions to bold unsupported opinions — distinctions that regulars actually care about. The subreddit is text-heavy, public, and large enough to collect 200+ diverse examples without authentication.


Label taxonomy

Three mutually exclusive labels grounded in how r/nba participants talk about post quality:

analysis

Definition: The post makes a structured argument backed by specific, verifiable evidence — statistics, historical comparisons, tactical observations, or film-watching detail — where the evidence would support the claim even if you removed the opinion framing.

Examples:

  • "Their offensive rating drops 8 points per 100 possessions when Jokic sits. Denver's bench simply can't run the same actions without his passing gravity."
  • "Since 2019, teams that switch everything on Brunson have held him below 40% FG in the playoffs. Miami's scheme in round 2 exploited exactly that."

hot_take

Definition: A bold, confident opinion stated without genuine supporting evidence; the post asserts rather than argues, even if it sounds informed or cites a single decorative stat.

Examples:

  • "The Lakers are never winning another championship with this roster. It's over."
  • "Refs absolutely decided that game. The league wants the Celtics in the Finals."

reaction

Definition: An immediate emotional response to a specific event with little or no argument; the post is expressing a feeling in the moment rather than making a claim to be evaluated.

Examples:

  • "WHAT A DUNK OH MY GOD"
  • "I'm actually sick. That was the worst officiating I've ever seen."

Full label design, edge-case rules, and additional examples are in planning.md.


Annotated dataset

| Detail | Value |
|---|---|
| Source | Public r/nba comments via PullPush archive API |
| File | data/labeled_dataset.csv |
| Total examples | 210 |
| Split | 70% train / 15% val / 15% test (stratified, handled by Colab notebook) |
| Test set size | 32 |

Label distribution:

| Label | Count | % |
|---|---|---|
| analysis | 70 | 33.3% |
| hot_take | 73 | 34.8% |
| reaction | 67 | 31.9% |
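
For readers who want to reproduce a split like the one described above, here is a minimal pure-stdlib sketch of a stratified 70/15/15 split. It is not the notebook's code: the function name `stratified_split`, the seed, and the per-label rounding are assumptions, so the resulting split sizes may differ by an example or two from the 147/31/32 reported here.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Return (train, val, test) index lists, splitting each label proportionally.

    Hypothetical helper: the Colab notebook's own split logic is not shown
    in this README, so the rounding here is an assumption.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        n_val = round(len(idxs) * val_frac)
        test.extend(idxs[:n_test])
        val.extend(idxs[n_test:n_test + n_val])
        train.extend(idxs[n_test + n_val:])
    return train, val, test


# Label counts taken from the distribution table above.
labels = ["analysis"] * 70 + ["hot_take"] * 73 + ["reaction"] * 67
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each label keeps the three classes near one third of every subset, which matters when the test set is only about 32 examples.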

Labeling process: Comments were collected with keyword-diverse PullPush queries, then pre-labeled with a rule-assisted script encoding the decision rules from planning.md. Each label was reviewed and difficult cases were noted in the notes column. See AI Usage below for disclosure.

Difficult labeling decisions (from actual dataset examples):

  1. "Okay? The Wolves are winning the minutes he sits by 5 points per 100 possessions. It's not like he is playing badly, he is just not on some dominant run..." — Could be analysis (cites points per 100 possessions) or hot_take (argumentative pushback tone). Decided: analysis — the stat supports a structured counter-argument, not just a bold assertion.

  2. "This is the ball movement that led to the best offensive rating in NBA history last year man" — Could be analysis (mentions offensive rating) or reaction (excited hype about a highlight). Decided: reaction — enthusiasm about ball movement, not an argument built on the stat.

  3. "Check what the offensive rating was with Luka off the floor" — Could be analysis (asks reader to look at a stat) or reaction (dismissive one-liner in a thread argument). Decided: reaction — short reactive jab, not a structured analysis.


Fine-tuning pipeline

| Setting | Value |
|---|---|
| Base model | distilbert-base-uncased (HuggingFace) |
| Platform | Google Colab, T4 GPU |
| Epochs | 3 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max sequence length | 256 tokens |

Hyperparameter decision: I kept 3 training epochs rather than increasing to 5. With only ~147 training examples, more passes risk memorizing surface phrases (e.g. "offensive rating" → analysis) instead of learning discourse structure. Three epochs is a common starting point for BERT-family fine-tuning on small datasets; training completed in under 15 minutes, and the validation set showed no signs of overfitting.
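
To put the epoch choice in concrete terms, the arithmetic below counts how many optimizer steps the settings above imply. It assumes no gradient accumulation and that the final partial batch is kept; neither detail is stated in this README, so treat the numbers as an estimate.

```python
import math

TRAIN_EXAMPLES = 147  # ~70% of the 210 labeled examples
BATCH_SIZE = 16


def total_optimizer_steps(n_examples, batch_size, epochs):
    """Steps per epoch (last partial batch kept) times number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


steps_3_epochs = total_optimizer_steps(TRAIN_EXAMPLES, BATCH_SIZE, epochs=3)
steps_5_epochs = total_optimizer_steps(TRAIN_EXAMPLES, BATCH_SIZE, epochs=5)
print(steps_3_epochs, steps_5_epochs)  # 30 50
```

Under these assumptions the model sees only about 30 weight updates in total, which is worth keeping in mind when reading the results.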


Baseline comparison

Model: Groq llama-3.3-70b-versatile (zero-shot, no task-specific training)

How results were collected: The notebook's classify_with_groq() function sent each of the 32 held-out test examples to Groq with temperature=0. The system prompt included all three label definitions, one example per label, the analysis/hot_take edge-case rule, and an instruction to respond with only the label name. Predictions were parsed against LABEL_MAP; unparseable responses were excluded from metrics.
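
The parsing and exclusion step described above might look roughly like the sketch below. It does not call Groq and is not the notebook's `classify_with_groq()`; the integer ids in `LABEL_MAP`, the normalization rules, and the helper names `parse_label` and `accuracy_on_parsed` are assumptions for illustration.

```python
# Assumed label ids; the notebook's actual LABEL_MAP values are not shown here.
LABEL_MAP = {"analysis": 0, "hot_take": 1, "reaction": 2}


def parse_label(response_text):
    """Map a raw model response to a label id, or None if it is unparseable."""
    cleaned = response_text.strip().strip(".\"'`").lower()
    cleaned = cleaned.replace(" ", "_").replace("-", "_")
    return LABEL_MAP.get(cleaned)


def accuracy_on_parsed(responses, true_ids):
    """Accuracy over parseable responses only, plus how many were scored."""
    scored = [
        (parse_label(resp), true_id)
        for resp, true_id in zip(responses, true_ids)
        if parse_label(resp) is not None
    ]
    if not scored:
        return 0.0, 0
    correct = sum(pred == true_id for pred, true_id in scored)
    return correct / len(scored), len(scored)


acc, n_scored = accuracy_on_parsed(["analysis", "Reaction.", "not sure"], [0, 1, 2])
print(acc, n_scored)  # 0.5 2
```

Note that excluding unparseable responses shrinks the denominator, so it is worth reporting how many responses were actually scored alongside the accuracy.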

Full classification prompt:

You are classifying Reddit comments from r/nba.
Assign each comment to exactly one of the following categories.

analysis: The post makes a structured argument backed by specific, verifiable evidence — statistics, historical comparisons, tactical observations, or film-watching detail — where the evidence would support the claim even if you removed the opinion framing.
Example: "Their offensive rating drops 8 points per 100 possessions when Jokic sits. Denver's bench simply can't run the same actions without his passing gravity."

hot_take: A bold, confident opinion stated without genuine supporting evidence; the post asserts rather than argues, even if it sounds informed or cites a single decorative stat.
Example: "The Lakers are never winning another championship with this roster. It's over."

reaction: An immediate emotional response to a specific event with little or no argument; the post is expressing a feeling in the moment rather than making a claim to be evaluated.
Example: "WHAT A DUNK OH MY GOD"

Edge case rule: If a post cites one cherry-picked stat with accusatory framing but no real argument, label it hot_take, not analysis.

Respond with ONLY the label name.
Do not explain your reasoning.

Valid labels:
analysis
hot_take
reaction

Evaluation report

Results from evaluation_results.json on the locked 32-example test set.

Overall accuracy

| Model | Accuracy |
|---|---|
| Zero-shot baseline (Groq) | 0.500 |
| Fine-tuned DistilBERT | 0.469 |
| Difference | −0.031 (fine-tuning regression) |

Fine-tuning did not beat the zero-shot baseline on accuracy. The general-purpose LLM with explicit label definitions slightly outperformed the domain-specific fine-tuned model.

Per-class metrics

Zero-shot baseline (Groq) — same 32-example test set; see Colab Section 5 classification report for full precision/recall/F1 per label.

Fine-tuned DistilBERT:

| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| analysis | 0.42 | 0.91 | 0.57 | 11 |
| hot_take | 0.57 | 0.36 | 0.44 | 11 |
| reaction | 1.00 | 0.10 | 0.18 | 10 |
| accuracy | | | 0.47 | 32 |

The fine-tuned model over-predicts analysis (91% recall, 42% precision) and almost never predicts reaction (10% recall).

Confusion matrix (fine-tuned)

| | Pred: analysis | Pred: hot_take | Pred: reaction |
|---|---|---|---|
| True: analysis | 10 | 1 | 0 |
| True: hot_take | 7 | 4 | 0 |
| True: reaction | 7 | 2 | 1 |

The largest off-diagonal cells are hot_take → analysis (7) and reaction → analysis (7): the model systematically labels non-analysis posts that mention basketball statistics as analysis.
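
As a cross-check, the per-class table can be recomputed directly from the confusion matrix. The sketch below uses only the counts reported above (rows are true labels, columns are predictions); the function name `per_class_metrics` is illustrative and not taken from the notebook.

```python
LABELS = ["analysis", "hot_take", "reaction"]

# Rows: true label, columns: predicted label (counts from the matrix above).
CONFUSION = [
    [10, 1, 0],
    [7, 4, 0],
    [7, 2, 1],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} rounded to two decimals."""
    results = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (round(precision, 2), round(recall, 2), round(f1, 2))
    return results


metrics = per_class_metrics(CONFUSION, LABELS)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
total = sum(sum(row) for row in CONFUSION)
accuracy = correct / total

print(metrics)
print(accuracy)  # 0.46875
```

The analysis column sums to 24 predictions out of 32, which is where the 0.42 precision comes from: only 10 of those 24 were truly analysis.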

Wrong predictions (3 analyzed)

1. reaction → analysis (confidence: 0.37)

"This is the ball movement that led to the best offensive rating in NBA history last year man"

The post is excited hype about ball movement, not a structured argument. The model matched the phrase "offensive rating" to analysis because that term appears frequently in true analysis examples. It learned vocabulary rather than discourse structure.

2. analysis → hot_take (confidence: 0.36)

"Okay? The Wolves are winning the minutes he sits by 5 points per 100 possessions. It's not like he is playing badly..."

A borderline case — cites a specific stat in argumentative pushback. Per my edge-case rule this leans analysis, but the confrontational tone may have pushed the model toward hot_take. This reflects the genuinely ambiguous boundary between evidence-backed pushback and opinionated dismissal.

3. reaction → analysis (confidence: 0.35)

"Check what the offensive rating was with Luka off the floor"

A short, dismissive reply — not analysis, but a reactive jab. The model saw "offensive rating" and defaulted to analysis. Short posts that reference stats without actually arguing are a systematic failure mode.

Sample classifications

All examples below are from the held-out test set (fine-tuned model inference, Colab Section 4).

| Post | True label | Predicted | Confidence | Correct? |
|---|---|---|---|---|
| "Record. Offensive rating. Net rating. Effective field goal percentage. True shooting percentage. Offensive efficiency. Defensive efficiency..." | analysis | analysis | ~0.37 | yes |
| "This is the ball movement that led to the best offensive rating in NBA history last year man" | reaction | analysis | 0.37 | no |
| "Okay? The Wolves are winning the minutes he sits by 5 points per 100 possessions. It's not like he is playing badly..." | analysis | hot_take | 0.36 | no |
| "Check what the offensive rating was with Luka off the floor" | reaction | analysis | 0.35 | no |
| "I love this. Make it an EX meter of sorts. 3pt fgs are a time-banked opportunity." | reaction | hot_take | 0.36 | no |

Why the first correct prediction is reasonable: The post enumerates multiple advanced metrics (offensive rating, net rating, TS%, eFG%) as part of a structured comparison — not a single cherry-picked stat or emotional outburst. That matches the analysis definition: evidence that supports a broader argument.

Pattern across all predictions: Confidence stays low (~0.35–0.38) even on correct calls, only slightly above the 0.33 that a uniform guess over three classes would give. This suggests the model is uncertain on this subjective boundary task.
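
To see why confidences in this range indicate near-indifference, the sketch below assumes that "confidence" means the largest softmax probability over the three class logits, which this README does not state explicitly. The example logits are hypothetical and not taken from the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Assumed definition: probability of the top-scoring class."""
    return max(softmax(logits))


# Hypothetical logits that differ only slightly across the three classes.
example_logits = [0.10, 0.00, -0.05]
print(round(confidence(example_logits), 2))
print(round(confidence([0.0, 0.0, 0.0]), 2))  # uniform case, 1/3
```

Logits that are almost equal produce a top probability only a few points above one third, which matches the range seen in the table.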

Reflection: intended vs. learned

I intended the model to learn discourse structure — whether a post argues with evidence, asserts an opinion, or reacts emotionally. What it actually learned is closer to a stats-keyword detector: mentions of "offensive rating," "points per 100," and "defensive" strongly pull toward analysis, regardless of whether the post is hype (#1), a dismissive one-liner (#3), or a bold opinion with decorative stats. The model barely learned reaction (recall 0.10), likely because many r/nba reaction posts still use basketball terminology, blurring the boundary I defined. Fine-tuning also performed slightly worse than the zero-shot Groq baseline, suggesting that 210 rule-assisted labels introduced enough noise that DistilBERT learned spurious correlations rather than the intended distinctions.
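
One way to check the keyword-detector suspicion before retraining is to measure how often stats vocabulary appears under each label. The sketch below is a hypothetical audit, not part of the repository: the keyword list comes from the terms named above, and the five examples are posts quoted elsewhere in this README rather than the full dataset.

```python
from collections import defaultdict

# Terms named in the reflection above; extend as needed.
STAT_KEYWORDS = ("offensive rating", "points per 100", "defensive")


def keyword_rate_by_label(examples, keywords=STAT_KEYWORDS):
    """Fraction of posts per label that contain at least one keyword."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, label in examples:
        totals[label] += 1
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# A handful of posts quoted in this README, with their true labels.
examples = [
    ("Their offensive rating drops 8 points per 100 possessions when Jokic sits.", "analysis"),
    ("This is the ball movement that led to the best offensive rating in NBA history last year man", "reaction"),
    ("Check what the offensive rating was with Luka off the floor", "reaction"),
    ("WHAT A DUNK OH MY GOD", "reaction"),
    ("The Lakers are never winning another championship with this roster. It's over.", "hot_take"),
]

rates = keyword_rate_by_label(examples)
print(rates)
```

Run over data/labeled_dataset.csv, a large gap in keyword rate between analysis and the other two labels would be consistent with the model leaning on vocabulary instead of discourse structure.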

Spec reflection

How the spec helped: The requirement to run a zero-shot baseline before fine-tuning was especially valuable — without it, 47% accuracy might have looked acceptable (well above the 33% random baseline). The side-by-side comparison revealed that fine-tuning actually regressed by 3 percentage points, which changed my conclusion from "the model works" to "the labels need cleaner annotation."

Where implementation diverged: The spec recommends reading each of 200+ posts carefully during annotation. I used rule-assisted pre-labeling to speed collection, which diverged from that intent. The resulting label noise (especially around posts that mention stats without truly analyzing) likely caused the model's analysis bias. If I redid the project, I would manually label a smaller pilot set first, measure inter-annotator agreement, and only then scale up.
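
The agreement check proposed above could be done with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version; the two label lists are made-up pilot data, and the "second pass" could be a second annotator or the same annotator relabeling later.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of each label's marginal frequencies, summed.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pilot labels from two annotation passes over six posts.
first_pass = ["analysis", "analysis", "hot_take", "reaction", "reaction", "hot_take"]
second_pass = ["analysis", "hot_take", "hot_take", "reaction", "analysis", "hot_take"]

kappa = cohens_kappa(first_pass, second_pass)
print(round(kappa, 2))  # 0.5
```

Here the passes agree on 4 of 6 posts (0.67 raw agreement), but chance agreement is 0.33, so kappa is 0.5. A low kappa on the pilot set would signal that the label definitions need tightening before scaling up.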

AI usage

  1. Label taxonomy drafting (Cursor): I asked Cursor to draft planning.md and the Groq classification prompt based on my community choice (r/nba). I revised the edge-case decision rules — especially the analysis vs. hot_take boundary for single-stat posts — before using them for annotation.

  2. Rule-assisted pre-labeling (Python script): I used a heuristic labeling script (scripts/label_data.py) to assign initial labels from keyword patterns. I overrode cases that violated my decision rules and added 11 curated examples to balance the dataset. This is disclosed because the script's patterns (e.g. "offensive rating" → analysis) may have introduced the same spurious correlations the model later learned.

  3. Failure analysis (Cursor): After Colab training, I pasted misclassified test examples into Cursor and asked it to identify systematic error patterns. It flagged the analysis-keyword bias, which I verified by re-reading all 17 wrong predictions in the confusion matrix.


Repository contents

planning.md                 # Label design, data plan, evaluation criteria
data/labeled_dataset.csv    # 210 labeled r/nba comments
evaluation_results.json     # Baseline vs fine-tuned metrics
confusion_matrix.png        # Fine-tuned model confusion matrix
COLAB_GUIDE.md              # Colab notebook copy-paste guide
scripts/                    # Data collection helpers
