feat: expand moderation categories #163
✅ Snyk checks have passed. No issues have been found so far.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 2cca2be.
Okay, so according to the OpenAI docs, they only support image classification for sexual, violence, and self-harm. They don't do it for drugs or hate 😭 So I'm going to put this on the shelf until they release a more up-to-date model.
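To make the limitation above concrete, here is a minimal sketch of how an extractor could avoid reporting scores for categories the model never evaluated on the image. It assumes the moderation response exposes `category_scores` and `category_applied_input_types` per category; the `ModerationResultLike` type, the `OPENAI_CATEGORY_KEYS` map, and `extractImageScores` are hypothetical names for illustration and are not taken from this PR.

```typescript
// Hypothetical local types: only the two response fields used here are modeled.
type InputType = "text" | "image";

interface ModerationResultLike {
  category_scores: Record<string, number>;
  category_applied_input_types: Record<string, InputType[]>;
}

// Our API field name -> OpenAI category key ("illicit" is exposed as "drugs").
const OPENAI_CATEGORY_KEYS = {
  sexual: "sexual",
  violence: "violence",
  hate: "hate",
  selfHarm: "self-harm",
  drugs: "illicit",
} as const;

type Category = keyof typeof OPENAI_CATEGORY_KEYS;

// Returns a score only where the image was actually evaluated for that
// category; returns null otherwise so callers cannot mistake
// "not evaluated" for "safe".
function extractImageScores(
  result: ModerationResultLike,
): Record<Category, number | null> {
  const out = {} as Record<Category, number | null>;
  for (const field of Object.keys(OPENAI_CATEGORY_KEYS) as Category[]) {
    const key = OPENAI_CATEGORY_KEYS[field];
    const applied = result.category_applied_input_types[key] ?? [];
    out[field] = applied.includes("image")
      ? (result.category_scores[key] ?? 0)
      : null;
  }
  return out;
}
```

Returning `null` instead of `0` is a design choice in this sketch: a threshold check can then skip unsupported categories rather than silently passing them.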

Overview
Expand the moderation endpoint from 2 categories (sexual, violence) to 5 by adding hate, self-harm, and drugs. Both OpenAI and Hive providers now extract and return the new scores. Also fixes a bogus Hive category (`garm_death_injury_or_military_conflict`) that never matched anything, and moves `yes_self_harm`/`yes_emaciated_body` out of the violence bucket into their own self-harm category.

What was changed
- `src/workflows/moderation.ts`: Added `hate`, `selfHarm`, `drugs` fields to the `ThumbnailModerationScore`, `ModerationResult`, and `ModerationOptions` interfaces. Added `HIVE_HATE_CATEGORIES`, `HIVE_SELF_HARM_CATEGORIES`, `HIVE_ILLICIT_CATEGORIES` constant arrays. Removed `garm_death_injury_or_military_conflict` from `HIVE_VIOLENCE_CATEGORIES` (invalid class name from Hive's GARM model, not the visual moderation model). Moved `yes_self_harm`/`yes_emaciated_body` from violence to self-harm. OpenAI extraction now reads `hate`, `self-harm`, and `illicit` from `category_scores` (`illicit` is mapped to `drugs` in our API). `exceedsThreshold` now checks all 5 categories.
- `tests/unit/moderation.test.ts`: Snapshot tests for all 5 Hive category arrays.
- `tests/eval/moderation.eval.ts`: Response-integrity scorer and eval columns updated for new fields.
- `docs/API.md`: Documented new threshold options and response fields.

Suggested review order
1. `src/workflows/moderation.ts`: types and Hive category constants (lines 30-190)
2. `src/workflows/moderation.ts`: OpenAI extraction changes (search for `categoryScores.hate`)
3. `src/workflows/moderation.ts`: Hive extraction and aggregation (search for `HIVE_HATE_CATEGORIES`)
4. `src/workflows/moderation.ts`: `getModerationScores` max score + threshold logic (bottom of file)
5. `tests/unit/moderation.test.ts`
6. `docs/API.md`
7. `tests/eval/moderation.eval.ts`

Note
Medium Risk
Expands the `getModerationScores` API surface and thresholding logic from 2 to 5 categories, which can break downstream consumers expecting the old schema and changes what content gets flagged. Provider-specific category mapping tweaks (especially Hive category buckets) also risk behavior changes in production moderation results.

Overview
`getModerationScores` now returns and thresholds five moderation categories (adds `hate`, `selfHarm`, `drugs` alongside `sexual`/`violence`) for both OpenAI and Hive providers, including updated `maxScores`, per-sample scores, and `exceedsThreshold` evaluation.

Hive category handling is refactored into explicit `HIVE_HATE_CATEGORIES`, `HIVE_SELF_HARM_CATEGORIES`, and `HIVE_ILLICIT_CATEGORIES`, removing an invalid category and moving self-harm-related classes out of the violence bucket. Docs and eval/unit tests are updated to reflect the new fields and category constants.

Reviewed by Cursor Bugbot for commit a22dd6f.
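The bucket aggregation, per-category maximum, and five-category threshold check described in this PR can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `src/workflows/moderation.ts`: the `HiveClass` shape, the helper names `bucketScore` and `maxScoresOf`, the strict `>` comparison, and treating an unset threshold as "do not check" are all assumptions. Only the two self-harm class names mentioned in the PR are listed.

```typescript
const SCORE_CATEGORIES = ["sexual", "violence", "hate", "selfHarm", "drugs"] as const;
type ScoreCategory = (typeof SCORE_CATEGORIES)[number];
type CategoryScores = Record<ScoreCategory, number>;

// Assumption: thresholds are optional per category.
type Thresholds = Partial<CategoryScores>;

// Only the class names named in this PR; the real array may contain more.
const HIVE_SELF_HARM_CATEGORIES: readonly string[] = [
  "yes_self_harm",
  "yes_emaciated_body",
];

// Assumed Hive-style class entry: a class name with a confidence score.
interface HiveClass {
  class: string;
  score: number;
}

// Score for one bucket = highest score among the classes in that bucket.
function bucketScore(classes: HiveClass[], bucket: readonly string[]): number {
  return classes
    .filter((c) => bucket.includes(c.class))
    .reduce((max, c) => Math.max(max, c.score), 0);
}

// Per-category maximum across all sampled thumbnails.
function maxScoresOf(samples: CategoryScores[]): CategoryScores {
  const out: CategoryScores = { sexual: 0, violence: 0, hate: 0, selfHarm: 0, drugs: 0 };
  for (const sample of samples) {
    for (const category of SCORE_CATEGORIES) {
      out[category] = Math.max(out[category], sample[category]);
    }
  }
  return out;
}

// True if any category with a configured threshold is exceeded.
// Assumption: comparison is strict (>); the real code may use >=.
function exceedsThreshold(scores: CategoryScores, thresholds: Thresholds): boolean {
  return SCORE_CATEGORIES.some((category) => {
    const limit = thresholds[category];
    return limit !== undefined && scores[category] > limit;
  });
}
```

Because `yes_self_harm` and `yes_emaciated_body` now feed only the self-harm bucket, a consumer that previously relied on the violence threshold to catch them would need to set a `selfHarm` threshold as well.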