feat: expand moderation categories #163

Open
walkerfrankenberg wants to merge 2 commits into main from wf/moderation-categories

@walkerfrankenberg walkerfrankenberg commented Apr 8, 2026

Collaborator

Overview

Expand the moderation endpoint from 2 categories (sexual, violence) to 5 by adding hate, self-harm, and drugs. Both OpenAI and Hive providers now extract and return the new scores. Also fixes a bogus Hive category (garm_death_injury_or_military_conflict) that never matched anything, and moves yes_self_harm/yes_emaciated_body out of the violence bucket into their own self-harm category.

What was changed

  • src/workflows/moderation.ts — Added hate, selfHarm, drugs fields to ThumbnailModerationScore, ModerationResult, and ModerationOptions interfaces. Added HIVE_HATE_CATEGORIES, HIVE_SELF_HARM_CATEGORIES, HIVE_ILLICIT_CATEGORIES constant arrays. Removed garm_death_injury_or_military_conflict from HIVE_VIOLENCE_CATEGORIES (invalid class name from Hive's GARM model, not the visual moderation model). Moved yes_self_harm/yes_emaciated_body from violence to self-harm. OpenAI extraction now reads hate, self-harm, and illicit from category_scores (mapped to drugs in our API). exceedsThreshold now checks all 5 categories.
  • tests/unit/moderation.test.ts — Snapshot tests for all 5 Hive category arrays.
  • tests/eval/moderation.eval.ts — Response-integrity scorer and eval columns updated for new fields.
  • docs/API.md — Documented new threshold options and response fields.
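The bucketing and mapping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from `src/workflows/moderation.ts`: only `yes_self_harm` and `yes_emaciated_body` are class names taken from this PR, the other array entries are placeholders, and the helper names `maxScoreFor` and `mapOpenAIScores` as well as the simplified `{ class, score }` shape of the Hive output are assumptions.

```typescript
// Sketch only; the real constants and extraction live in src/workflows/moderation.ts.
type Category = "sexual" | "violence" | "hate" | "selfHarm" | "drugs";
type CategoryScores = Record<Category, number>;

// Assumed, simplified shape of one Hive class result.
interface HiveClass {
  class: string;
  score: number;
}

// These two class names come from the PR description.
const HIVE_SELF_HARM_CATEGORIES: readonly string[] = ["yes_self_harm", "yes_emaciated_body"];
// Placeholder entries: the actual Hive class names are not listed in the PR description.
const HIVE_HATE_CATEGORIES: readonly string[] = ["placeholder_hate_class"];
const HIVE_ILLICIT_CATEGORIES: readonly string[] = ["placeholder_illicit_class"];

// Highest score among the Hive classes that belong to one bucket (0 if none match).
function maxScoreFor(classes: readonly HiveClass[], names: readonly string[]): number {
  return classes
    .filter((c) => names.includes(c.class))
    .reduce((max, c) => Math.max(max, c.score), 0);
}

// Map OpenAI category_scores keys onto the new fields; "illicit" is exposed as "drugs".
function mapOpenAIScores(
  categoryScores: Record<string, number>,
): Pick<CategoryScores, "hate" | "selfHarm" | "drugs"> {
  return {
    hate: categoryScores["hate"] ?? 0,
    selfHarm: categoryScores["self-harm"] ?? 0,
    drugs: categoryScores["illicit"] ?? 0,
  };
}
```

Under these assumptions, a class such as `yes_self_harm` contributes only to `selfHarm` and no longer raises the `violence` score.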

Suggested review order

  1. src/workflows/moderation.ts — types and Hive category constants (lines 30-190)
  2. src/workflows/moderation.ts — OpenAI extraction changes (search for categoryScores.hate)
  3. src/workflows/moderation.ts — Hive extraction and aggregation (search for HIVE_HATE_CATEGORIES)
  4. src/workflows/moderation.ts — getModerationScores max score + threshold logic (bottom of file)
  5. tests/unit/moderation.test.ts
  6. docs/API.md
  7. tests/eval/moderation.eval.ts
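For reviewers reaching step 4, the max-score and threshold logic might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual `getModerationScores` code: the function names `maxScores` and `exceedsThreshold` mirror terms used in this PR, but their signatures, the strict `>` comparison, and the choice to skip categories without a configured threshold are all guesses.

```typescript
// Sketch only; signatures and comparison semantics are assumptions.
type Category = "sexual" | "violence" | "hate" | "selfHarm" | "drugs";
type CategoryScores = Record<Category, number>;

const CATEGORIES: readonly Category[] = ["sexual", "violence", "hate", "selfHarm", "drugs"];

// Per-category maximum across all sampled thumbnails.
function maxScores(samples: readonly CategoryScores[]): CategoryScores {
  const result: CategoryScores = { sexual: 0, violence: 0, hate: 0, selfHarm: 0, drugs: 0 };
  for (const sample of samples) {
    for (const category of CATEGORIES) {
      result[category] = Math.max(result[category], sample[category]);
    }
  }
  return result;
}

// True if any category with a configured threshold scores strictly above it.
// Assumption: categories with no threshold are not checked.
function exceedsThreshold(scores: CategoryScores, thresholds: Partial<CategoryScores>): boolean {
  return CATEGORIES.some((category) => {
    const limit = thresholds[category];
    return limit !== undefined && scores[category] > limit;
  });
}
```

Iterating over one shared `CATEGORIES` array keeps the aggregation and the threshold check in step, so adding a sixth category later would touch a single list.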

Note

Medium Risk
Expands the getModerationScores API surface and thresholding logic from 2 to 5 categories, which can break downstream consumers expecting the old schema and changes what content gets flagged. Provider-specific category mapping tweaks (especially Hive category buckets) also risk behavior changes in production moderation results.

Overview
getModerationScores now returns and thresholds five moderation categories (adds hate, selfHarm, drugs alongside sexual/violence) for both OpenAI and Hive providers, including updated maxScores, per-sample scores, and exceedsThreshold evaluation.

Hive category handling is refactored into explicit HIVE_HATE_CATEGORIES, HIVE_SELF_HARM_CATEGORIES, and HIVE_ILLICIT_CATEGORIES, removing an invalid category and moving self-harm-related classes out of the violence bucket. Docs and eval/unit tests are updated to reflect the new fields and category constants.

Reviewed by Cursor Bugbot for commit a22dd6f. Bugbot is set up for automated code reviews on this repo. Configure here.

@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@snyk-io

snyk-io Bot commented Apr 8, 2026


Snyk checks have passed. No issues have been found so far.

Scan Engine            Critical  High  Medium  Low  Total (0)
Open Source Security   0         0     0       0    0 issues
Licenses               0         0     0       0    0 issues


@walkerfrankenberg walkerfrankenberg requested a review from a team April 8, 2026 21:32

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 2cca2be. Configure here.

Comment thread src/workflows/moderation.ts

@daniel-hayes daniel-hayes left a comment

Collaborator

LGTM

@walkerfrankenberg

Collaborator Author

Okay, so according to the OpenAI docs, they only support image classification for sexual, violence, and self-harm. They don't do it for drugs or hate 😭 So I'm going to shelve this until they release a more up-to-date model.
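
One way to check this per response, sketched under the assumption that the moderation result includes a `category_applied_input_types` map (as documented for the omni-moderation models, to the best of my knowledge). The interface below is a trimmed-down stand-in rather than the SDK's type, and `imageScoredCategories` is a hypothetical helper name.

```typescript
// Sketch only: a reduced view of one moderation result, not the SDK's own type.
type InputType = "text" | "image";

interface ModerationResultLike {
  category_scores: Record<string, number>;
  category_applied_input_types: Record<string, InputType[]>;
}

// Names of the categories whose score actually took the image input into account.
function imageScoredCategories(result: ModerationResultLike): string[] {
  return Object.entries(result.category_applied_input_types)
    .filter(([, inputTypes]) => inputTypes.includes("image"))
    .map(([name]) => name);
}
```

If `hate` or `illicit` never shows up in that list for thumbnail inputs, their scores only reflect any text sent alongside the image, which is why returning them for image moderation would be misleading.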


2 participants