6 changes: 6 additions & 0 deletions .env.example
@@ -36,6 +36,7 @@ DOCUMENTS_DIR=./documents
# Screenshots directory (for goal statement screenshots)
SCREENSHOTS_DIR=./documents/screenshots

<<<<<<< Updated upstream
# External sources
PMA_TRUMP47_URL=https://www.performance.gov/trump47pma/
TREASURY_PRESS_RELEASES_URL=https://home.treasury.gov/news/press-releases
@@ -57,5 +58,10 @@ INGEST_MIN_INTERVAL_S=0.25
# (Back-compat) You can still set these directly, but then it's always enabled for that process:
# SSL_CERT_FILE=/absolute/path/to/apex/certs/merged-ca-bundle.pem
# REQUESTS_CA_BUNDLE=/absolute/path/to/apex/certs/merged-ca-bundle.pem
=======
# Optional corporate CA bundle
SSL_CERT_FILE=./certs/merged-ca-bundle.pem
REQUESTS_CA_BUNDLE=./certs/merged-ca-bundle.pem
>>>>>>> Stashed changes


3 changes: 3 additions & 0 deletions .gitignore
@@ -39,6 +39,9 @@ documents/rejected/
# Generated extraction/evaluation artifacts
runs/

# cursor plans
.cursor/plans/*.md

# Runtime logs
logs/*.log
logs/*.out
42 changes: 37 additions & 5 deletions README.md
@@ -177,8 +177,8 @@ For new strategic plans:
1. drop incoming PDFs into `documents/uploads`
2. register and rename them deterministically
3. scan them into the database
4. extract structure, provenance, and context
5. run QA and bounded retries
4. extract structure and provenance
5. tag goals, refresh summaries, and run QA
6. review them in the UI

The current batch helper is:
@@ -189,10 +189,24 @@ The current batch helper is:

That workflow:
- reads PDFs from `documents/uploads`
- infers metadata deterministically where possible
- moves files into the managed corpus folders
- infers agency metadata deterministically where possible
- moves files into the managed corpus folder (`documents/strategic_plans`)
- registers them in the database
- runs extraction, provenance, summaries, and QA
- runs extraction, tagging, summaries, and QA

When you are working with a fresh database, make sure the `agencies` table includes the agencies you plan to ingest. Agency inference and registration require a matching abbreviation or name. If Treasury is missing, insert it before running registration or extraction:

```bash
sqlite3 apex.db "insert into agencies (name, abbreviation) values ('Department of the Treasury', 'USDT');"
```

Run that command once to seed Treasury, then proceed with registration or uploads.
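
For a repeatable setup, the same seeding step can be made idempotent so that running it twice does not create a duplicate row. The following is a minimal sketch, assuming the `agencies` table has `name` and `abbreviation` columns as in the command above. The `seed_agency` helper and the simplified in-memory schema are hypothetical and exist only for illustration.

```python
import sqlite3


def seed_agency(conn: sqlite3.Connection, name: str, abbreviation: str) -> bool:
    """Insert an agency unless one with the same abbreviation already exists.

    Returns True when a row was inserted and False when it was already present.
    """
    row = conn.execute(
        "SELECT 1 FROM agencies WHERE abbreviation = ?", (abbreviation,)
    ).fetchone()
    if row is not None:
        return False
    conn.execute(
        "INSERT INTO agencies (name, abbreviation) VALUES (?, ?)",
        (name, abbreviation),
    )
    conn.commit()
    return True


# Demo against an in-memory database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agencies (id INTEGER PRIMARY KEY, name TEXT NOT NULL, abbreviation TEXT)"
)
first = seed_agency(conn, "Department of the Treasury", "USDT")
second = seed_agency(conn, "Department of the Treasury", "USDT")
row_count = conn.execute("SELECT COUNT(*) FROM agencies").fetchone()[0]
```

To try this against a real database, replace `":memory:"` with the path to `apex.db` and drop the `CREATE TABLE` statement.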

Naming expectations for uploads:
- include the agency abbreviation (preferred) or agency name in the filename so the agency can be inferred
- if the filename starts with the agency abbreviation, it will be treated as the primary hint
- example upload names: `DHS_2022_strategic_plan.pdf` or `Department_of_Homeland_Security_Strategic_Plan_2022.pdf`
- registered files are renamed to `AGENCY_<slugified-title>.pdf` in `documents/strategic_plans`
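
The naming rules above can be illustrated with a short sketch. This is not the project's registration code: `KNOWN_AGENCIES`, `infer_agency`, and `managed_filename` are hypothetical names, the lookup table stands in for the `agencies` table, and the hyphenated slug format is an assumption.

```python
import re
from typing import Optional

# Hypothetical abbreviation -> name lookup; the real data lives in the `agencies` table.
KNOWN_AGENCIES = {
    "DHS": "Department of Homeland Security",
    "USDT": "Department of the Treasury",
}


def _normalize(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def infer_agency(filename: str) -> Optional[str]:
    """Return the agency abbreviation hinted at by an upload filename, if any."""
    stem = filename.rsplit(".", 1)[0]
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", stem) if t]
    # Tokens are checked in order, so a leading abbreviation wins over later ones.
    for token in tokens:
        if token.upper() in KNOWN_AGENCIES:
            return token.upper()
    # Fall back to matching the full agency name anywhere in the filename.
    normalized = _normalize(stem)
    for abbreviation, name in KNOWN_AGENCIES.items():
        if _normalize(name) in normalized:
            return abbreviation
    return None


def managed_filename(abbreviation: str, title: str) -> str:
    """Build an `AGENCY_<slugified-title>.pdf` name (slug style is assumed)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{abbreviation}_{slug}.pdf"


by_abbreviation = infer_agency("DHS_2022_strategic_plan.pdf")
by_name = infer_agency("Department_of_Homeland_Security_Strategic_Plan_2022.pdf")
renamed = managed_filename("DHS", "Strategic Plan 2022")
```

A filename with neither an abbreviation nor an agency name returns `None` in this sketch, which corresponds to the case where the agency cannot be inferred.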

You can still use the lower-level registration helper directly when needed:

@@ -206,6 +220,24 @@ Or run extraction separately:
./venv/bin/python scripts/run_batch_pipeline.py --document-type strategic_plan
```

## Test Press Release Generation

Use the generator to insert simulated GSA press releases into the database for matching tests.
It reuses existing PMA goals/objectives in the database and writes a sidecar JSON file with
the expected matches.

```bash
./venv/bin/python scripts/mock/generation/seed_gsa_press_releases.py --match-rate 0.5 --count 15
```

Available parameters:
- `--match-rate {0.25,0.5,0.75,1.0}`: target portion of releases expected to match
- `--count 15`: number of press releases to create (default: 15)
- `--seed 1337`: deterministic objective selection seed
- `--output <path>`: sidecar JSON output location
- `--source gsa`: press release source code (default: gsa)
- `--url-prefix <prefix>`: URL prefix used for unique test URLs
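
The parameter list above maps naturally onto an `argparse` parser. The sketch below is an assumption about how such a command line could be declared, not the actual contents of `seed_gsa_press_releases.py`. Only the defaults stated in the list (`--count 15`, `--source gsa`) are filled in; the others are left unset because the list does not document them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; defaults not stated in the README stay None."""
    parser = argparse.ArgumentParser(
        description="Insert simulated press releases for matching tests (sketch)."
    )
    parser.add_argument(
        "--match-rate",
        type=float,
        choices=[0.25, 0.5, 0.75, 1.0],
        default=None,  # default not documented; left unset here
        help="target portion of releases expected to match",
    )
    parser.add_argument("--count", type=int, default=15,
                        help="number of press releases to create")
    parser.add_argument("--seed", type=int, default=None,
                        help="deterministic objective selection seed")
    parser.add_argument("--output", default=None,
                        help="sidecar JSON output location")
    parser.add_argument("--source", default="gsa",
                        help="press release source code")
    parser.add_argument("--url-prefix", default=None,
                        help="URL prefix used for unique test URLs")
    return parser


args = build_parser().parse_args(["--match-rate", "0.5", "--count", "15"])
```

With `choices` set, a value such as `--match-rate 0.3` is rejected by `argparse` before the script body runs.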

## Review Workflow

The review UI is optimized for plan inspection, not agency dashboards.
4 changes: 2 additions & 2 deletions backend/config.py
@@ -185,8 +185,8 @@ def _database_url_from_env() -> str:
SCREENSHOTS_DIR: str = os.getenv("SCREENSHOTS_DIR", str(BASE_DIR / ".." / "documents" / "screenshots"))

# External sources (ingestion)
PMA_TRUMP47_URL: str = os.getenv("PMA_TRUMP47_URL", "").strip()
TREASURY_PRESS_RELEASES_URL: str = os.getenv("TREASURY_PRESS_RELEASES_URL", "").strip()
PMA_TRUMP47_URL: str = os.getenv("PMA_TRUMP47_URL", "https://www.performance.gov/trump47pma/").strip()
TREASURY_PRESS_RELEASES_URL: str = os.getenv("TREASURY_PRESS_RELEASES_URL", "https://home.treasury.gov/news/press-releases").strip()
HHS_PRESS_RELEASES_URL: str = os.getenv("HHS_PRESS_RELEASES_URL", "https://www.hhs.gov/press-room/index.html").strip()
STATE_PRESS_RELEASES_URL: str = os.getenv("STATE_PRESS_RELEASES_URL", "https://www.state.gov/press-releases/").strip()
JUSTICE_PRESS_RELEASES_URL: str = os.getenv("JUSTICE_PRESS_RELEASES_URL", "https://www.justice.gov/news/press-releases").strip()
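
One detail of the `os.getenv(name, default).strip()` pattern is worth noting when reviewing the new defaults: the default applies only when the variable is absent, not when it is set to an empty or blank value. The sketch below demonstrates that behavior with the standard library. `read_url` mirrors the pattern in this hunk, while `read_url_or_default` is a hypothetical variant shown for comparison, not a proposed change.

```python
import os

# Default copied from the config change above.
DEFAULT_URL = "https://home.treasury.gov/news/press-releases"
VAR = "TREASURY_PRESS_RELEASES_URL"


def read_url(name: str, default: str) -> str:
    """Mirror the config pattern: the default is used only when the variable is unset."""
    return os.getenv(name, default).strip()


def read_url_or_default(name: str, default: str) -> str:
    """Hypothetical variant: also fall back when the variable is set but blank."""
    return os.getenv(name, "").strip() or default


os.environ.pop(VAR, None)
unset_value = read_url(VAR, DEFAULT_URL)  # variable absent

os.environ[VAR] = "   "
blank_value = read_url(VAR, DEFAULT_URL)  # variable present but blank
blank_fallback = read_url_or_default(VAR, DEFAULT_URL)

os.environ.pop(VAR, None)  # remove the demo value again
```

In this sketch a blank entry in `.env` yields an empty string from `read_url`, so a copied `.env.example` line with its value deleted would still disable the source.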
17 changes: 16 additions & 1 deletion backend/main.py
@@ -8,7 +8,21 @@

from backend.config import SCREENSHOTS_DIR
from backend.database import create_tables
from backend.routers import agencies, airtable, chat, crosswalk, documents, evaluation, extraction, objectives, plans, query, search, semantic
from backend.routers import (
    agencies,
    airtable,
    chat,
    crosswalk,
    documents,
    evaluation,
    extraction,
    objectives,
    plans,
    press_releases,
    query,
    search,
    semantic,
)

logging.basicConfig(
    level=logging.INFO,
@@ -88,6 +102,7 @@
app.include_router(semantic.router)
app.include_router(plans.router)
app.include_router(objectives.router)
app.include_router(press_releases.router)
app.include_router(crosswalk.router)


127 changes: 127 additions & 0 deletions backend/models/__init__.py
@@ -0,0 +1,127 @@
from backend.models.agency_create import AgencyCreate
from backend.models.agency_mention_response import AgencyMentionResponse
from backend.models.agency_profile_response import AgencyProfileResponse
from backend.models.agency_response import AgencyResponse
from backend.models.batch_extraction_response import BatchExtractionResponse
from backend.models.chat_request import ChatRequest
from backend.models.chat_response import ChatResponse
from backend.models.citation import Citation
from backend.models.document_response import DocumentResponse
from backend.models.extraction_status import ExtractionStatus
from backend.models.goal_detail_response import GoalDetailResponse
from backend.models.goal_objective_match_counts_response import GoalObjectiveMatchCountsResponse
from backend.models.goal_response import GoalResponse
from backend.models.measure_comparison import MeasureComparison
from backend.models.measure_response import MeasureResponse
from backend.models.measurement_instance_response import MeasurementInstanceResponse
from backend.models.objective_detail_response import ObjectiveDetailResponse
from backend.models.objective_list_item_response import ObjectiveListItemResponse
from backend.models.objective_match_count_item import ObjectiveMatchCountItem
from backend.models.objective_match_response import ObjectiveMatchResponse
from backend.models.objective_matches_response import ObjectiveMatchesResponse
from backend.models.objective_response import ObjectiveResponse
from backend.models.paginated_response import PaginatedResponse
from backend.models.plan_agency_summary import PlanAgencySummary
from backend.models.plan_goal_response import PlanGoalResponse
from backend.models.plan_objective_response import PlanObjectiveResponse
from backend.models.plan_response import PlanResponse
from backend.models.plan_summary_response import PlanSummaryResponse
from backend.models.plan_element_stakeholder_relation_response import (
    PlanElementStakeholderRelationResponse,
)
from backend.models.press_release_detail_response import PressReleaseDetailResponse
from backend.models.press_release_list_item import PressReleaseListItem
from backend.models.press_release_match_attempt import PressReleaseMatchAttempt
from backend.models.press_release_source_summary import PressReleaseSourceSummary
from backend.models.qa_finding_response import QaFindingResponse
from backend.models.query_filters import QueryFilters
from backend.models.report_measure_fact_response import ReportMeasureFactResponse
from backend.models.search_response import SearchResponse
from backend.models.search_result_item import SearchResultItem
from backend.models.semantic_change_feed_response import SemanticChangeFeedResponse
from backend.models.semantic_edge_response import SemanticEdgeResponse
from backend.models.semantic_manifest_response import SemanticManifestResponse
from backend.models.semantic_neighbor_response import SemanticNeighborResponse
from backend.models.semantic_node_response import SemanticNodeResponse
from backend.models.semantic_tombstone_response import SemanticTombstoneResponse
from backend.models.shared_priority_edge_response import SharedPriorityEdgeResponse
from backend.models.stakeholder_relation_response import StakeholderRelationResponse
from backend.models.stakeholder_response import StakeholderResponse
from backend.models.strategic_plan_element_detail_response import (
    StrategicPlanElementDetailResponse,
)
from backend.models.strategic_plan_element_response import StrategicPlanElementResponse
from backend.models.stored_citation_response import StoredCitationResponse
from backend.models.theme_agency import ThemeAgency
from backend.models.theme_detail_response import ThemeDetailResponse
from backend.models.theme_goal import ThemeGoal
from backend.models.theme_summary import ThemeSummary
from backend.models.topic_canonicalization_response import TopicCanonicalizationResponse
from backend.models.topic_canonicalization_upsert import TopicCanonicalizationUpsert
from backend.models.topic_curation_candidate_response import TopicCurationCandidateResponse
from backend.models.topic_detail_response import TopicDetailResponse
from backend.models.topic_summary_response import TopicSummaryResponse
from backend.models.topic_suggested_match_response import TopicSuggestedMatchResponse

__all__ = [
    "AgencyCreate",
    "AgencyMentionResponse",
    "AgencyProfileResponse",
    "AgencyResponse",
    "BatchExtractionResponse",
    "ChatRequest",
    "ChatResponse",
    "Citation",
    "DocumentResponse",
    "ExtractionStatus",
    "GoalDetailResponse",
    "GoalObjectiveMatchCountsResponse",
    "GoalResponse",
    "MeasureComparison",
    "MeasureResponse",
    "MeasurementInstanceResponse",
    "ObjectiveDetailResponse",
    "ObjectiveListItemResponse",
    "ObjectiveMatchCountItem",
    "ObjectiveMatchResponse",
    "ObjectiveMatchesResponse",
    "ObjectiveResponse",
    "PaginatedResponse",
    "PlanAgencySummary",
    "PlanElementStakeholderRelationResponse",
    "PlanGoalResponse",
    "PlanObjectiveResponse",
    "PlanResponse",
    "PlanSummaryResponse",
    "PressReleaseDetailResponse",
    "PressReleaseListItem",
    "PressReleaseMatchAttempt",
    "PressReleaseSourceSummary",
    "QaFindingResponse",
    "QueryFilters",
    "ReportMeasureFactResponse",
    "SearchResponse",
    "SearchResultItem",
    "SemanticChangeFeedResponse",
    "SemanticEdgeResponse",
    "SemanticManifestResponse",
    "SemanticNeighborResponse",
    "SemanticNodeResponse",
    "SemanticTombstoneResponse",
    "SharedPriorityEdgeResponse",
    "StakeholderRelationResponse",
    "StakeholderResponse",
    "StrategicPlanElementDetailResponse",
    "StrategicPlanElementResponse",
    "StoredCitationResponse",
    "ThemeAgency",
    "ThemeDetailResponse",
    "ThemeGoal",
    "ThemeSummary",
    "TopicCanonicalizationResponse",
    "TopicCanonicalizationUpsert",
    "TopicCurationCandidateResponse",
    "TopicDetailResponse",
    "TopicSummaryResponse",
    "TopicSuggestedMatchResponse",
]
12 changes: 12 additions & 0 deletions backend/models/agency_create.py
@@ -0,0 +1,12 @@
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel


class AgencyCreate(BaseModel):
    name: str
    abbreviation: Optional[str] = None
    parent_agency_id: Optional[int] = None
    is_cfo_act_agency: Optional[bool] = None
21 changes: 21 additions & 0 deletions backend/models/agency_mention_response.py
@@ -0,0 +1,21 @@
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel


class AgencyMentionResponse(BaseModel):
    id: int
    agency_id: int
    mention_text: str
    evidence_text: Optional[str] = None
    source_page: Optional[int] = None
    mention_role: Optional[str] = None
    relationship_summary: Optional[str] = None
    match_method: Optional[str] = None
    confidence: Optional[str] = None
    agency_name: Optional[str] = None
    agency_abbreviation: Optional[str] = None

    model_config = {"from_attributes": True}
28 changes: 28 additions & 0 deletions backend/models/agency_profile_response.py
@@ -0,0 +1,28 @@
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel

from backend.models.document_response import DocumentResponse
from backend.models.goal_response import GoalResponse


class AgencyProfileResponse(BaseModel):
    id: int
    name: str
    abbreviation: Optional[str] = None
    agency_code: Optional[str] = None
    goal_count: int = 0
    objective_count: int = 0
    measure_count: int = 0
    document_count: int = 0
    measures_improving: int = 0
    measures_declining: int = 0
    measures_stable: int = 0
    measures_unknown: int = 0
    goals: list[GoalResponse] = []
    documents: list[DocumentResponse] = []
    themes: list[str] = []

    model_config = {"from_attributes": True}
21 changes: 21 additions & 0 deletions backend/models/agency_response.py
@@ -0,0 +1,21 @@
from __future__ import annotations

from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class AgencyResponse(BaseModel):
    id: int
    name: str
    abbreviation: Optional[str] = None
    agency_code: Optional[str] = None
    parent_agency_id: Optional[int] = None
    parent_agency_name: Optional[str] = None
    parent_agency_abbreviation: Optional[str] = None
    is_cfo_act_agency: Optional[bool] = None
    created_at: Optional[datetime] = None
    goal_count: int = 0

    model_config = {"from_attributes": True}
12 changes: 12 additions & 0 deletions backend/models/batch_extraction_response.py
@@ -0,0 +1,12 @@
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel


class BatchExtractionResponse(BaseModel):
    queued_count: int = 0
    skipped_count: int = 0
    document_ids: list[int] = []
    message: Optional[str] = None
11 changes: 11 additions & 0 deletions backend/models/chat_request.py
@@ -0,0 +1,11 @@
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    question: str
    agency_id: Optional[int] = None
    document_id: Optional[int] = None
10 changes: 10 additions & 0 deletions backend/models/chat_response.py
@@ -0,0 +1,10 @@
from __future__ import annotations

from pydantic import BaseModel

from backend.models.citation import Citation


class ChatResponse(BaseModel):
    answer: str
    citations: list[Citation] = []
11 changes: 11 additions & 0 deletions backend/models/citation.py
@@ -0,0 +1,11 @@
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel


class Citation(BaseModel):
    source: Optional[str] = None
    page: Optional[int] = None
    text: Optional[str] = None