Feature: Workspace Text Extraction & Full-Text Page Search #6

Description

@YurMil

Currently, the search bar in the toolbar only matches document names, page labels, or page numbers. For professional workflows, users need to search for text keywords inside the PDF pages themselves and filter or highlight the pages that contain those matches.

Technical Proposal:

  1. In pdfjsReader.ts, add a helper to extract text from a page using:
    const textContent = await page.getTextContent();
    // Marked-content entries have no `str`, so treat them as empty.
    const pageText = textContent.items.map(item => ("str" in item ? item.str : "")).join(" ");
  2. Index this text content during document ingestion in ingest.worker.ts and add it to the page entity schema in types.ts.
  3. Integrate a client-side search index (e.g., flexsearch or a simple keyword regex matcher) in the store selectors to filter the page grid.
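A minimal sketch of how steps 1 to 3 might fit together, using the simple keyword regex option rather than flexsearch. The names `PageEntity`, `joinTextItems`, and `filterPagesByKeyword` are hypothetical, and the item types below are structural stand-ins for what `getTextContent()` returns, not the actual types in pdfjsReader.ts or types.ts:

```typescript
// Stand-ins for the items returned by page.getTextContent().
// Marked-content entries carry no `str`, so they are modelled separately.
interface TextItemLike { str: string }
interface MarkedContentLike { type: string }
type TextContentItem = TextItemLike | MarkedContentLike;

// Hypothetical page entity with the proposed `text` field added.
interface PageEntity {
  id: string;
  label: string;
  text: string;
}

// Step 1: flatten text items into one whitespace-normalized string.
function joinTextItems(items: TextContentItem[]): string {
  return items
    .filter((item): item is TextItemLike => "str" in item)
    .map((item) => item.str)
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}

// Escape user input so characters like "(" are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 3: case-insensitive keyword filter over page text and label.
// An empty query returns every page unchanged.
function filterPagesByKeyword(pages: PageEntity[], query: string): PageEntity[] {
  const trimmed = query.trim();
  if (trimmed === "") return pages;
  const pattern = new RegExp(escapeRegExp(trimmed), "i");
  return pages.filter((p) => pattern.test(p.text) || pattern.test(p.label));
}
```

In this sketch the ingest worker would call `joinTextItems(textContent.items)` once per page and store the result on the page entity, so the selector only does string matching at search time. A regex scan is linear in the total text size, which may be fine for small workspaces; a dedicated index would likely be worth it for large ones.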

