Fixes #5600 az_search_api document extraction #5610

Merged
bberndt-uaz merged 6 commits into main from issue/5600 on Jun 17, 2026

@tadean tadean commented May 14, 2026

Contributor

Description

This PR adds usage of document extraction to the search_api index, so that PDF contents will be searchable. This is done using search_api_attachments and a searchstax document extraction endpoint.

Release notes

az_search_api will now index the contents of document files attached to index entities.

Related issues

#5600

How to test

Note: to test this PR you need the endpoint and token for a searchstax environment. You can look these up for a searchstax environment with the following steps:

  • Log in at searchstudio.searchstax.com
  • Select or create a Drupal site search endpoint
  • Click App Settings
  • Click All APIs
  • Copy the Update Endpoint to use as the value of SEARCH_ENVIRONMENT_UPDATE_ENDPOINT below
  • Copy the Read & Write token to use as the value of SEARCH_ENVIRONMENT_UPDATE_TOKEN below
  • Copy the URL in the Document Extractor tab to use as the value of AZ_SEARCH_API_EXTRACTOR_ENDPOINT
  • Copy the Discovery API key under the Discovery tab to use as the value of AZ_SEARCH_API_EXTRACTOR_TOKEN

If testing locally:

Note: You must complete this step before starting your Lando container!
Create a .env file containing:

  • a secret called AZ_SEARCH_API_TOKEN with a value equal to JSON matching the following format:
    {"update_endpoint": "SEARCH_ENVIRONMENT_UPDATE_ENDPOINT", "update_token": "SEARCH_ENVIRONMENT_UPDATE_TOKEN"}
  • a secret called AZ_SEARCH_API_EXTRACTOR_ENDPOINT with the URL of the searchstax extraction endpoint
  • a secret called AZ_SEARCH_API_EXTRACTOR_TOKEN with the authentication token of the searchstax extraction endpoint

In Lando, the .env file should look like this:

AZ_SEARCH_API_TOKEN={"update_endpoint": "SEARCH_ENVIRONMENT_UPDATE_ENDPOINT", "update_token": "SEARCH_ENVIRONMENT_UPDATE_TOKEN"}
AZ_SEARCH_API_EXTRACTOR_TOKEN=yourauthtoken
AZ_SEARCH_API_EXTRACTOR_ENDPOINT=https://url.to.your.endpoint.com/extract
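
Since AZ_SEARCH_API_TOKEN has to hold valid JSON with both keys, a malformed value is an easy mistake to make. The following is a minimal sketch (not part of this PR) of a Python check for the secret's shape. The function names `parse_env` and `check_search_token` are hypothetical, and the parser assumes simple unquoted KEY=value lines like the ones shown above.

```python
import json

REQUIRED_KEYS = {"update_endpoint", "update_token"}


def parse_env(text):
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so "=" inside a value survives.
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def check_search_token(raw):
    """Return the decoded token dict, or raise ValueError if it is malformed."""
    try:
        token = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"AZ_SEARCH_API_TOKEN is not valid JSON: {exc}") from exc
    if not isinstance(token, dict):
        raise ValueError("AZ_SEARCH_API_TOKEN must be a JSON object")
    missing = REQUIRED_KEYS - token.keys()
    if missing:
        raise ValueError(f"AZ_SEARCH_API_TOKEN is missing keys: {sorted(missing)}")
    return token
```

Running `check_search_token(parse_env(open(".env").read())["AZ_SEARCH_API_TOKEN"])` before starting the container would surface a broken value early.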

If testing on Pantheon:

  • enable Pantheon Secrets module
  • create a Pantheon secret for the environment called AZ_SEARCH_API_TOKEN matching the JSON schema above
  • create a Pantheon secret for the environment called AZ_SEARCH_API_EXTRACTOR_TOKEN with the extractor token
  • create a Pantheon secret for the environment called AZ_SEARCH_API_EXTRACTOR_ENDPOINT with the extractor endpoint
  • wait several minutes for the environment secret to provision to the running environment

Now that secrets are configured:

  • Enable az_search_api. Do not do this without first configuring the secrets described above.
  • Log in to the site as an administrator
  • Visit /admin/config/search/search-api/server/az_search_api_searchstax
    • Look for the message "The Solr server could be reached" for Server Connection
  • Create a Person node. Give them an uploaded PDF CV with known text.
  • Visit /admin/config/search/search-api/index/az_search_api_index
  • Click Clear all indexed data
  • Click Queue all items for reindexing
  • Click Index now
  • Verify the first batch completed without error
  • Wait several minutes. It can take time for the searchstax server to commit the transaction.
  • Check for documents in the index, or look at the index in the Searchstax dashboard
  • Verify extracted text appears in the indexed document

You can manually check the contents of the index via this curl command, substituting the select endpoint and select token for your searchstax environment:

curl -H "Authorization: Token SEARCH_ENVIRONMENT_SELECT_TOKEN" "SEARCH_ENVIRONMENT_SELECT_ENDPOINT?q=*:*"
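
For a scripted version of the same check, here is a rough Python sketch using only the standard library. It assumes the select endpoint returns the usual Solr JSON layout (`response.docs`); the function names are hypothetical, and because the name of the indexed attachment field is not stated here, the helper simply scans every field value for the known text.

```python
import json
import urllib.parse
import urllib.request


def build_select_request(select_endpoint, select_token, query="*:*"):
    """Build (but do not send) the same GET request as the curl command."""
    url = select_endpoint + "?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(
        url, headers={"Authorization": f"Token {select_token}"}
    )


def docs_containing(solr_response, needle):
    """Return docs in which any field value contains the needle text."""
    matches = []
    for doc in solr_response.get("response", {}).get("docs", []):
        for value in doc.values():
            # Solr fields may be single values or lists of values.
            items = value if isinstance(value, list) else [value]
            if any(needle.lower() in str(item).lower() for item in items):
                matches.append(doc)
                break
    return matches


def fetch_docs(request, timeout=30):
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would look like `docs_containing(fetch_docs(build_select_request(endpoint, token)), "known text from the PDF")`; an empty list shortly after indexing may just mean the commit has not landed yet.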

Types of changes

Arizona Quickstart (install profile, custom modules, custom theme)

  • Patch release changes
    • Bug fix
    • Accessibility, performance, or security improvement
    • Critical institutional link or brand change
    • Adding experimental module
    • Update experimental module
  • Minor release changes
    • New feature
    • Breaking or visual change to existing behavior
    • Upgrade experimental module to stable
    • Enable existing module by default or database update
    • Non-critical brand change
    • New internal API or API improvement with backwards compatibility
    • Risky or disruptive cleanup to comply with coding standards
    • High-risk or disruptive change (requires upgrade path, risks regression, etc.)
  • Other or unknown
    • Other or unknown

Drupal core

  • Patch release changes
    • Security update
    • Patch level release (non-security bug-fix release)
    • Patch removal that's no longer necessary
  • Minor release changes
    • Major or minor level update
  • Other or unknown
    • Other or unknown

Drupal contrib projects

  • Patch release changes
    • Security update
    • Patch or minor level update
    • Add new module
    • Patch removal that's no longer necessary
  • Minor release changes
    • Major level update
  • Other or unknown
    • Other or unknown

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • My change requires release notes.

@tadean tadean self-assigned this May 14, 2026
@tadean tadean requested review from a team as code owners May 14, 2026 22:27
@tadean tadean added the enhancement (New feature or request), high priority (Must get done for this milestone), patch release (Issues to be included in the next patch release), integrations (This relates to an integration into a central service.), and backport-2.x (Changes to be back-ported to the 2.x development branch) labels May 14, 2026
tadean commented May 14, 2026

Contributor Author
  • Determine which file types to exclude
  • Determine which file fields should be included? (in particular, fields not known to quickstart?)
  • Depending on above, we may need to create an extra processor that collates all file fields into a single field in the index
  • Database update to enable search api attachments for existing sites
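
To make the open questions above concrete, here is an illustrative Python sketch (the module itself is PHP, and this is not the PR's implementation) of excluding file types and collating text from several file fields into one index field. The exclusion list, field names, and the `extract` callable are all hypothetical placeholders.

```python
import os

# Hypothetical exclusion list; the discussion above leaves the real one undecided.
EXCLUDED_EXTENSIONS = {".zip", ".mp4", ".png"}


def is_extractable(filename, excluded=EXCLUDED_EXTENSIONS):
    """Return True unless the file's extension is on the exclusion list."""
    extension = os.path.splitext(filename)[1].lower()
    return extension not in excluded


def collate_extracted_text(files_by_field, extract):
    """Merge text from every file field into one list for a single index field.

    `files_by_field` maps a field name to a list of filenames; `extract` is
    any callable that turns a filename into text (a stand-in for the remote
    extraction service).
    """
    collated = []
    for field_name in sorted(files_by_field):
        for filename in files_by_field[field_name]:
            if is_extractable(filename):
                collated.append(extract(filename))
    return collated
```

Collating into one field means the index does not need to know each site's file fields in advance, which is the concern raised about fields not known to quickstart.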

@joeparsons joeparsons moved this from Todo to In Progress in 3.3.7 bug-fix patch release May 15, 2026
@joeparsons joeparsons moved this to In Progress in 3.4.0-alpha2 May 15, 2026
@bberndt-uaz bberndt-uaz moved this to In Progress in 3.4.0 minor release May 20, 2026
tadean commented May 22, 2026

Contributor Author

Updated with an extractor plugin not linked to a particular field and added db update.

@bberndt-uaz bberndt-uaz moved this from Todo to In Progress in 3.4.1 bug-fix patch release May 22, 2026
@tadean tadean moved this from In Progress to Needs review in 3.3.7 bug-fix patch release May 29, 2026
@tadean tadean moved this from In Progress to Needs review in 3.4.1 bug-fix patch release May 29, 2026
tadean commented May 29, 2026

Contributor Author

Relocating this to 3.4.1 for additional time to review.

Copilot AI left a comment

Pull request overview

Adds Searchstax-powered document text extraction to the az_search_api Search API index so attached PDFs (and other supported files) contribute searchable text via search_api_attachments.

Changes:

  • Introduces a custom Search API processor (az_attachments) extending search_api_attachments extraction behavior (including Media->File resolution).
  • Adds Search API index configuration to store extracted attachment text in a dedicated field and include it in fulltext searching.
  • Adds Searchstax extractor endpoint/token Key entities + Key config overrides, and wires in the search_api_attachments dependency (module + composer).
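
The Media->File resolution mentioned in the first bullet can be pictured with a small Python sketch. This is a conceptual illustration only, not the processor's actual PHP code, and the data shapes (`node["fields"]`, `media_by_id`) are hypothetical.

```python
def collect_file_ids(node, media_by_id):
    """Collect file IDs referenced directly or through media entities.

    Hypothetical data shapes: `node["fields"]` maps a field name to a list
    of references such as {"type": "file", "id": 7} or
    {"type": "media", "id": 3}; `media_by_id` maps a media ID to the list
    of file IDs that media item holds.
    """
    file_ids = []
    for references in node.get("fields", {}).values():
        for ref in references:
            if ref["type"] == "file":
                candidates = [ref["id"]]
            elif ref["type"] == "media":
                # Follow the extra hop: media entity -> its source file(s).
                candidates = media_by_id.get(ref["id"], [])
            else:
                continue
            for fid in candidates:
                if fid not in file_ids:  # keep order, drop duplicates
                    file_ids.append(fid)
    return file_ids
```

The point of the extra hop is that a file attached through a media reference field ends up in the same extraction queue as one attached through a plain file field.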

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| modules/custom/az_search_api/src/Plugin/search_api/processor/AZAttachment.php | New processor to extract/index file contents (including Media-referenced files). |
| modules/custom/az_search_api/config/quickstart/search_api_attachments.admin_config.yml | Provides Quickstart override defaults for search_api_attachments extraction (Searchstax method). |
| modules/custom/az_search_api/config/install/search_api.index.az_search_api_index.yml | Adds an extracted attachments field and includes it in fulltext fields; configures processor settings. |
| modules/custom/az_search_api/config/install/key.key.az_search_api_extractor_token.yml | Adds Key entity for the extractor auth token. |
| modules/custom/az_search_api/config/install/key.key.az_search_api_extractor_endpoint.yml | Adds Key entity for the extractor endpoint. |
| modules/custom/az_search_api/config/install/key.config_override.az_search_api_extractor_token.yml | Overrides search_api_attachments config token from Key/secret. |
| modules/custom/az_search_api/config/install/key.config_override.az_search_api_extractor_endpoint.yml | Overrides search_api_attachments config endpoint from Key/secret. |
| modules/custom/az_search_api/az_search_api.install | Adds an update hook to ensure search_api_attachments is installed on existing sites. |
| modules/custom/az_search_api/az_search_api.info.yml | Declares search_api_attachments as a module dependency. |
| composer.json | Adds drupal/search_api_attachments as a Composer requirement. |


trackleft previously approved these changes Jun 12, 2026
@joeparsons joeparsons added the needs-CWS-testing Needs manual pre/post release testing by Campus Web Services label Jun 12, 2026
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
/**
 * Ensure search_api_attachments module is installed.
 */
function az_search_api_update_1021401() {

@joeparsons joeparsons Jun 16, 2026

Member

This DB update ran successfully on a Quickstart 3 Pantheon multidev where I tested this PR. Since we haven't created any other DB updates for this experimental module, I think we probably don't need to bother with different hook_update_N() schema versions that correspond to different Quickstart versions in this particular case. If we did want to do that, I think we'd have to use the new core API described on Drupal.org: "New API to mark database updates as equivalent".

@joeparsons joeparsons left a comment

Member

Tested on Pantheon multidev and verified that:

  • search_api_attachments module was installed automatically by database update
  • Nodes containing fid values in file or media reference fields (e.g. Person nodes with attached CV documents) had their files indexed successfully
  • Content of attached files is indexed by SearchStax (@tadean helped me verify this)
  • Searching for phrases contained in attached documents on SearchStax yields the parent node in search results

One potential follow-up that @tadean and I discussed would be to also include files that are embedded via Entity Embed (e.g. in text fields on Paragraphs). This PR does not currently do that.

@tadean and I also agreed that it doesn't make sense to index files that are manually linked to in Node content (which this PR does not do).

@github-project-automation github-project-automation Bot moved this from Needs review to Ready to merge in 3.4.1 bug-fix patch release Jun 17, 2026
@bberndt-uaz bberndt-uaz merged commit 2fd9d2a into main Jun 17, 2026
32 of 33 checks passed
@bberndt-uaz bberndt-uaz deleted the issue/5600 branch June 17, 2026 17:19
@github-project-automation github-project-automation Bot moved this from Ready to merge to Done in 3.4.1 bug-fix patch release Jun 17, 2026
