Fixes #5600 az_search_api document extraction #5610
Updated with an extractor plugin not linked to a particular field and added db update.
Relocating this to
Pull request overview
Adds Searchstax-powered document text extraction to the az_search_api Search API index so attached PDFs (and other supported files) contribute searchable text via search_api_attachments.
Changes:
- Introduces a custom Search API processor (`az_attachments`) extending `search_api_attachments` extraction behavior (including Media->File resolution).
- Adds Search API index configuration to store extracted attachment text in a dedicated field and include it in fulltext searching.
- Adds Searchstax extractor endpoint/token Key entities + Key config overrides, and wires in the `search_api_attachments` dependency (module + composer).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| modules/custom/az_search_api/src/Plugin/search_api/processor/AZAttachment.php | New processor to extract/index file contents (including Media-referenced files). |
| modules/custom/az_search_api/config/quickstart/search_api_attachments.admin_config.yml | Provides Quickstart override defaults for search_api_attachments extraction (Searchstax method). |
| modules/custom/az_search_api/config/install/search_api.index.az_search_api_index.yml | Adds an extracted attachments field and includes it in fulltext fields; configures processor settings. |
| modules/custom/az_search_api/config/install/key.key.az_search_api_extractor_token.yml | Adds Key entity for the extractor auth token. |
| modules/custom/az_search_api/config/install/key.key.az_search_api_extractor_endpoint.yml | Adds Key entity for the extractor endpoint. |
| modules/custom/az_search_api/config/install/key.config_override.az_search_api_extractor_token.yml | Overrides search_api_attachments config token from Key/secret. |
| modules/custom/az_search_api/config/install/key.config_override.az_search_api_extractor_endpoint.yml | Overrides search_api_attachments config endpoint from Key/secret. |
| modules/custom/az_search_api/az_search_api.install | Adds an update hook to ensure search_api_attachments is installed on existing sites. |
| modules/custom/az_search_api/az_search_api.info.yml | Declares search_api_attachments as a module dependency. |
| composer.json | Adds drupal/search_api_attachments as a Composer requirement. |
/**
 * Ensure search_api_attachments module is installed.
 */
function az_search_api_update_1021401() {
This DB update ran successfully on a Quickstart 3 Pantheon multidev where I tested this PR. Since we haven't created any other DB updates for this experimental module, we probably don't need separate hook_update_N() schema versions corresponding to different Quickstart versions in this particular case. If we did want to do that, I think we'd have to use this new core API: New API to mark database updates as equivalent | Drupal.org
joeparsons left a comment
Tested on Pantheon multidev and verified that:
- The `search_api_attachments` module was installed automatically by the database update
- Nodes containing `fid` values in file or media reference fields (e.g. Person nodes with attached CV documents) had their files indexed successfully
- Content of attached files is indexed by SearchStax (@tadean helped me verify this)
- Searching for phrases contained in attached documents on SearchStax yields the parent node in search results
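The first check in the list above can also be repeated from the command line. This is a minimal sketch, assuming Drush is available on the site; `module_in_list` is a hypothetical helper introduced here for illustration, not part of this PR.

```shell
# Hypothetical helper: succeed if the module name ($1) is exactly one of
# the lines read from stdin. -x matches whole lines, so "search_api"
# does not match "search_api_attachments"; -F treats the name literally.
module_in_list() {
  grep -qxF -- "$1"
}

# Usage on a site with Drush (assumption: run after the database updates):
#   drush pm:list --type=module --status=enabled --field=name \
#     | module_in_list search_api_attachments && echo "installed"
```

Matching whole lines matters here because several enabled modules share the `search_api` prefix.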
One potential follow-up that @tadean and I discussed would be to also include files that are embedded via Entity Embed (e.g. in text fields on Paragraphs). (This PR does not do that currently.)
@tadean and I also agreed that it doesn't make sense to index files that are manually linked to in Node content (which this PR does not do).
Description
This PR adds usage of document extraction to the search_api index, so that PDF contents will be searchable. This is done using `search_api_attachments` and a searchstax document extraction endpoint.
Release notes
Related issues
#5600
How to test
Note: to test this PR you need the endpoint and token for a searchstax environment. You can look these up for a searchstax environment with the following steps:
If testing locally:
Note: You must complete this step before starting your lando container!
Create a `.env` containing:
- `AZ_SEARCH_API_TOKEN` with a value equal to JSON matching the following format:
- `AZ_SEARCH_API_EXTRACTOR_ENDPOINT` with the URL of the searchstax extraction endpoint
- `AZ_SEARCH_API_EXTRACTOR_TOKEN` with the authentication token of the searchstax extraction endpoint

In Lando, the env file should look like this:
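A minimal sketch of such a file, using only the three variable names listed above. Every value is a placeholder, and the JSON structure expected by `AZ_SEARCH_API_TOKEN` is not reproduced here, so it is left opaque. The helper names `write_env_template` and `check_env_file` are hypothetical and exist only for this illustration.

```shell
# Write a placeholder env file to the path given as $1.
# All values are hypothetical and must be replaced with real secrets.
write_env_template() {
  cat > "$1" <<'EOF'
AZ_SEARCH_API_TOKEN=REPLACE_WITH_JSON
AZ_SEARCH_API_EXTRACTOR_ENDPOINT=REPLACE_WITH_EXTRACTOR_URL
AZ_SEARCH_API_EXTRACTOR_TOKEN=REPLACE_WITH_EXTRACTOR_TOKEN
EOF
}

# Return 0 only if every required variable name is assigned in the file ($1).
# This checks for presence of the names, not validity of the values.
check_env_file() {
  for name in AZ_SEARCH_API_TOKEN \
              AZ_SEARCH_API_EXTRACTOR_ENDPOINT \
              AZ_SEARCH_API_EXTRACTOR_TOKEN; do
    if ! grep -q "^${name}=" "$1"; then
      echo "missing: ${name}" >&2
      return 1
    fi
  done
  return 0
}
```

Running `check_env_file .env` before starting the Lando container gives an early warning if one of the three names is absent.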
If testing on Pantheon:
- `AZ_SEARCH_API_TOKEN` matching the JSON schema above
- `AZ_SEARCH_API_EXTRACTOR_TOKEN` with the extractor token
- `AZ_SEARCH_API_EXTRACTOR_ENDPOINT` with the extractor endpoint

Now that secrets are configured:
- `az_search_api`: Do not do this without first configuring the secrets described above
- `/admin/config/search/search-api/server/az_search_api_searchstax`
- `/admin/config/search/search-api/index/az_search_api_index`

You can manually check the contents of the index via this curl command:
Types of changes
Arizona Quickstart (install profile, custom modules, custom theme)
Drupal core
Drupal contrib projects
Checklist