Martin Holmes edited this page Sep 13, 2023 · 4 revisions
Meeting notes
2023-09-13
We have decided to allow any namespace in the searchable collection, as long as either a prefix for that namespace is declared on the root element of the config file, or the xpath-default-namespace attribute is specified on that root element.
The tokenizer span should then be in the staticSearch namespace.
We should consider forking after checking well-formedness to allow for the use of a jar file to make the HTML well-formed. https://htmlcleaner.sourceforge.net/ is a good option, being a single open-source jar we could include in our repo.
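The check-then-branch idea above can be sketched in a few lines. This is only an illustration in Python, not the project's build code: `is_well_formed`, `prepare_document`, and the `clean` callable are hypothetical names, and `clean` stands in for whatever cleaner is eventually chosen (for example a wrapper around the HtmlCleaner jar, whose invocation is deliberately not shown here).

```python
import xml.etree.ElementTree as ET
from typing import Callable


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def prepare_document(markup: str, clean: Callable[[str], str]) -> str:
    """Pass well-formed documents through; send the rest to a cleaner.

    `clean` is a hypothetical hook for an external cleaning step.
    """
    if is_well_formed(markup):
        return markup  # normal path: nothing to repair
    cleaned = clean(markup)  # alternative path: attempt a repair
    if not is_well_formed(cleaned):
        raise ValueError("document is still not well-formed after cleaning")
    return cleaned
```

Keeping the cleaner behind a single callable means the well-formed path never touches it, and the choice of cleaner can change without affecting the rest of the process.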
We also discussed issues #219 and #246. Multiple passes through the document have to run in a particular order, and therefore can't easily make good decisions about what to prioritize and what to drop. Instead, we should use a single pass with an accumulator that builds a profile of the element, followed by a decision function at the end, so that the algorithm is clearly specified in a single location.
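A minimal sketch of the accumulator-plus-decision-function shape, written in Python purely for illustration. The profile fields (`excluded`, `weight`, `labels`), the observation kinds, and the priority rule in `decide` are all hypothetical; they are not taken from issues #219 or #246, and the real profile would hold whatever those issues require.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class Profile:
    """Hypothetical facts gathered about one element in a single pass."""
    excluded: bool = False
    weight: int = 1
    labels: Set[str] = field(default_factory=set)


def accumulate(observations: Iterable[Tuple[str, object]]) -> Profile:
    """Fold every observation into the profile without deciding anything."""
    profile = Profile()
    for kind, value in observations:
        if kind == "exclude":
            profile.excluded = profile.excluded or bool(value)
        elif kind == "weight":
            profile.weight = max(profile.weight, int(value))
        elif kind == "label":
            profile.labels.add(str(value))
    return profile


def decide(profile: Profile) -> dict:
    """The one place where priorities are stated (illustrative rule only)."""
    if profile.excluded:
        # In this sketch, exclusion outranks everything else.
        return {"index": False, "weight": 0, "labels": []}
    return {"index": True, "weight": profile.weight,
            "labels": sorted(profile.labels)}
```

Because `accumulate` only records facts and `decide` alone resolves conflicts, the result in this sketch does not depend on the order in which observations arrive, which is the property the multi-pass approach lacks.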