Skip to content

QREPO-193 create StructuredTextExtractionFilter, add it to filter.plu…#1

Draft
msutya wants to merge 5 commits into
qulto-7.6.1from
QREPO-193
Draft

QREPO-193 create StructuredTextExtractionFilter, add it to filter.plu…#1
msutya wants to merge 5 commits into
qulto-7.6.1from
QREPO-193

Conversation

@msutya

@msutya msutya commented Mar 20, 2024

Copy link
Copy Markdown
Collaborator

…gins, add jackson-dataformat-xml dependency

References

Add references/links to any related issues or PRs. These may include:

  • Fixes #issue-number (if this fixes an issue ticket)
  • Related to DSpace/RestContract#pr-number (if a corresponding REST Contract PR exists)

Description

Short summary of changes (1-2 sentences).

Instructions for Reviewers

Please add a more detailed description of the changes made by your PR. At a minimum, providing a bulleted list of changes in your PR is helpful to reviewers.

List of changes in this PR:

  • First, ...
  • Second, ...

Include guidance for how to test or review your PR. This may include: steps to reproduce a bug, screenshots or description of a new feature, or reasons behind specific changes.

Checklist

This checklist provides a reminder of what we are going to look for when reviewing your PR. You need not complete this checklist prior to creating your PR (draft PRs are always welcome). If you are unsure about an item in the checklist, don't hesitate to ask. We're here to help!

  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
  • My PR passes Checkstyle validation based on the Code Style Guide.
  • My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods.
  • My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
  • If my PR includes new libraries/dependencies (in any pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR modifies REST API endpoints, I've opened a separate REST Contract PR related to this change.
  • If my PR includes new configurations, I've provided basic technical documentation in the PR itself.
  • If my PR fixes an issue ticket, I've linked them together.

…gins, add jackson-dataformat-xml dependency
@msutya msutya requested a review from kanasznagyzoltan March 20, 2024 16:33
@msutya msutya marked this pull request as draft March 20, 2024 16:34
…date dependencies, exclude dependencies that clash with others
Comment thread dspace-api/pom.xml Outdated
return new ByteArrayInputStream(extractedText.getBytes(StandardCharsets.UTF_8));
}

return null;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Először egy kicsit fennakadt a szemen azon, hogy "null"-t adunk vissza. Viszont látom, hogy a használó kód vizsgálja, hogy az adott konkrét Plugin-tól visszakapott InputStream null-e, szóval ezen nem változtatnék plusz a DSpace alapműködése is már erre lett építve.

String extractedText;

PDDocument document = PDDocument.load(source);
Splitter splitter = new Splitter();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kiszervezhető egy field-be az osztályon belül.

resultStream.close();
}

private InputStream getMultiPagePDF() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inkább egy olyan függvényt célszerű használni erre a célra, mint pl.: loadPdf(String filename) és konkrétan ráhívni a megfelelő fájlnévvel.


@Test
public void testGetFilteredName() {
StructuredTextExtractionFilter filter = new StructuredTextExtractionFilter();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kiszervezhető a példányosítás akár @before függvénybe, ami lefut minden testfüggvény előtt vagy ha igazából lehet belőle egyetlen egy példány mint a valóságban is, akkor igazából lehet a filter osztályváltozó is.

InputStream resultStream = filter.getDestinationStream(item, getMultiPagePDF(), true);

assertNotNull(resultStream);
byte[] bytes = new byte[100];

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Az egyezést vhogy máshogy kellene checkolni, ebbe biztos bele fognak kötni.
Nekem egy ilyen ötletem van:

  • Az expected értéket xml-ként resource file-ban tárolni az eredeti PDF mellett
  • Felolvasni Java objektumba az Expected és a Ténylegesen extractolt értéket és ha a megfelelő equals metódus implementálva van, akkor össze lehet hasonlítani assertEquals-szal

import org.dspace.app.mediafilter.model.Pages;
import org.dspace.content.Item;

public class StructuredTextExtractionFilter extends MediaFilter {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gondolkoztam rajta, de szerintem a nevén majd módosítanunk kell, nem fogunk tudni oldalalapú fulltext-et adni csak a PDF dokumentumokhoz, más típusokhoz pl.: Word típusú fájlokhoz vagy pl.: txt fájlhoz sem.
Szerintem jobb neve lenne StructuredPdfTextExtractionFilter, mert így akkor neve utalni fog arra, hogy PDF-hez készült.
Azt lenne jó megnézni, hogy a kódban lehet-e vhogy jelezni, hogy ez a Filter csak "application/pdf" mimetype-ú vagy pedig PDF típusú fájlokra képest lefutni. (DSpace ne futtassa más típusú fájlokra.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Az ősosztályban van egy olyan függvény amit erre lehetne használni, ha override-olásra kerül.
/**
* Perform any pre-processing of the source bitstream before the actual
* filtering takes place in MediaFilterManager.processBitstream().
*


* Return true if pre-processing is successful (or no pre-processing
* is necessary). Return false if bitstream should be skipped
* for any reason.
*
* @param c context
* @param item item containing bitstream to process
* @param source source bitstream to be processed
* @param verbose verbose mode
* @return true if bitstream processing should continue,
* false if this bitstream should be skipped
* @throws Exception if error
*/
@OverRide
public boolean preProcessBitstream(Context c, Item item, Bitstream source, boolean verbose)
throws Exception {
return true; //default to no pre-processing
}

…sion, rename classes, use xml file as expected result
Comment thread dspace-api/pom.xml Outdated
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-xml</artifactId>
<version>${dataformat.version}</version>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tegyük a property név elé hogy jackson -> jackson-dataformat.version


if (useTemporaryFile) {
return extractUsingTempFile(source, verbose);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sokat szépítene a dolgon, ha ennek az if-nek lenne egy else ága, ami pl.: ráhívna egy olyan függvényre mint pl.: extractUsingMemory() és akkor jobban olvashatóvá válna ez a függvény, nagyon sokat rövidülne.

}

try (FileWriter writer = new FileWriter(tempExtractedTextFile, StandardCharsets.UTF_8)) {
ContentHandlerDecorator handler = new BodyContentHandler(new ContentHandlerDecorator() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ez a kiíratás a Tika-hoz kapcsolódik, az XmlMapper, ami az osztályodban van nem tudja kiírni xml-ként egy adott outputstreamre?

kanasznagyzoltan pushed a commit that referenced this pull request Jun 17, 2025
[DURACOM-316] added IT for new coar fixes impl.
@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
activity in one year. It will be closed in 14 days if no further activity occurs.

Allowing pull requests to close as stale helps us filter out old work that is no longer
relevant and helps developers focus on reviewing current work.

All pull requests closed by this bot act like normal pull requests;
they can be searched for, commented on or reopened at any point.

If these changes are still relevant then please comment and/or rebase your PR based on the
latest DSpace code. This will remove the stale status and notify us to assign a reviewer
for your PR.

@github-actions github-actions Bot added the stale label Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants