Invalid structMap produced #1195

@mikegerber

This might actually be a bug in ocrd_cis, so beware.

We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce it with the latest ocrd:all/maximum Docker image, with the following steps:

  1. Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
  2. I removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
  3. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID

Invalid structMap (multiple divs with the same ID) after step 3, shortened to one physical page for emphasis:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>

       ...

      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
  
       ...

    </mets:div>
  </mets:structMap>

(I'll upload the full data in the comments)
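
A quick way to check a METS file for this problem is to count the page div IDs in the physical structMap. The following is only a minimal sketch using the Python standard library, not anything from ocrd itself: the function name duplicate_page_ids and the SAMPLE document are made up for illustration, and SAMPLE is a trimmed-down version of the excerpt above wrapped in a mets:mets root.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard METS namespace; the prefix "mets" matches the excerpt above.
METS_NS = {"mets": "http://www.loc.gov/METS/"}


def duplicate_page_ids(mets_xml: str) -> dict[str, int]:
    """Return {ID: count} for page divs in the physical structMap
    whose ID occurs more than once (hypothetical helper)."""
    root = ET.fromstring(mets_xml)
    divs = root.findall(
        ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']",
        METS_NS,
    )
    # Only count divs that actually carry an ID attribute.
    counts = Counter(
        div.get("ID") for div in divs if div.get("ID") is not None
    )
    return {page_id: n for page_id, n in counts.items() if n > 1}


# Illustrative input only: a reduced version of the invalid structMap.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

if __name__ == "__main__":
    print(duplicate_page_ids(SAMPLE))  # {'P_1879_45_0344': 2}
```

To check a real workspace, one could pass the contents of its mets.xml to the function instead of SAMPLE; an empty result would mean that no page ID is repeated.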

This causes all kinds of breakage all over the place.

What I haven't checked yet is whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. At first I had the impression that it also breaks with add, but I was trying to reproduce a problem encountered by @stweil in OCR-D/quiver-benchmarks#22, and that specific workflow uses ocrd_cis as its first step. So the cause could have been ocrd_cis all along, and I could easily have confused something.