Some medical and clinical corpora need to be adapted before they can be used in experiments, e.g., converted from XML files to TXT files. This repository is a collection of scripts that prepare several German and English clinical corpora for experiments. Currently, the focus is on German only. Each entry below gives either a link to the resource or a reference to a script with instructions on requirements and usage, in case the texts need to be converted into plain text.
- 3000PA_J: code for creating the Jena part of 3000PA by converting MSDoc files to plain text
- GraSCCo
- JSynCC-2-CaseDescription
- JSynCC-2-CaseReport
- JSynCC-2-Discussion
- JSynCC-2-EmergencyReport
- JSynCC-2-OperativeReport
- GGPOnc-2
- JSynCC-2-PubMed
- MuchMore: creates a subset of 7808 documents
- NTS_Animal_Reports
- Wikipedia_ICD-10
- Technical_Laymen: request the data from https://aclanthology.org/2020.lrec-1.759.pdf
- Keepha-Adr: request the data from https://aclanthology.org/2024.lrec-main.36/
- Lifeline: request the data from https://aclanthology.org/2022.lrec-1.388/
- 3000PA requires Java.
- The code for all other corpora requires Python; see requirements.txt for the dependencies.
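
To illustrate the kind of adaptation mentioned above (XML files to TXT files), here is a minimal sketch using only the Python standard library. It is not the code from this repository: the function names `xml_to_text` and `convert_directory` are hypothetical, and it assumes that all text nodes of an XML file should end up in the plain-text output. The actual corpora may need corpus-specific handling (e.g., selecting only certain elements).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_text(xml_source):
    """Return the text content of an XML document (str or bytes),
    one non-empty text fragment per line.

    Assumption: every text node belongs in the output; tags and
    attributes are dropped.
    """
    root = ET.fromstring(xml_source)
    # itertext() yields all text fragments in document order.
    fragments = (fragment.strip() for fragment in root.itertext())
    return "\n".join(fragment for fragment in fragments if fragment)


def convert_directory(input_dir, output_dir):
    """Write one .txt file per .xml file found directly in input_dir.

    Returns the number of converted files.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(Path(input_dir).glob("*.xml")):
        # Reading bytes lets the parser honor the encoding declared in the file.
        text = xml_to_text(xml_path.read_bytes())
        (out / (xml_path.stem + ".txt")).write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

A call such as `convert_directory("corpus_xml", "corpus_txt")` (both directory names are placeholders) would then produce one plain-text file per XML file.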
If you have further questions, do not hesitate to contact Christina Lohr.
Last edit: 2025/05/04.