This project brings together several objectives and challenges in building an Information Retrieval system: more precisely, an engine for indexing and searching unstructured information. Initially, it will index the information and data contained in text files.
- Read a Corpus
- Tokenizing
- Stemming
- Filter by Stop Words
- Indexing by flexible Rules defined in the code
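
The steps listed above can be sketched roughly as follows. This is a minimal illustration in Python, not this project's actual code: the names `STOP_WORDS`, `tokenize`, `stem` and `build_index` are hypothetical, the corpus is an in-memory dictionary rather than text files on disk, the stemmer is a naive suffix stripper standing in for a real algorithm, and filtering stop words before stemming is an assumption about the order of the steps.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(corpus):
    """Map each stemmed, non-stop-word term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            if token in STOP_WORDS:
                continue  # assumed order: stop words are dropped before stemming
            index[stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# In-memory corpus used in place of reading text files from disk.
corpus = {
    "doc1": "Indexing the text files",
    "doc2": "The index of a file is searched",
}
index = build_index(corpus)
```

The result is a simple inverted index: each term maps to the documents in which it occurs, which is the structure a query module would look terms up in.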
The main motivation for this project is a desire to learn, plan and overcome new challenges; in this case, to study and apply the concepts and philosophies acquired in the field of Information Retrieval.
- Resolve problems with the regex pattern for hyphens between characters and numbers
- Query Module
- Perform Information Retrieval calculations and tests
- Unit Testing
- Error handling
- Documentation / Wiki
- Improve Performance
- Generate a library from this code
Thanks to Daniel Santos for his many contributions to the code.