Search engine made out of hopes
- Working indexing and searching.
- Use a faster, more failure-resistant database that supports concurrent access.
- For authentication
- For indexing
- Save queued URLs in a file on exit to recover indexing on restart.
- Indexing improvements
- Auto-queueing of URLs found on webpages.
- Verify `href` values ARE links.
- Transform relative links into absolute links.
- Understand why the database fails when auto-queueing websites with UTF-16 characters.
- Save the description and title of indexed pages.
- Solve an issue with data not being reset on re-indexing.
- Better algorithm for page scoring based on content.
- Save page's language data.
- Scoring based on external links.
- Use robots.txt and sitemaps, so that submitting only a domain name and a sitemap URL to the indexer is enough for the bot to do everything by itself.
- Avoid indexing pages that returned a non-successful HTTP code.
- Improve the search console.
- Add an option to make only certain URLs trigger queueing of the useful URLs they link to, which would avoid infinite indexing loops over pages with no meaningful content.
- Search improvements
- Implement result pagination instead of the 100-result limit.
- Implement localization-segregated search.
- Implement a better user experience to navigate through search results.
- Implement page descriptions.
- Make the UI better.
- Quality of Life
- Streamline the development and integration processes of SPAs
- Debug
- Improve the way the `routes!` macro is used with debug routes.
- Prevent compiling to production with `debug` features enabled.
Sending a JSON-formatted list of URLs to `/index` starts an indexing process
for those URLs.
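As an illustration of what such a request body might look like, the sketch below serializes a list of URLs as a JSON array of strings using only the standard library. The exact payload shape expected by `/index` is not specified here, so the plain array is an assumption, and `urls_to_json` and `escape_json` are hypothetical helper names.

```rust
/// Serializes URLs as a JSON array of strings.
/// Assumption: the endpoint accepts a plain array such as ["https://a.example"].
pub fn urls_to_json(urls: &[&str]) -> String {
    let items: Vec<String> = urls
        .iter()
        .map(|u| format!("\"{}\"", escape_json(u)))
        .collect();
    format!("[{}]", items.join(","))
}

/// Escapes the characters that JSON does not allow unescaped inside a string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            // Control characters must be written as \uXXXX escapes.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

In a real client, a JSON library would normally replace the hand-written escaping; it is spelled out here only to keep the sketch dependency-free.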
Indexing gives a website a score for each word it contains, as follows:
- A word has a base score of `n`, the number of occurrences of this word.
- A word present in the title has its score multiplied by 20.
- A word present in the description has its score multiplied by 8.
- A word present in a `p` or `span` tag has its score multiplied by 1.
- A word present in an `h1` tag has its score multiplied by 15.
- A word present in an `h2` tag has its score multiplied by 10.
- A word present in an `h3` tag has its score multiplied by 7.
- A word present in an `h4` tag has its score multiplied by 5.
- A word present in an `h5` tag has its score multiplied by 3.
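The weighting above can be sketched as follows. This is a minimal illustration, not the project's actual code: `Region`, `weight`, and `score_words` are hypothetical names, the page is assumed to be already split into (region, text) pairs, and two further assumptions are made: each occurrence of a word contributes the multiplier of the region it appears in, and words are split on non-alphanumeric characters.

```rust
use std::collections::HashMap;

/// Part of a page in which a word was found (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Region {
    Title,
    Description,
    Text, // a `p` or `span` tag
    H1,
    H2,
    H3,
    H4,
    H5,
}

/// Multiplier for a region, taken from the list above.
pub fn weight(region: Region) -> u32 {
    match region {
        Region::Title => 20,
        Region::Description => 8,
        Region::Text => 1,
        Region::H1 => 15,
        Region::H2 => 10,
        Region::H3 => 7,
        Region::H4 => 5,
        Region::H5 => 3,
    }
}

/// Scores every word of a page, given its text grouped by region.
/// Assumption: each occurrence adds the multiplier of its region.
pub fn score_words(parts: &[(Region, &str)]) -> HashMap<String, u32> {
    let mut scores = HashMap::new();
    for (region, text) in parts {
        for word in text.split(|c: char| !c.is_alphanumeric()) {
            if word.is_empty() {
                continue;
            }
            // Words are lowercased before being counted.
            *scores.entry(word.to_lowercase()).or_insert(0) += weight(*region);
        }
    }
    scores
}
```

Under these assumptions, a word that appears once in the title and twice in `p` tags gets a score of 20 + 1 + 1 = 22.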
Each word found is lowercased before processing. The scores of each website are stored in an SQL database; for example, the table for a word X looks like:
| URL | SCORE |
|---|---|
| www.google.com | 128 |
| 128.0.0.2 | 16 |
A Type-Token Ratio (the number of distinct words divided by the total number of words) is also calculated and stored in the table that holds data about the URL. It gives a rough indication of the page quality.
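A Type-Token Ratio can be computed as in the sketch below. The function name `type_token_ratio` is hypothetical, and the tokenization (lowercasing, then splitting on non-alphanumeric characters) is an assumption made for the example.

```rust
use std::collections::HashSet;

/// Type-Token Ratio: distinct words divided by total words.
/// Returns 0.0 for an empty text to avoid dividing by zero.
pub fn type_token_ratio(text: &str) -> f64 {
    // Assumed tokenization: split on non-alphanumeric characters, lowercase.
    let tokens: Vec<String> = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect();
    if tokens.is_empty() {
        return 0.0;
    }
    let types: HashSet<&str> = tokens.iter().map(|t| t.as_str()).collect();
    types.len() as f64 / tokens.len() as f64
}
```

For "the cat and the dog" there are 4 distinct words out of 5, so the ratio is 0.8. A page that repeats the same few words many times gets a ratio close to 0.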
This technique is not ideal and is meant to be upgraded. The next phase is to use hyperlinks when indexing websites to determine a domain score, which could play a role in finding the best results for a query.
Pages that did not return a successful 2XX HTTP status code are not indexed.
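The status check can be sketched as a simple range test. `is_success` and `indexable` are hypothetical names, and the fetched pages are assumed to be available as (URL, status code) pairs.

```rust
/// Returns true when an HTTP status code is in the 2XX (success) range.
pub fn is_success(status: u16) -> bool {
    (200..300).contains(&status)
}

/// Keeps only the URLs whose request returned a 2XX status code.
/// `fetched` pairs each URL with the status code of its response.
pub fn indexable<'a>(fetched: &[(&'a str, u16)]) -> Vec<&'a str> {
    fetched
        .iter()
        .filter(|(_, status)| is_success(*status))
        .map(|(url, _)| *url)
        .collect()
}
```

Redirects (3XX) are rejected by this check, so a crawler relying on it would have to follow redirects before testing the final status.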
Search queries are sent to `/search`; the `q` parameter contains the query string.
To get the best search results, the query string is split into a list of
words. A "leaderboard" of matching websites is built: the score a website received
at indexing for each word in the query is added to its matching score.
The Type-Token Ratio of each page modifies the final score of that page in the search results.
The server returns the list of matching results to the client, starting from the best one.
This technique is not ideal because the accuracy of the search results depends on the query length.
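The ranking described above can be sketched with in-memory maps standing in for the database. Everything named here (`WordIndex`, `search`, the `ttr` map) is hypothetical. How the Type-Token Ratio modifies the final score is not specified, so this sketch assumes it is used as a multiplier; lowercasing the query words and splitting them on whitespace are also assumptions.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the per-word tables: word -> (URL -> score).
pub type WordIndex = HashMap<String, HashMap<String, u32>>;

/// Ranks pages for a query, best first.
pub fn search(
    index: &WordIndex,
    ttr: &HashMap<String, f64>,
    query: &str,
) -> Vec<(String, f64)> {
    // Leaderboard: sum of the indexing scores of every query word, per URL.
    let mut leaderboard: HashMap<&str, u32> = HashMap::new();
    for word in query.split_whitespace() {
        if let Some(pages) = index.get(&word.to_lowercase()) {
            for (url, score) in pages {
                *leaderboard.entry(url.as_str()).or_insert(0) += *score;
            }
        }
    }

    let mut results: Vec<(String, f64)> = leaderboard
        .into_iter()
        .map(|(url, score)| {
            // Assumption: the TTR scales the summed score.
            // Pages without a stored TTR keep their score unchanged.
            let factor = ttr.get(url).copied().unwrap_or(1.0);
            (url.to_string(), f64::from(score) * factor)
        })
        .collect();

    // Highest score first; ties are broken by URL for a stable order.
    results.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    results
}
```

Because scores are summed per query word, a longer query gives pages more chances to accumulate points, which is the dependence on query length mentioned above.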
The whole repository must be cloned. To start the whole infrastructure, run the
`start.sh` script.
To prepare the infrastructure, run the `prepare-apps.sh` script,
which compiles and sets up all SPAs.
The following environment variables have to be provided:
- `PG_DIESEL_URL`: The `postgresql://` URL to the database.
- `VITE_SUPABASE_URL`
- `VITE_SUPABASE_KEY`
- `VITE_JOOGLE_API_ENDPOINT`: Address of Joogle
- `VITE_JOOGLE_API_ENDPOINT_DEV`: Address of Joogle in dev contexts
- `JWT_SECRET`, `VITE_JWT_SECRET`: Secret used to verify JWTs
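A startup check for these variables could look like the sketch below. `REQUIRED_VARS` and `missing_vars` are hypothetical names that are not part of the project; the lookup is passed in as a closure so the check can be exercised without touching the real environment.

```rust
/// Names of the environment variables listed above.
pub const REQUIRED_VARS: [&str; 7] = [
    "PG_DIESEL_URL",
    "VITE_SUPABASE_URL",
    "VITE_SUPABASE_KEY",
    "VITE_JOOGLE_API_ENDPOINT",
    "VITE_JOOGLE_API_ENDPOINT_DEV",
    "JWT_SECRET",
    "VITE_JWT_SECRET",
];

/// Returns the names of the required variables that `lookup` cannot resolve.
pub fn missing_vars<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    REQUIRED_VARS
        .iter()
        .copied()
        .filter(|name| lookup(*name).is_none())
        .collect()
}
```

Against the real environment it would be called as `missing_vars(|name| std::env::var(name).ok())`, and a non-empty result could be reported before starting anything else.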