- Scraping
  - Scraped ~7000 documents with `BeautifulSoup`, using `https://en.wikipedia.org/wiki/Science_fiction_film` as the seed
  - Customizable crawl depth
  - Duplicate detection
  - Saved in `.json` format with `paragraphs`, `table of contents`, `url` and `title` as fields
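
The depth limit and duplicate detection listed under Scraping can be sketched as a breadth-first crawl that remembers every URL it has already queued. This is a minimal illustration, not the project's scraper: `crawl`, `LinkParser`, and the `fetch` callable are hypothetical names, and the standard library's `html.parser` stands in for BeautifulSoup so the sketch runs without extra dependencies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, max_depth=2):
    """Breadth-first crawl from `seed`, visiting each URL at most once.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Links found on pages at `max_depth` are not followed.
    """
    seen = {seed}                 # duplicate detection: URLs already queued
    queue = deque([(seed, 0)])    # (url, depth) pairs
    pages = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages.append({"url": url, "html": html})
        if depth == max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

In a real run, `fetch` would wrap an HTTP request; passing a dictionary's `get` method is enough to try the traversal logic offline.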
- Tokenization
  - Standard tokenizer
  - Token filters: `stop`, `lowercase`, `snowball` stemmer
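
The tokenizer and filter names under Tokenization match Elasticsearch's built-in analysis components. Assuming that is the engine in use (the list does not say so), a custom analyzer could be declared roughly as below. The analyzer name `scifi_analyzer` and the helper function are hypothetical, and the filter order simply follows the list.

```python
import json


def analysis_settings(analyzer_name="scifi_analyzer"):
    """Build index settings for a custom analyzer (assumed Elasticsearch-style)."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Order taken from the feature list above.
                        "filter": ["stop", "lowercase", "snowball"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(analysis_settings(), indent=2))
```

Placing `lowercase` before `stop` is a common alternative, since a case-sensitive stopword filter would otherwise let capitalized stopwords through.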
- Support for `BM25` and the `Jelinek-Mercer` language model
- Retrieval of the top `k` relevant documents in ranked order
- Support for `conjunctive` and `disjunctive` queries
- User interface with the following features
  - `Dropdown keyword suggestions` based on Levenshtein distance, using fuzzy search
  - `Snippets` that display the most relevant fragments, built using the `unified highlighter`
  - Interface to switch between the models and query modes as the user requires
  - Results displayed as clickable links for easier access
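
The keyword suggestions in the feature list are based on edit distance. The sketch below shows one way to compute that distance and rank candidate keywords by it. The function names, the `max_dist` threshold, and the vocabulary are illustrative assumptions; the project itself appears to delegate this to the search engine's fuzzy search.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # delete ca
                cur[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),    # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def suggest(query, vocabulary, max_dist=2):
    """Return vocabulary words within max_dist edits of query, closest first."""
    scored = [(levenshtein(query, word), word) for word in vocabulary]
    return [word for dist, word in sorted(scored) if dist <= max_dist]
```

For example, `suggest("alein", ["robot", "alien"])` keeps `alien` and drops `robot`, because swapping two adjacent letters costs two substitutions under plain Levenshtein distance.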

Run with `python3 run.py`.
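
The two retrieval models named in the feature list score a query term against a document in different ways. The functions below are textbook forms of BM25 and Jelinek-Mercer smoothing, given only as an illustration: the parameter values (`k1`, `b`, `lam`) are common defaults rather than the project's settings, and a search engine's internal formulas may differ in detail.

```python
import math


def bm25_term(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


def jm_term(tf, doc_len, coll_freq, coll_len, lam=0.1):
    """Log-probability of one query term under Jelinek-Mercer smoothing.

    Mixes the document model with the collection model using weight `lam`.
    Assumes the term occurs at least once in the collection (coll_freq > 0).
    """
    p_doc = tf / doc_len if doc_len else 0.0
    p_coll = coll_freq / coll_len
    return math.log((1 - lam) * p_doc + lam * p_coll)


def score(query_terms, term_scorer):
    """Sum per-term scores; term_scorer maps a term to its score."""
    return sum(term_scorer(t) for t in query_terms)
```

A document's score for a query is the sum of the per-term values, and the top `k` documents are those with the highest sums. Because of the collection term, the Jelinek-Mercer score stays finite even when a query term is absent from the document.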