If we download all files from an organisation's website, we are likely to encounter a mix of documents, including publications, job ads, workplace policies and newsletters. If we only want to analyse one type of document (for example, publications), we should add a step to filter out the documents we're not interested in. As I see it, there are two approaches we could take to filter out irrelevant documents:
Filter by parent link
Only retrieve/analyse documents found on webpages whose URL contains a specific base URL.
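A minimal sketch of how the parent-link filter might look, assuming the crawler records the URL of the page each document was linked from. The function names, the `(document_url, parent_page_url)` pair format and the `example.org` URLs are all hypothetical, and "contains a base URL" is interpreted here as "same host, with a path at or below the base path":

```python
from urllib.parse import urlsplit


def is_under_base(page_url: str, base_url: str) -> bool:
    """Return True if page_url is on the same host as base_url and its
    path is equal to, or nested under, the base path."""
    page, base = urlsplit(page_url), urlsplit(base_url)
    if page.netloc.lower() != base.netloc.lower():
        return False
    base_path = base.path.rstrip("/")
    # Compare on a path-segment boundary so that "/publications-archive"
    # does not match a base of "/publications".
    return page.path == base_path or page.path.startswith(base_path + "/")


def filter_by_parent(documents, base_url):
    """Keep documents whose parent page sits under base_url.

    documents: iterable of (document_url, parent_page_url) pairs
               (an assumed format, not something the crawler guarantees).
    """
    return [doc for doc, parent in documents if is_under_base(parent, base_url)]


docs = [
    ("https://example.org/files/a.pdf", "https://example.org/publications/2023"),
    ("https://example.org/files/b.pdf", "https://example.org/careers"),
    ("https://example.org/files/c.pdf", "https://example.org/publications-archive"),
]
kept = filter_by_parent(docs, "https://example.org/publications")
```

This approach is cheap and deterministic, but it only works when the site groups its publications under a predictable URL path.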
Filter using LLM
Only retrieve/analyse documents that an LLM identifies as a research publication.
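A rough sketch of the LLM approach, assuming we already have extracted text for each document. To avoid tying this to any particular provider, the model call is passed in as `ask_llm`, a hypothetical callable that takes a prompt string and returns the model's reply as a string. The prompt wording, the `max_chars` cut-off and the YES/NO parsing are illustrative assumptions, not a tested classifier:

```python
from typing import Callable, Dict, List

# Illustrative prompt; the wording is an assumption and would need tuning.
PROMPT = (
    "You will be shown the opening text of a document from an "
    "organisation's website.\n"
    "Answer with exactly one word, YES or NO: is this document a "
    "research publication?\n\n"
    "Document text:\n{excerpt}"
)


def is_publication(text: str, ask_llm: Callable[[str], str],
                   max_chars: int = 2000) -> bool:
    """Ask the model whether the document is a research publication.

    Only the first max_chars characters are sent, to keep prompts short.
    """
    reply = ask_llm(PROMPT.format(excerpt=text[:max_chars]))
    # Treat anything that does not clearly start with YES as a rejection.
    return reply.strip().upper().startswith("YES")


def filter_with_llm(documents: Dict[str, str],
                    ask_llm: Callable[[str], str]) -> List[str]:
    """Return the names of documents the model labels as publications.

    documents: mapping of document name to extracted text (assumed format).
    """
    return [name for name, text in documents.items()
            if is_publication(text, ask_llm)]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the code."""
    return "Yes." if "Abstract" in prompt else "No"


docs = {
    "paper.pdf": "Abstract: We study the effect of ...",
    "job_ad.pdf": "We are hiring a communications officer ...",
}
publications = filter_with_llm(docs, fake_llm)
```

Compared with filtering by parent link, this does not depend on site structure, but it adds cost per document and its decisions would need spot-checking.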