Document content type filter #5

@ribenamaplesyrup

If we download all files from an organisation's website, we are likely to encounter a mix of documents, including publications, job ads, workplace policies and newsletters. If we only want to analyse one type of document (for example, publications), we should add a step that filters out the documents we're not interested in. As I see it, there are two approaches we could take to filter out irrelevant documents:

Filter by parent link

Only retrieve/analyse documents that are linked from webpages whose URL contains a specific base URL.
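A minimal sketch of how this filter might look, assuming the crawler records each document alongside the URL of the page it was found on. The function names, the `(document_url, parent_page_url)` pair format and the `example.org` URLs are hypothetical, and "contains a base URL" is read here as "same host, path at or below the base path", which is only one possible interpretation.

```python
from urllib.parse import urlparse


def is_under_base(parent_url: str, base_url: str) -> bool:
    """Return True if parent_url is on the same host as base_url and its
    path is equal to, or nested under, the path of base_url."""
    parent = urlparse(parent_url)
    base = urlparse(base_url)
    if parent.netloc.lower() != base.netloc.lower():
        return False
    base_path = base.path.rstrip("/")
    parent_path = parent.path.rstrip("/")
    # Compare whole path segments so "/publications-archive" does not
    # match a base of "/publications".
    return parent_path == base_path or parent_path.startswith(base_path + "/")


def filter_by_parent_link(documents, base_url):
    """Keep documents whose parent page sits under base_url.

    `documents` is assumed to be an iterable of
    (document_url, parent_page_url) pairs (a hypothetical format).
    """
    return [doc for doc, parent in documents if is_under_base(parent, base_url)]


if __name__ == "__main__":
    docs = [
        ("https://example.org/files/report.pdf", "https://example.org/publications/2023/"),
        ("https://example.org/files/job.pdf", "https://example.org/careers/"),
    ]
    print(filter_by_parent_link(docs, "https://example.org/publications"))
```

This approach is cheap and deterministic, but it relies on the organisation keeping its publications under a consistent section of the site.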

Filter using LLM

Only retrieve/analyse documents that an LLM identifies as a research publication.
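A rough sketch of the LLM-based filter, under stated assumptions: the document text has already been extracted, and `ask_llm` is a hypothetical callable that sends a prompt to whichever model we end up using and returns its reply as a string. The prompt wording, the function names and the `max_chars` cut-off are illustrative only and not tied to any particular LLM library.

```python
from typing import Callable, Iterable, List, Tuple

# Illustrative prompt; the exact wording would need tuning.
PROMPT_TEMPLATE = (
    "You are classifying documents from an organisation's website.\n"
    "Answer with exactly one word, YES or NO: is the following document "
    "a research publication?\n\n{excerpt}"
)


def is_publication(text: str, ask_llm: Callable[[str], str], max_chars: int = 4000) -> bool:
    """Ask the model whether a document looks like a research publication.

    Only the first `max_chars` characters are sent, to keep prompts short.
    """
    prompt = PROMPT_TEMPLATE.format(excerpt=text[:max_chars])
    answer = ask_llm(prompt)
    # Treat anything that does not clearly start with YES as a rejection.
    return answer.strip().upper().startswith("YES")


def filter_with_llm(
    documents: Iterable[Tuple[str, str]], ask_llm: Callable[[str], str]
) -> List[str]:
    """Return the names of documents the model classifies as publications.

    `documents` is assumed to be an iterable of (name, extracted_text) pairs.
    """
    return [name for name, text in documents if is_publication(text, ask_llm)]
```

Passing the model call in as a function keeps the filter independent of any one provider and makes it easy to test with a stub. Compared with filtering by parent link, this costs one model call per document but does not depend on how the website is organised.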
