@thompsonmj
Contributor

To address #10, as well as:

  • improve speed by converting from Pandas to Polars
  • include common names for ranks higher than genus and species
  • add testing for common names resolution

@thompsonmj
Contributor Author

Common names will now be deterministically chosen only from the GBIF Backbone Taxonomy data (exactly as provided), and none allowed through from input data.

The prioritization is: at each rank, the first English name is used if one is available; otherwise, the first name available in any language is used. If no name is available in any language at that rank, the same language prioritization is applied at the parent rank.

So the result is the most specific common name available: the first English (or, failing that, other-language) name at the most specific rank in the resolution that has any common name, going up as far as kingdom.

For example,

{
  "scientific_name": "Bulbophyllum polliculosum",
  "kingdom": "Plantae",
  "phylum": "Tracheophyta",
  "class": "Liliopsida",
  "order": "Asparagales",
  "family": "Orchidaceae", # first English name available is "Orchid"
  "genus": "Bulbophyllum", # no common name available in any language
  "species": "Bulbophyllum polliculosum", # no common name available in any language
  "common_name": "Orchid"
}
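
The prioritization described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in resolve_common_names.py: the function name `pick_common_name`, the `names_by_rank` mapping, and the sample entries are hypothetical, and the actual implementation operates on Polars data built from the GBIF backbone.

```python
# Hypothetical sketch of the rank/language prioritization; not the taxonopy code.
RANKS_SPECIFIC_TO_COARSE = [
    "species", "genus", "family", "order", "class", "phylum", "kingdom",
]


def pick_common_name(names_by_rank):
    """Pick a common name from the most specific rank that has one.

    names_by_rank maps a rank to an ordered list of (name, language) pairs.
    At the first rank with any name, prefer the first English entry,
    otherwise take the first entry in any language.
    """
    for rank in RANKS_SPECIFIC_TO_COARSE:
        candidates = names_by_rank.get(rank, [])
        if not candidates:
            continue  # nothing at this rank: fall back to the parent rank
        for name, language in candidates:
            if language == "en":
                return name
        return candidates[0][0]  # no English name: first name in any language
    return None


# Illustrative entries only; the German entry is made up for this sketch.
example = {
    "species": [],
    "genus": [],
    "family": [("Orchideen", "de"), ("Orchid", "en")],
}
print(pick_common_name(example))  # Orchid
```

Keeping the candidate lists ordered is what makes the choice deterministic: the same input order always yields the same name.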

@thompsonmj thompsonmj marked this pull request as ready for review July 24, 2025 15:11
@thompsonmj thompsonmj requested review from Copilot and vimar-gu July 24, 2025 15:11
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses issue #10 by making common name resolution deterministic while also implementing significant performance improvements and extending common name coverage to higher taxonomic ranks. The changes convert the data processing pipeline from Pandas to Polars for better performance, establish deterministic prioritization of vernacular names (English first, then other languages), and implement hierarchical common name lookup from species up to kingdom level.

Key changes:

  • Performance: Migration from Pandas to Polars for data processing
  • Deterministic Resolution: English vernacular names are consistently prioritized, with fallback to other languages
  • Extended Coverage: Common names now resolved for all taxonomic ranks from species to kingdom

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
src/taxonopy/resolve_common_names.py Core implementation converted to Polars with hierarchical common name lookup and deterministic vernacular prioritization
tests/test_resolve_common_names.py Comprehensive test suite covering all new functions with edge cases and integration tests
.github/workflows/run-tests.yaml CI workflow configuration for automated testing across multiple Python versions

thompsonmj and others added 4 commits July 24, 2025 13:14
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Collaborator

@vimar-gu vimar-gu left a comment

This looks good! We'll depend on taxonomic labels to determine the species.

@hlapp
Member

hlapp commented Jul 28, 2025

Is it actually useful to have "Orchid" repeated as the same common name for dozens of different orchid species, for example?

Also I'm surprised the common name for the family is in singular, not plural; is this really true?

@thompsonmj
Contributor Author

thompsonmj commented Jul 29, 2025

Is it actually useful to have "Orchid" repeated as the same common name for dozens of different orchid species, for example?

This is a key question. It could be? For the dozens of orchid species in the example that have no common name available in any language, it may be useful to know that they all belong to the family "Orchid."

Alternatives could be:

  • return no common name for an organism when none is available at the most specific rank
  • return a hierarchy of common names down to as specific as possible (matched against the taxonomic hierarchy entries)
  • detect when the common name comes from a coarser rank than the resolved rank and append "a {resolved_rank} in the {vernacular_name} {vernacular_rank}", e.g. "a species in the Orchid family"
  • provide CLI options to specify which of the above is desired
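
As a rough sketch of the third alternative, a helper could qualify any common name that comes from a coarser rank. The function name `describe_common_name` and its parameters are hypothetical and only mirror the template quoted in the list above; nothing like this exists in the PR.

```python
def describe_common_name(vernacular_name, vernacular_rank, resolved_rank):
    """Qualify a common name taken from a coarser rank than the resolved taxon.

    Hypothetical helper for illustration; not part of taxonopy.
    """
    if vernacular_name is None:
        return None  # first alternative: report no common name at all
    if vernacular_rank == resolved_rank:
        return vernacular_name  # name belongs to the resolved taxon itself
    return f"a {resolved_rank} in the {vernacular_name} {vernacular_rank}"


print(describe_common_name("Orchid", "family", "species"))
# a species in the Orchid family
```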

@hlapp what do you think?

Also I'm surprised the common name for the family is in singular, not plural; is this really true?

We currently take the entry exactly as-is from the VernacularName.tsv supplied by the GBIF backbone. So in this case, yes, the family name is singular ("Orchidaceae" is "taxonID": 7689 in Taxon.tsv -> 7689 Orchid en United Kingdom Species Inventory (UKSI) in VernacularName.tsv). That might not always be the case, though; I haven't looked into how consistently higher-rank common names are singular or plural.
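
For reference, the lookup path from Taxon.tsv to VernacularName.tsv can be sketched with the standard library. The two in-memory tables below are tiny stand-ins for the real files; the column names and the reduced column set are assumptions made for this sketch, and `vernacular_names_for` is a hypothetical helper rather than a taxonopy function.

```python
import csv
import io

# Tiny stand-ins for the GBIF backbone files (assumed, reduced column set).
TAXON_TSV = "taxonID\tcanonicalName\ttaxonRank\n7689\tOrchidaceae\tfamily\n"
VERNACULAR_TSV = (
    "taxonID\tvernacularName\tlanguage\tsource\n"
    "7689\tOrchid\ten\tUnited Kingdom Species Inventory (UKSI)\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def vernacular_names_for(canonical_name):
    """Return the vernacular rows for a taxon, exactly as stored."""
    taxon_ids = {
        row["taxonID"]
        for row in read_tsv(TAXON_TSV)
        if row["canonicalName"] == canonical_name
    }
    return [row for row in read_tsv(VERNACULAR_TSV) if row["taxonID"] in taxon_ids]


rows = vernacular_names_for("Orchidaceae")
print([(r["vernacularName"], r["language"]) for r in rows])
# [('Orchid', 'en')]
```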

@hlapp
Member

hlapp commented Jul 30, 2025

I remain unconvinced that us making up our own common names according to some bespoke algorithm, and in the absence of a concrete use-case driving the rationale and thus solution, is a good or even useful thing.

That's assuming that having "orchid" as the made-up vernacular name isn't needed to find this record when looking for all orchids. (And if it were needed, we should be rethinking our search index and metadata.)

@vimar-gu
Collaborator

But at the same time, we also have many images whose taxonomic labels are specific only to the kingdom level. This is a similar situation to the one @hlapp described, since those images carry a kingdom label that is shared by every other image under that kingdom.

@thompsonmj
Contributor Author

thompsonmj commented Jul 30, 2025

That's assuming that having "orchid" as the made-up vernacular name isn't needed to find this record when looking for all orchids.

Correct, providing common names is not necessary for any indexing.

This brings the discussion to a positive framing of the utility a common name feature should offer.

  • Simplest: providing a reader a sense of familiarity / accessibility / approachability when seeing an open-ended classification result from e.g. pybioclip (which uses resolved data from this tool for its text embeddings). For example, suppose pybioclip provides an image classification:
Animalia,Chordata,Aves,Anseriformes,Anatidae,Branta,sandvicensis

If "Hawaiian Goose" was unavailable, seeing just the Latin might feel opaque. By providing the common name available at the closest rank, the user could still feel more comfortable with the result:
- Missing "Hawaiian Goose", show "Black Geese" (genus)
- No genus name? Show "Ducks, Geese, Swans" (family)
- No family? Show "Waterfowl" (order)
- likewise up to showing "Animals" (kingdom)
All drawn exactly from the GBIF data, rather than making anything up.

With this, a user can instantly map a Latin string to a more familiar concept.

  • Research related: Common names are included in training.
    From the original BioCLIP paper, a summary quote on incorporating mixed labels (taxonomic + common):

These results indicate that mixed text type pre-training largely retains the generalization benefits of using taxonomic names while also providing flexibility of different text types for inference, an important property for a foundation model that may be used for diverse downstream tasks.

So using common names helps performance, although that experiment used species-level common names only.
While the impact of coarser common names alongside fine-grained taxonomy is not yet tested, a hypothesis could be that coarser vernacular labels (terms already widespread in OpenCLIP's pretraining) would anchor these concepts and empower zero-shot performance while new granular taxonomic detail improves discriminatory ability.

Intuitively, common names of organisms are most pervasive in the training data of CLIP and OpenCLIP and these models work best with common names.

It would be important to test this, though, and therefore important to know when a common name is being reported for a coarser rank than the resolved taxonomy.

Proposed change:
To support both of these use-cases, we could add a column to indicate the rank of the provided common name.

This would make it easy to exclude or include names that do not match the most specific taxonomic rank when desired.
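
A minimal sketch of how such a column could be used downstream, assuming a hypothetical column name `common_name_rank` and made-up records (the actual column name and output schema are not decided in this thread):

```python
# Hypothetical output rows; "common_name_rank" stands in for the proposed column.
records = [
    {"species": "Branta sandvicensis", "common_name": "Hawaiian Goose",
     "common_name_rank": "species"},
    {"species": "Bulbophyllum polliculosum", "common_name": "Orchid",
     "common_name_rank": "family"},
]


def keep_only_rank_matched(rows, resolved_rank="species"):
    """Blank out common names that come from a coarser rank than the resolved one."""
    return [
        row if row["common_name_rank"] == resolved_rank
        else {**row, "common_name": None, "common_name_rank": None}
        for row in rows
    ]


strict = keep_only_rank_matched(records)
print([row["common_name"] for row in strict])  # ['Hawaiian Goose', None]
```

Consumers that want the coarser names simply skip the filter, so one output can serve both use-cases.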

@hlapp
Member

hlapp commented Jul 30, 2025

I'd be OK with at least adding the rank from which a common name is taken if it's not from the same rank as the Latin name.

I will still point out that what we call use-cases here are made up by us, instead of being responsive to what users have actually asked for or run into trouble with. And to that point, I really don't think that showing "Animals" next to Branta sandvicensis is more useful than not showing a common name at all; it is arguably less so. Perhaps for closer ranks this is much less obvious. But it is blurring the semantic border between actually predicting with confidence the species, and, for example, only predicting with confidence the genus, or the family, which would be correctly conveyed by giving the common name of the genus or family, respectively, instead of the species.


4 participants