Deterministic common names; performance improvement; higher rank coverage #14

thompsonmj · 2025-07-23T15:25:59Z

To address #10, as well as:

- improve speed by converting from Pandas to Polars
- include common names for ranks higher than genus and species
- add testing for common names resolution


thompsonmj · 2025-07-24T15:11:31Z

Common names are now chosen deterministically, and only from the GBIF Backbone Taxonomy data (exactly as provided); none are passed through from the input data.

The prioritization is: at a given rank, the first English name is used if one is available; otherwise, the first available name in any language is used. If no name in any language is available at that rank, the same language prioritization is applied at the parent rank.

So it provides the most specific common name available: the first English (or, failing that, other-language) name at the most specific rank in the resolution that has one, checking ranks up to kingdom.

For example,

{
  "scientific_name": "Bulbophyllum polliculosum",
  "kingdom": "Plantae",
  "phylum": "Tracheophyta",
  "class": "Liliopsida",
  "order": "Asparagales",
  "family": "Orchidaceae", # first English name available is "Orchid"
  "genus": "Bulbophyllum", # no common name available in any language
  "species": "Bulbophyllum polliculosum", # no common name available in any language
  "common_name": "Orchid"
}
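The prioritization described above can be sketched in plain Python. This is only an illustration, not the Polars implementation in this PR: `pick_name`, `resolve_common_name`, and the `vernacular` mapping are hypothetical names, and the non-English entry is a placeholder rather than real GBIF data.

```python
from typing import Optional

# Ranks from most to least specific, matching the keys in the example record.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]


def pick_name(names: list[tuple[str, str]]) -> Optional[str]:
    """Return the first English name, else the first name in any language."""
    for name, language in names:
        if language == "en":
            return name
    return names[0][0] if names else None


def resolve_common_name(record: dict, vernacular: dict) -> Optional[str]:
    """Walk from species up to kingdom; return the first common name found."""
    for rank in RANKS:
        taxon = record.get(rank)
        if not taxon:
            continue
        name = pick_name(vernacular.get(taxon, []))
        if name is not None:
            return name
    return None


# Hypothetical lookup: taxon name -> [(vernacular name, language), ...] in source order.
# The "de" entry is a placeholder used only to show that English wins.
vernacular = {
    "Orchidaceae": [("placeholder-de", "de"), ("Orchid", "en")],
}

record = {
    "kingdom": "Plantae",
    "phylum": "Tracheophyta",
    "class": "Liliopsida",
    "order": "Asparagales",
    "family": "Orchidaceae",
    "genus": "Bulbophyllum",
    "species": "Bulbophyllum polliculosum",
}

print(resolve_common_name(record, vernacular))  # Orchid
```

Because the candidate names are scanned in a fixed order and the ranks are walked in a fixed order, the same input always yields the same name.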

Copilot

Pull Request Overview

This PR addresses issue #10 by making common name resolution deterministic while also implementing significant performance improvements and extending common name coverage to higher taxonomic ranks. The changes convert the data processing pipeline from Pandas to Polars for better performance, establish deterministic prioritization of vernacular names (English first, then other languages), and implement hierarchical common name lookup from species up to kingdom level.

Key changes:

- Performance: Migration from Pandas to Polars for data processing
- Deterministic Resolution: English vernacular names are consistently prioritized, with fallback to other languages
- Extended Coverage: Common names now resolved for all taxonomic ranks from species to kingdom

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/taxonopy/resolve_common_names.py` | Core implementation converted to Polars with hierarchical common name lookup and deterministic vernacular prioritization |
| `tests/test_resolve_common_names.py` | Comprehensive test suite covering all new functions with edge cases and integration tests |
| `.github/workflows/run-tests.yaml` | CI workflow configuration for automated testing across multiple Python versions |


vimar-gu

This looks good! We'll depend on taxonomic labels to determine the species.

hlapp · 2025-07-28T21:23:39Z

Is it actually useful to have "Orchid" repeated as the same common name for dozens of different orchid species, for example?

Also I'm surprised the common name for the family is in singular, not plural; is this really true?

thompsonmj · 2025-07-29T21:43:15Z

> Is it actually useful to have "Orchid" repeated as the same common name for dozens of different orchid species, for example?

This is a key question. It could be. For the dozens of orchid species in the example that have no common name available in any language, it may be useful to know that they are all members of the family "Orchid."

Alternatives could be:

- return no common name for an organism where a common name is not available for the most specific rank
- return a hierarchy of common names down to as specific as possible (matched against the taxonomic hierarchy entries)
- detect when the common name is coarse compared to the resolved rank and append "a {resolved_rank} in the {vernacular_name} {vernacular_rank}", e.g. "a species in the Orchid family"
- provide CLI options to specify which of the above is desired
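As a rough sketch of the third alternative, the template could be applied only when the common name comes from a coarser rank than the resolved one. The function name `describe_common_name` and its parameters are hypothetical and not part of this PR.

```python
from typing import Optional


def describe_common_name(
    resolved_rank: str,
    vernacular_name: Optional[str],
    vernacular_rank: Optional[str],
) -> Optional[str]:
    """Format a common name, flagging it when it comes from a coarser rank."""
    if vernacular_name is None:
        return None
    if vernacular_rank == resolved_rank:
        # The name belongs to the resolved taxon itself; return it unchanged.
        return vernacular_name
    return f"a {resolved_rank} in the {vernacular_name} {vernacular_rank}"


print(describe_common_name("species", "Orchid", "family"))
# a species in the Orchid family
```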

@hlapp what do you think?

> Also I'm surprised the common name for the family is in singular, not plural; is this really true?

We currently take the entry exactly as-is from the VernacularName.tsv supplied by the GBIF backbone. So in this case, yes, the family name is singular: "Orchidaceae" is "taxonID": 7689 in Taxon.tsv, which maps to the row `7689 Orchid en United Kingdom Species Inventory (UKSI)` in VernacularName.tsv. It might not always be the case though; I haven't looked into how consistently higher-rank common names are singular or plural.
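For anyone wanting to check a row like this by hand, a minimal lookup by `taxonID` might look like the following. It uses the standard library rather than the Polars code in this PR, and the column header (`taxonID`, `vernacularName`, `language`, `source`) is an assumption about the file layout, so verify it against the actual VernacularName.tsv. The sample text contains only the single row quoted above.

```python
import csv
import io

# Assumed header; the data row is the one quoted in the comment above.
SAMPLE_TSV = (
    "taxonID\tvernacularName\tlanguage\tsource\n"
    "7689\tOrchid\ten\tUnited Kingdom Species Inventory (UKSI)\n"
)


def names_for_taxon(tsv_text: str, taxon_id: str) -> list[dict]:
    """Return all vernacular rows for one taxonID, in file order."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row for row in reader if row["taxonID"] == taxon_id]


rows = names_for_taxon(SAMPLE_TSV, "7689")
print(rows[0]["vernacularName"], rows[0]["language"])  # Orchid en
```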

hlapp · 2025-07-30T01:23:20Z

I remain unconvinced that us making up our own common names according to some bespoke algorithm, and in the absence of a concrete use-case driving the rationale and thus solution, is a good or even useful thing.

That's assuming that having "orchid" as the made-up vernacular name isn't needed to find this record when looking for all orchids. (And if it were needed, we should be rethinking our search index and metadata.)

vimar-gu · 2025-07-30T03:22:56Z

But at the same time, we also have many images with taxonomic labels only specific to the kingdom level. This would be a similar situation to what @hlapp described, as the images are attached with the kingdom label that is shared by all the other images under this kingdom.

thompsonmj · 2025-07-30T17:30:52Z

> That's assuming that having "orchid" as the made-up vernacular name isn't needed to find this record when looking for all orchids.

Correct, providing common names is not necessary for any indexing.

This brings the discussion to a positive framing of the utility a common name feature should offer.

Simplest: providing a reader a sense of familiarity / accessibility / approachability when seeing an open-ended classification result from e.g. pybioclip (which uses resolved data from this tool for its text embeddings). For example, suppose pybioclip provides an image classification:

Animalia,Chordata,Aves,Anseriformes,Anatidae,Branta,sandvicensis

If "Hawaiian Goose" was unavailable, seeing just the Latin might feel opaque. By providing the common name available at the closest rank, the user could still feel more comfortable with the result:
- Missing "Hawaiian Goose", show "Black Geese" (genus)
- No genus name? Show "Ducks, Geese, Swans" (family)
- No family? Show "Waterfowl" (order)
- likewise up to showing "Animals" (kingdom)
All drawn exactly from the GBIF data, rather than making anything up.

With this, a user can instantly map a Latin string to a more familiar concept.

Research related: Common names are included in training.
From the original BioCLIP paper, a summary quote on incorporating mixed labels (taxonomic + common):

> These results indicate that mixed text type pre-training largely retains the generalization benefits of using taxonomic names while also providing flexibility of different text types for inference, an important property for a foundation model that may be used for diverse downstream tasks.

So using common names helps performance, although this used species-level common names only.
While the impact of coarser common names alongside fine-grained taxonomy is not yet tested, a hypothesis could be that coarser vernacular labels (terms already widespread in OpenCLIP's pretraining) would anchor these concepts and empower zero-shot performance, while new granular taxonomic detail improves discriminatory ability.

Intuitively, common names of organisms are most pervasive in the training data of CLIP and OpenCLIP and these models work best with common names.

It would be important to test this, though, and therefore important to know when a common name is being reported for a coarser rank than the resolved taxonomy.

Proposed change:
To support both of these use-cases, we could add a column to indicate the rank of the provided common name.

This would make it easy to include or exclude names that do not match the most specific resolved taxonomic rank, as desired.
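A minimal sketch of what such a column could enable, in plain Python rather than Polars. The column name `common_name_rank` and the helper functions are hypothetical; the names in the lookup are the ones from the Hawaiian Goose example above, with the species-level name deliberately left out.

```python
from typing import Optional

# Ranks from most to least specific.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]


def resolve_with_rank(record: dict, names: dict) -> tuple[Optional[str], Optional[str]]:
    """Return (common name, rank it came from), walking species -> kingdom."""
    for rank in RANKS:
        taxon = record.get(rank)
        name = names.get(taxon) if taxon else None
        if name:
            return name, rank
    return None, None


def most_specific_rank(record: dict) -> Optional[str]:
    """Return the most specific rank that is filled in for this record."""
    for rank in RANKS:
        if record.get(rank):
            return rank
    return None


record = {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Aves",
    "order": "Anseriformes",
    "family": "Anatidae",
    "genus": "Branta",
    "species": "Branta sandvicensis",
}
# Simplified lookup (taxon -> chosen name); the species entry is intentionally absent.
names = {"Branta": "Black Geese", "Anatidae": "Ducks, Geese, Swans"}

name, rank = resolve_with_rank(record, names)
row = {**record, "common_name": name, "common_name_rank": rank}

# Downstream users can keep or drop names that come from a coarser rank.
is_exact = row["common_name_rank"] == most_specific_rank(record)
print(row["common_name"], row["common_name_rank"], is_exact)
# Black Geese genus False
```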

hlapp · 2025-07-30T22:25:40Z

I'd be OK with at least adding the rank a common name comes from if it's not from the same rank as the Latin name.

I will still point out that what we call use-cases here are made up by us, instead of being responsive to what users have actually asked for or run into trouble with. And to that point, I really don't think that showing "Animals" next to Branta sandvicensis is more useful, and arguably less so, than not showing a common name at all. Perhaps for closer ranks this is much less obvious. But it is blurring the semantic border between actually predicting with confidence the species, and, for example, only predicting with confidence the genus, or the family, which would be correctly conveyed by giving the common name of the genus and family, respectively, instead of the species.

thompsonmj added 7 commits July 22, 2025 20:28

Deterministic common names, English prioritized; get names for higher ranks as well; convert common names tooling to Polars

cd7b248

Modularize helper functions

6f20c18

Test common name functionality

ffa0322

Run tests with Actions workflow

aee8c7a

Address issues identified running on real data

fe0ccd8

Dry code; sync tests

4e6ec07

Opt to preserve source data capitalization vs title case

a4ef5ae

thompsonmj marked this pull request as ready for review July 24, 2025 15:11

thompsonmj requested review from Copilot and vimar-gu July 24, 2025 15:11

Copilot AI reviewed Jul 24, 2025


thompsonmj and others added 4 commits July 24, 2025 13:14

Fix typeo

ca8d0a2

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix typo

a59f63e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix typo

c6d7b86

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix typo

1302e46

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

vimar-gu approved these changes Jul 25, 2025
