Deterministic common names; performance improvement; higher rank coverage #14
base: main
… ranks as well; convert common names tooling to Polars
Common names are now chosen deterministically, and only from the GBIF Backbone Taxonomy data (exactly as provided); none are passed through from the input data. The prioritization is: if no English name is available at a rank, the first name available in any language is returned. If no name in any language is available, the same language prioritization is applied at the parent rank. The result is the first available English (or other-language) name at the most specific rank in the resolution that has a common name, up to kingdom. For example,
Pull Request Overview
This PR addresses issue #10 by making common name resolution deterministic while also implementing significant performance improvements and extending common name coverage to higher taxonomic ranks. The changes convert the data processing pipeline from Pandas to Polars for better performance, establish deterministic prioritization of vernacular names (English first, then other languages), and implement hierarchical common name lookup from species up to kingdom level.
Key changes:
- Performance: Migration from Pandas to Polars for data processing
- Deterministic Resolution: English vernacular names are consistently prioritized, with fallback to other languages
- Extended Coverage: Common names now resolved for all taxonomic ranks from species to kingdom
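A minimal sketch of the prioritization described above, written in plain Python rather than the PR's Polars code. The names `RANKS`, `pick_name`, and `resolve_common_name`, the data layout, and the sample data are assumptions for illustration, not taxonopy's actual API. It also assumes the vernacular names for each taxon arrive in an already deterministic order.

```python
from typing import Dict, List, Optional, Tuple

# Ranks from most to least specific (illustrative ordering, an assumption).
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]

Vernacular = Tuple[str, str]  # (language code, name)


def pick_name(vernaculars: List[Vernacular]) -> Optional[str]:
    """Return the first English name, else the first name in any language."""
    for lang, name in vernaculars:
        if lang == "en":
            return name
    return vernaculars[0][1] if vernaculars else None


def resolve_common_name(
    lineage: Dict[str, str],
    names_by_taxon: Dict[str, List[Vernacular]],
) -> Optional[Tuple[str, str]]:
    """Walk from species up to kingdom; return (common name, rank it came from)."""
    for rank in RANKS:
        taxon = lineage.get(rank)
        if taxon is None:
            continue
        name = pick_name(names_by_taxon.get(taxon, []))
        if name is not None:
            return name, rank
    return None


# Hypothetical sample data, hand-written for this sketch (not read from GBIF).
lineage = {"species": "Branta sandvicensis", "genus": "Branta", "kingdom": "Animalia"}
names = {
    "Branta sandvicensis": [("haw", "Nēnē"), ("en", "Hawaiian Goose")],
    "Animalia": [("en", "Animals")],
}

print(resolve_common_name(lineage, names))
print(resolve_common_name(lineage, {"Animalia": [("en", "Animals")]}))
```

Returning the source rank alongside the name keeps it visible when the name comes from a coarser rank than the resolved taxon.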
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/taxonopy/resolve_common_names.py | Core implementation converted to Polars with hierarchical common name lookup and deterministic vernacular prioritization |
| tests/test_resolve_common_names.py | Comprehensive test suite covering all new functions with edge cases and integration tests |
| .github/workflows/run-tests.yaml | CI workflow configuration for automated testing across multiple Python versions |
vimar-gu left a comment
This looks good! We'll depend on taxonomic labels to determine the species.
Is it actually useful to have "Orchid" repeated as the same common name for dozens of different orchid species, for example? Also, I'm surprised the common name for the family is in the singular, not the plural; is this really true?
This is a key question. It could be. For example, for the dozens of orchid species that have no common name available in any language, it may be useful to know that they are all instances of the family "Orchid." Alternatives could be:
@hlapp what do you think?
We currently take the entry exactly as-is from the |
I remain unconvinced that our making up common names according to some bespoke algorithm, in the absence of a concrete use-case driving the rationale and thus the solution, is a good or even useful thing. That's assuming that having "orchid" as the made-up vernacular name isn't needed to find this record when looking for all orchids. (And if it were needed, we should be rethinking our search index and metadata.)
But at the same time, we also have many images whose taxonomic labels are specific only to the kingdom level. This is similar to the situation @hlapp described, since those images carry a kingdom label that is shared by every other image under that kingdom.
Correct, providing common names is not necessary for any indexing. This brings the discussion to a positive framing of the utility a common name feature should offer.
If "Hawaiian Goose" was unavailable, seeing just the Latin might feel opaque. By providing the common name available at the closest rank, the user could still feel more comfortable with the result: With this, a user can instantly map a Latin string to a more familiar concept.
So using common names helps performance, although this used species level common names only.
It would be important to test this, though, and thus important to know when a common name is being reported for a coarser level than the resolved taxonomy. Proposed change: This would make it easy to exclude or include names that do not match the most specific taxonomic rank, as desired.
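As a rough sketch of how a recorded source rank would support that filtering: the field names `resolved_rank` and `common_name_rank`, the helper `split_by_rank_match`, and the sample records below are all hypothetical, chosen only to illustrate the idea, and are not the PR's actual output schema.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_by_rank_match(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records whose common name comes from the resolved rank
    from those whose common name was borrowed from a coarser rank."""
    exact: List[Record] = []
    coarser: List[Record] = []
    for rec in records:
        if rec["common_name_rank"] == rec["resolved_rank"]:
            exact.append(rec)
        else:
            coarser.append(rec)
    return exact, coarser


# Hypothetical records for illustration only.
records = [
    {
        "scientific_name": "Branta sandvicensis",
        "resolved_rank": "species",
        "common_name": "Hawaiian Goose",
        "common_name_rank": "species",
    },
    {
        "scientific_name": "placeholder orchid species",
        "resolved_rank": "species",
        "common_name": "Orchid",
        "common_name_rank": "family",
    },
]

exact, coarser = split_by_rank_match(records)
print([r["common_name"] for r in exact])
print([r["common_name"] for r in coarser])
```

With the two ranks stored side by side, a consumer can keep only exact-rank names, or keep all names and display the coarser ones with their rank.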
I'd be OK with at least adding the rank a common name comes from when it is not from the same rank as the Latin name. I will still point out that what we call use-cases here are made up by us, instead of being responsive to what users have actually asked for or run into trouble with. And to that point, I really don't think that showing "Animals" next to Branta sandvicensis is more useful, and arguably it is less so, than not showing a common name at all. Perhaps for closer ranks this is much less obvious. But it blurs the semantic border between actually predicting the species with confidence and, for example, only predicting the genus or the family with confidence, which would be correctly conveyed by giving the common name of the genus or family, respectively, instead of that of the species.
To address #10, as well as: