-
Notifications
You must be signed in to change notification settings - Fork 52
Set waterdata data types #195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves data type handling in the waterdata module by replacing the _cleanup_cols function with a more comprehensive _type_cols function. The key motivation is that the previous approach using .dt.date for daily data did not return proper datetime objects.
Key changes:
- Refactored
_cleanup_colsto_type_colswith expanded type handling - Added support for datetime columns (time, datetime, last_modified) using
pd.to_datetime() - Added categorical type conversion for identifier and status columns
- Removed the
serviceparameter dependency, making the function more generic
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
dataretrieval/waterdata/utils.py
Outdated
| df[col] = pd.to_numeric(df[col], errors="coerce") | ||
| cols = set(df.columns) | ||
| numerical_cols = ["value", "contributing_drainage_area"] | ||
| time_cols = ["time", "datetime", "last_modified"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a "datetime" column in any of the endpoints? I went through all of them (including continuous) and I don't see it. However, I do see some others:
Time series metadata: begin, end, begin_utc, end_utc
Monitoring locations: construction_date
And some additional numerics:
Monitoring locations: altitude, altitude_accuracy, drainage_area, well_constructed_depth, hole_constructed_depth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of a datetime column, but see no harm in including it.
| - The 'value' and 'contributing_drainage_area' columns are coerced to numeric types. | ||
| """ | ||
| if "time" in df.columns and service == "daily": | ||
| df["time"] = pd.to_datetime(df["time"]).dt.date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of the function originally was just changing the "time" column to simply a date (not a timestamp) for the daily values endpoint only, so that the user wouldn't be confused about whether the value represents a daily aggregated value (min, max, mean, etc.) or a particular measurement. This logic was initially introduced to match what R dataRetrieval was doing: https://github.com/DOI-USGS/dataRetrieval/blob/develop/R/walk_pages.R#L141
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It was my recollection that a dt.date lacks datetime functionality, furthermore, the parsing behavior of pd.to_datetime seems to have changed. By default, it correctly omits the time information. Maybe this was a pandas update, but in the current version, it seems correct to leave "time" as a datetime object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not seeing that, or I am misunderstanding. When I run your branch and pull from get_latest_daily, the "time" column shows up as "2025-12-01 00:00:00", whereas in the existing implementation, it shows up as "2025-12-01".
check, md = waterdata.get_latest_daily(
monitoring_location_id="USGS-05129115",
parameter_code="00060"
)
I like the existing implementation for daily summaries only, because the date cannot be confused with a singular measurement, and it indeed represents a "summary" value. However, if it causes problems by being inconsisent in the daily summary services, I'm open to applying a consistent rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're correct, dt.date does lack datetime functionality. It changes it to an object. Hm. It does make sense to give it a datetime type. Nevermind. We might then just want to say that the additional "00:00:00" added to it doesn't represent a singular value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did it display 00:00:00? it didn't for me, so this behavior was probably changed at some version of pandas.
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
ehinman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all I can think of is "lgtm". Approved.
This is an initial PR to imrove data type handling within the
waterdatamodule.Simply calling
pd.to_datetime(df['time'])appears to correctly handle daily data.