Skip to content

Conversation

@thodson-usgs
Copy link
Collaborator

@thodson-usgs thodson-usgs commented Dec 1, 2025

This is an initial PR to imrove data type handling within the waterdata module.
Simply calling pd.to_datetime(df['time']) appears to correctly handle daily data.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves data type handling in the waterdata module by replacing the _cleanup_cols function with a more comprehensive _type_cols function. The key motivation is that the previous approach using .dt.date for daily data did not return proper datetime objects.

Key changes:

  • Refactored _cleanup_cols to _type_cols with expanded type handling
  • Added support for datetime columns (time, datetime, last_modified) using pd.to_datetime()
  • Added categorical type conversion for identifier and status columns
  • Removed the service parameter dependency, making the function more generic

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@thodson-usgs thodson-usgs marked this pull request as draft December 1, 2025 19:47
@thodson-usgs thodson-usgs changed the title Set waterdata data types WIP: Set waterdata data types Dec 1, 2025
@thodson-usgs thodson-usgs changed the title WIP: Set waterdata data types Set waterdata data types Dec 1, 2025
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@thodson-usgs thodson-usgs requested a review from ehinman December 1, 2025 20:02
@thodson-usgs thodson-usgs self-assigned this Dec 1, 2025
@thodson-usgs thodson-usgs added the bug Something isn't working label Dec 1, 2025
@thodson-usgs thodson-usgs marked this pull request as ready for review December 1, 2025 20:05
df[col] = pd.to_numeric(df[col], errors="coerce")
cols = set(df.columns)
numerical_cols = ["value", "contributing_drainage_area"]
time_cols = ["time", "datetime", "last_modified"]
Copy link
Collaborator

@ehinman ehinman Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a "datetime" column in any of the endpoints? I went through all of them (including continuous) and I don't see it. However, I do see some others:

Time series metadata: begin, end, begin_utc, end_utc
Monitoring locations: construction_date

And some additional numerics:

Monitoring locations: altitude, altitude_accuracy, drainage_area, well_constructed_depth, hole_constructed_depth

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of a datetime column, but see no harm in including it.

- The 'value' and 'contributing_drainage_area' columns are coerced to numeric types.
"""
if "time" in df.columns and service == "daily":
df["time"] = pd.to_datetime(df["time"]).dt.date
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of the function originally was just changing the "time" column to simply a date (not a timestamp) for the daily values endpoint only, so that the user wouldn't be confused about whether the value represents a daily aggregated value (min, max, mean, etc.) or a particular measurement. This logic was initially introduced to match what R dataRetrieval was doing: https://github.com/DOI-USGS/dataRetrieval/blob/develop/R/walk_pages.R#L141

Copy link
Collaborator Author

@thodson-usgs thodson-usgs Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. It was my recollection that a dt.date lacks datetime functionality, furthermore, the parsing behavior of pd.to_datetime seems to have changed. By default, it correctly omits the time information. Maybe this was a pandas update, but in the current version, it seems correct to leave "time" as a datetime object.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing that, or I am misunderstanding. When I run your branch and pull from get_latest_daily, the "time" column shows up as "2025-12-01 00:00:00", whereas in the existing implementation, it shows up as "2025-12-01".

check, md = waterdata.get_latest_daily(
    monitoring_location_id="USGS-05129115",
    parameter_code="00060"
)

I like the existing implementation for daily summaries only, because the date cannot be confused with a singular measurement, and it indeed represents a "summary" value. However, if it causes problems by being inconsisent in the daily summary services, I'm open to applying a consistent rule.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're correct, dt.date does lack datetime functionality. It changes it to an object. Hm. It does make sense to give it a datetime type. Nevermind. We might then just want to say that the additional "00:00:00" added to it doesn't represent a singular value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did it display 00:00:00? it didn't for me, so this behavior was probably changed at some version of pandas.

thodson-usgs and others added 3 commits December 2, 2025 09:01
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
@thodson-usgs thodson-usgs requested a review from ehinman December 2, 2025 17:56
Copy link
Collaborator

@ehinman ehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now all I can think of is "lgtm". Approved.

@ehinman ehinman merged commit 237ed81 into DOI-USGS:main Dec 2, 2025
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants