Commit 83e7b78

weiji14, maxrjones, pre-commit-ci[bot], andersy005, and dcherian authored
Pangeo ML Ecosystem blog post (#625)
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Anderson Banihirwe <13301940+andersy005@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Anderson Banihirwe <axbanihirwe@ualr.edu>
1 parent 31cb136 commit 83e7b78

File tree: 3 files changed (+126, -2 lines)
Binary image file (308 KB) not shown

src/components/layout.js

Lines changed: 4 additions & 2 deletions
@@ -13,10 +13,12 @@ export const Layout = ({
   url = 'https://xarray.dev',
   enableBanner = false,
 }) => {
-  const bannerTitle = 'Checkout the new blog post on CuPy-Xarray!'
+  const bannerTitle = 'Checkout the new blog post on Pangeo ML!'
   const bannerDescription = ''
   const bannerChildren = (
-    <Link href='/blog/cupy-tutorial'>CuPy-Xarray: Xarray on GPUs!</Link>
+    <Link href='/blog/pangeo-ml-ecosystem-2023'>
+      The Pangeo Machine Learning Ecosystem!
+    </Link>
   )
   return (
     <>
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
---
title: 'The Pangeo Machine Learning Ecosystem'
date: '2024-04-03'
authors:
  - name: Wei Ji Leong
    github: weiji14
  - name: Max Jones
    github: maxrjones
  - name: Negin Sobhani
    github: negin513
  - name: Deepak Cherian
    github: dcherian
summary: 'An overview of the open source libraries enabling geospatial machine learning in the Pangeo community.'
---

## TLDR

Open source tools developed by the Pangeo ML community are enabling the shift to cloud-native geospatial Machine Learning.
Join the [Pangeo ML community](https://pangeo.io/meeting-notes.html#working-group-meetings) working towards scalable [GPU-native](./xarray-kvikio) workflows! 🚀

## Overview

### Building next-generation Machine Learning (ML) tools

At FOSS4G SotM Oceania 2023, we presented on "The ecosystem of geospatial machine learning tools in the Pangeo World" (see the recording [here](https://www.youtube.com/watch?v=X2LBuUfSo5Q)).
One of the driving forces of the Pangeo community is to build better tools that will enable scientific workflows on petabyte-scale datasets, such as Climate/Weather projections that will impact the planet over the coming decades.

To do that, we need to be fast.

These next-generation tools need to be scalable, efficient, and modular.
So we are designing them with three aspects in mind:

- Work with cloud-native data using **GPU-native** compute
- Be able to **stream** subsets of data on-the-fly
- Go from single sensor to **multi-modal** models

None of these core technologies is particularly new.
NVIDIA has been leading the development of GPU-native [RAPIDS AI](https://rapids.ai) libraries since 2018.
Streaming has been around since the 2010s if not earlier, and is practically the most common way to consume music and video content nowadays.
More recently, we have also seen the rise of [multi-modal Foundation Models](https://doi.org/10.48550/arXiv.2309.10020) that are able to take in visual (image) and language (text and sound) cues.

Let's now take a step back and picture what we're working with.

## Layers of the Pangeo Machine Learning stack

![Pangeo Machine Learning Ecosystem in 2023. Bottom row shows cloud-optimized file formats. Middle row shows Array libraries. Top row shows the Pangeo ML libraries and educational resources.](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/pangeo_ml_ecosystem.png)

There are three main layers to a Machine Learning data pipeline.
It starts with data storage file formats on the bottom row, in-memory array representations in the middle, and the high-level libraries and documentation resources that users or developers interact with at the top.

The key to connecting all of these layers is open standards.

### Cloud-native geospatial file formats

For the file formats, we favour [cloud-native geospatial](https://www.ogc.org/ogc-topics/cloud-native-geospatial) formats because they allow us to efficiently access subsets of data without reading the entire file.
Generally speaking, you would store rasters as [Zarr](https://zarr.dev) or [Cloud-Optimized GeoTIFFs](https://www.cogeo.org), and vectors (points/lines/polygons) as [FlatGeobuf](https://flatgeobuf.org) or [(Geo)Parquet](https://geoparquet.org).
Ideally though, these files would be indexed using a [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org), which makes it easier to discover datasets using standardized queries.
This can be a whole topic in itself, so check out this [guide](https://guide.cloudnativegeo.org) [published in October 2023](https://cloudnativegeo.org/blog/2023/10/introducing-the-cloud-optimized-geospatial-formats-guide) for more details!
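
To make this concrete, here is a minimal sketch of a STAC-based workflow, assuming the publicly available Earth Search STAC API plus the `pystac-client` and `stackstac` packages (the bounding box and date range are just illustrative):

```python
# Minimal sketch: discover Cloud-Optimized GeoTIFFs via STAC and load them lazily.
# Assumes the public Earth Search STAC API plus pystac-client and stackstac.
import pystac_client
import stackstac

# Query a STAC API for Sentinel-2 scenes over a small bounding box and date range
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[174.6, -37.0, 174.9, -36.8],
    datetime="2023-01-01/2023-01-31",
)
items = search.item_collection()

# Stack the matching Cloud-Optimized GeoTIFF assets into a lazy (Dask-backed)
# xarray.DataArray; only the pixels you actually compute on get downloaded
dataarray = stackstac.stack(items, assets=["red", "nir"], resolution=100)
print(dataarray)
```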

### In-memory array representations

In the Python world, [NumPy](https://numpy.org) arrays have been the core way of representing arrays in-memory, but there are many others too, along with an ongoing movement to standardize the array/dataframe API at [https://data-apis.org](https://data-apis.org).
Geospatial folks would most likely be familiar with vector libraries like [GeoPandas](https://geopandas.org) GeoDataFrames (built on top of [pandas](https://pandas.pydata.org)); or raster libraries like [rioxarray](https://corteva.github.io/rioxarray) and [stackSTAC](https://stackstac.readthedocs.io) that read into [xarray](https://xarray.dev) data structures.

NumPy arrays are CPU-based, but there are also libraries like [CuPy](https://cupy.dev) which can do GPU-accelerated computations.
Instead of GeoPandas, you could use libraries like [cuSpatial](https://docs.rapids.ai/api/cuspatial) (built on top of [cuDF](https://docs.rapids.ai/api/cudf) and part of [RAPIDS AI](https://rapids.ai)) to run GPU-accelerated algorithms.
Deep Learning libraries like [PyTorch](https://pytorch.org/docs), [TensorFlow](https://www.tensorflow.org) or [JAX](https://jax.readthedocs.io) tend to be GPU-based as well, but there are also libraries like [Datashader](https://datashader.org) (for visualization) and [Xarray](https://xarray.dev) that are designed to be CPU/GPU agnostic and can hold either.
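
As a quick illustration of that CPU/GPU duality, here is a minimal sketch (the CuPy half assumes a CUDA-capable GPU is available) of Xarray wrapping either a NumPy or a CuPy array:

```python
# Minimal sketch of CPU/GPU-agnostic array wrapping.
# Running the CuPy part requires a CUDA-capable GPU.
import cupy as cp
import numpy as np
import xarray as xr

# The same xarray.DataArray wrapper can hold either a CPU (NumPy) or GPU (CuPy) array
da_cpu = xr.DataArray(data=np.random.rand(512, 512), dims=("y", "x"))
da_gpu = xr.DataArray(data=cp.random.rand(512, 512), dims=("y", "x"))

print(type(da_cpu.data))  # a NumPy ndarray in host memory
print(type(da_gpu.data))  # a CuPy ndarray that stays on the GPU
```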

### High-level Pangeo ML libraries

Finally, to make life simpler, we have high-level convenience libraries wrapping the low-level stuff.
These are designed to have a nicer user interface to connect the underlying file formats and in-memory array representations.
The [Pangeo Machine Learning Working Group](https://pangeo.io/meeting-notes.html#working-group-meetings) mostly works on Climate/Weather datasets, so we'll focus on multi-dimensional arrays for now.

Stepping into the GPU-native world, [cupy-xarray](https://cupy-xarray.readthedocs.io) allows users to use GPU-backed CuPy arrays in n-dimensional Xarray data structures (see our previous [blog post](./cupy-tutorial) on this).
An exciting development on this front is the experimental [kvikIO](https://github.com/rapidsai/kvikio) engine that enables low-latency reading of data from Zarr stores into GPU memory using NVIDIA GPUDirect Storage technology (see this [blog post](./xarray-kvikio)).
[Preliminary benchmarks](https://github.com/zarr-developers/zarr-benchmark/discussions/14) suggest that the GPU-based `kvikIO` engine can take about 25% less time for data reads compared to the regular CPU-based `zarr` engine!
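
For a flavour of what this looks like, here is a minimal sketch comparing the standard `zarr` engine with the experimental `kvikio` engine; the store path and variable name are placeholders, and the `kvikio` engine assumes cupy-xarray is installed on a GPUDirect Storage capable system:

```python
# Minimal sketch of reading Zarr data straight into GPU memory (experimental).
# Assumes cupy-xarray (which registers the "kvikio" engine) is installed and the
# system supports NVIDIA GPUDirect Storage; "air-temperature.zarr" and the "air"
# variable are placeholder names for illustration only.
import xarray as xr

# Regular CPU read: chunks land in host memory as NumPy arrays
ds_cpu = xr.open_dataset("air-temperature.zarr", engine="zarr", consolidated=False)

# GPU-native read: chunks land in device memory as CuPy arrays
ds_gpu = xr.open_dataset("air-temperature.zarr", engine="kvikio", consolidated=False)

print(type(ds_gpu["air"].data))  # a CuPy array living on the GPU
```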

Once you have tensors loaded (lazily) into an Xarray data structure, [xbatcher](https://xbatcher.readthedocs.io) enables efficient iteration over batches of data in a streaming fashion.
This library makes it easier to train machine learning models on big datacubes such as time-series datasets or multi-variate ocean/climate model outputs, as users can do on-the-fly slicing using named variables (more readable than numbered indexes).
There is also an experimental [cache mechanism](https://github.com/xarray-contrib/xbatcher/pull/167) we'd like more people to try and provide feedback on!
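
Here is a minimal sketch of that streaming iteration with xbatcher, using a small synthetic `sst` datacube as a stand-in for a real ocean/climate dataset:

```python
# Minimal sketch of streaming batches from an Xarray datacube with xbatcher.
# The synthetic dataset below stands in for a real ocean/climate datacube.
import numpy as np
import xarray as xr
import xbatcher

ds = xr.Dataset(
    data_vars={"sst": (("time", "lat", "lon"), np.random.rand(365, 180, 360))},
    coords={
        "time": np.arange(365),
        "lat": np.linspace(-90, 90, 180),
        "lon": np.linspace(-180, 180, 360),
    },
)

# Slice the datacube on-the-fly into (time=16, lat=32, lon=32) chunks using named dimensions
bgen = xbatcher.BatchGenerator(ds=ds, input_dims={"time": 16, "lat": 32, "lon": 32})
for batch in bgen:
    print(batch["sst"].shape)  # (16, 32, 32)
    break
```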

To connect all of the pieces, [zen3geo](https://zen3geo.readthedocs.io) implements Composable DataPipes for geospatial data.
It acts as the glue to chain together different building blocks, such as readers for vector/raster file formats, converters between different in-memory array representations, and even custom pre-processing functions.
The composable design pattern makes it well suited for building complex machine learning data pipelines for multi-modal models that can take in different inputs (e.g. images, point clouds, trajectories, text/sound, etc.).
Going forward, there are plans to [refactor the backend to be asynchronous-first](https://github.com/weiji14/zen3geo/discussions/117) to overcome I/O bottlenecks.
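
A rough sketch of chaining zen3geo DataPipes is shown below, assuming zen3geo (with its torchdata dependency) is installed and using a placeholder Cloud-Optimized GeoTIFF URL:

```python
# Rough sketch of chaining composable zen3geo DataPipes.
# Assumes zen3geo and torchdata are installed; the COG URL is a placeholder.
from torchdata.datapipes.iter import IterableWrapper

import zen3geo  # importing zen3geo registers the geospatial DataPipes

cog_urls = ["https://example.com/some-image.tif"]  # hypothetical COG location

# Chain building blocks: read rasters with rioxarray, then slice into 512x512 chips with xbatcher
dp = IterableWrapper(iterable=cog_urls)
dp_rioxarray = dp.read_from_rioxarray()
dp_chips = dp_rioxarray.slice_with_xbatcher(input_dims={"y": 512, "x": 512})

# Iterating over the pipeline streams out chips one at a time (needs a real COG URL)
for chip in dp_chips:
    print(chip)
    break
```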

## Summary

We've presented a snapshot of the Pangeo Machine Learning ecosystem from 2023.
The basis of any machine learning project is the data, and we touched on how cloud-native geospatial file formats and in-memory array representations built on open standards act as the foundation for our work.
Lastly, we highlighted some of the high-level Pangeo ML libraries enabling user-friendly access to GPU-native compute, streaming data batches, and composable geospatial data pipelines.

## Where to learn more

- Educational resources:

  - [Project Pythia Cookbooks](https://cookbooks.projectpythia.org)
  - [GeoSMART Machine Learning Curriculum](https://geo-smart.github.io/mlgeo-book)
  - [University of Washington Hackweeks as a Service](https://guidebook.hackweek.io)

- Pangeo ML Working Group:

  - [Monthly meetings](https://pangeo.io/meeting-notes.html#working-group-meetings)
  - [Discourse Forum](https://discourse.pangeo.io/tag/machine-learning)

## Acknowledgments

The work above is the cumulative effort of folks from the Pangeo, Xarray and RAPIDS AI communities, plus more!
In particular, we'd like to acknowledge the work of [Deepak Cherian](https://github.com/dcherian) at [Earthmover](https://earthmover.io) and [Negin Sobhani](https://github.com/negin513) at [NCAR](https://ncar.ucar.edu) for their work on cupy-xarray/kvikIO,
[Max Jones](https://github.com/maxrjones) at [Carbonplan](https://carbonplan.org) for recent developments on the xbatcher package,
and [Wei Ji Leong](https://github.com/weiji14) at [Development Seed](https://developmentseed.org) for the development of zen3geo.

Xbatcher development was partly funded by NASA's Advancing Collaborative Connections for Earth System Science (ACCESS) award "Pangeo ML - Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data" (80NSSC21M0065).
Cupy-Xarray development was partly funded by NSF Earthcube award ["Jupyter Meets the Earth" (1928374)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1928374), and NASA's Open Source Tools, Frameworks, and Libraries award "Enhancing analysis of NASA data with the open-source Python Xarray Library" (80NSSC22K0345).

## Appendix I: Further Reading

- [The Composable Codex](https://voltrondata.com/codex)
- [zen3geo 2022 Pangeo ML Working Group presentation](https://discourse.pangeo.io/t/monday-november-07-2022-machine-learning-working-group-presentation-zen3geo-guiding-earth-observation-data-on-its-path-to-enlightenment-by-wei-ji-leong/2883) ([recording](https://www.youtube.com/watch?v=8uhOtQUTuDg))
- [Xbatcher 2023 AMS presentation](https://doi.org/10.6084/m9.figshare.22264072.v1) ([recording](https://ams.confex.com/recording/ams/103ANNUAL/mp4/CGNTFL54WCL/67cfb841cba94216ff99f1eb15286ba2/session63444_5.mp4), starts at 45:30)
- [CuPy-Xarray tutorial at SciPy 2023](https://doi.org/10.5281/zenodo.8247471) ([jupyter-book](https://negin513.github.io/cupy-xarray-tutorials/README.html))
- [Pangeo ML Ecosystem presentation at FOSS4G SotM Oceania 2023](https://github.com/weiji14/foss4g2023oceania) ([recording](https://www.youtube.com/watch?v=X2LBuUfSo5Q))
- [Earthmover blog post on cloud native data loaders for machine learning using xarray and zarr](https://earthmover.io/blog/cloud-native-dataloader)
- [Development Seed blog post on GPU-native machine learning](https://developmentseed.org/blog/2024-03-19-combining-cloud-gpu-native)
