| 1 | +--- |
| 2 | +title: 'The Pangeo Machine Learning Ecosystem' |
| 3 | +date: '2024-04-03' |
| 4 | +authors: |
| 5 | + - name: Wei Ji Leong |
| 6 | + github: weiji14 |
| 7 | + - name: Max Jones |
| 8 | + github: maxrjones |
| 9 | + - name: Negin Sobhani |
| 10 | + github: negin513 |
| 11 | + - name: Deepak Cherian |
| 12 | + github: dcherian |
| 13 | +summary: 'An overview of the open source libraries enabling geospatial machine learning in the Pangeo community.' |
| 14 | +--- |
| 15 | + |
| 16 | +## TLDR |
| 17 | + |
| 18 | +Open source tools developed by the Pangeo ML community are enabling the shift to cloud-native geospatial Machine Learning. |
| 19 | +Join the [Pangeo ML community](https://pangeo.io/meeting-notes.html#working-group-meetings) working towards scalable [GPU-native](./xarray-kvikio) workflows! 🚀 |
| 20 | + |
| 21 | +## Overview |
| 22 | + |
| 23 | +### Building next-generation Machine Learning (ML) tools |
| 24 | + |
| 25 | +At FOSS4G SotM Oceania 2023, we presented on "The ecosystem of geospatial machine learning tools in the Pangeo World" (see the recording [here](https://www.youtube.com/watch?v=X2LBuUfSo5Q)). |
| 26 | +One of the driving forces of the Pangeo community is to build better tools that will enable scientific workflows on petabyte-scale datasets, such as Climate/Weather projections that will impact the planet over the coming decades. |
| 27 | + |
| 28 | +To do that, we need to be fast. |
| 29 | + |
| 30 | +These next-generation tools need to be scalable, efficient, and modular. |
| 31 | +So we are designing them with three aspects in mind: |
| 32 | + |
| 33 | +- Work with cloud-native data using **GPU-native** compute |
| 34 | +- Be able to **stream** subsets of data on-the-fly |
| 35 | +- Go from single sensor to **multi-modal** models |
| 36 | + |
| 37 | +None of these core technologies is particularly new. |
| 38 | +NVIDIA has been leading the development of GPU-native [RAPIDS AI](https://rapids.ai) libraries since 2018. |
| 39 | +Streaming has been around since the 2010s if not earlier, and is practically the most common way to consume music and video content nowadays. |
| 40 | +Since then, we have also seen the rise in [multi-modal Foundation Models](https://doi.org/10.48550/arXiv.2309.10020) that are able to take in visual (image) and language (text and sound) cues. |
| 41 | + |
| 42 | +Let's now take a step back, and picture what we're working with. |
| 43 | + |
| 44 | +## Layers of the Pangeo Machine Learning stack |
| 45 | + |
| 46 | + |
| 47 | + |
| 48 | +There are three main layers to a Machine Learning data pipeline. |
| 49 | +It starts with data storage file formats at the bottom row, an in-memory array representation in the middle, and high-level libraries and documentation resources that users or developers interact with at the top. |
| 50 | + |
| 51 | +The key to connecting all of these layers is open standards. |
| 52 | + |
| 53 | +### Cloud-native geospatial file formats |
| 54 | + |
| 55 | +For the file formats, we favour [cloud-native geospatial](https://www.ogc.org/ogc-topics/cloud-native-geospatial) because it allows us to efficiently access subsets of data without reading the entire file. |
| 56 | +Generally speaking, you would store rasters as [Zarr](https://zarr.dev) or [Cloud-Optimized GeoTIFFs](https://www.cogeo.org), and vectors (points/lines/polygons) in [FlatGeobuf](https://flatgeobuf.org) or [(Geo)Parquet](https://geoparquet.org). |
| 57 | +Ideally though, these files would be indexed using a [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org) which makes it easier to discover datasets using standardized queries. |
| 58 | +This can be a whole topic in itself, so check out this [guide](https://guide.cloudnativegeo.org) [published in October 2023](https://cloudnativegeo.org/blog/2023/10/introducing-the-cloud-optimized-geospatial-formats-guide) for more details! |
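
The idea of reading a subset without touching the rest of the file can be illustrated with a toy example that uses only the Python standard library. This is a rough sketch, not how Zarr or Cloud-Optimized GeoTIFFs are implemented: real formats add metadata, compression, and HTTP range requests. The names `write_chunked`, `read_chunk`, and `CHUNK_VALUES` are hypothetical and chosen only for illustration.

```python
import struct
import tempfile
from pathlib import Path

CHUNK_VALUES = 4  # number of float64 values per chunk (illustrative choice)
CHUNK_BYTES = CHUNK_VALUES * 8  # each float64 occupies 8 bytes


def write_chunked(path: Path, values: list[float]) -> None:
    """Store values back-to-back so chunk i starts at byte i * CHUNK_BYTES."""
    path.write_bytes(struct.pack(f"<{len(values)}d", *values))


def read_chunk(path: Path, index: int) -> tuple[float, ...]:
    """Seek straight to one chunk and read only its bytes."""
    with path.open("rb") as f:
        f.seek(index * CHUNK_BYTES)
        raw = f.read(CHUNK_BYTES)
    return struct.unpack(f"<{len(raw) // 8}d", raw)


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "toy.bin"
    write_chunked(store, [float(i) for i in range(12)])  # 3 chunks of 4 values
    subset = read_chunk(store, index=1)  # reads 32 of the 96 bytes on disk

print(subset)
```

Because every chunk sits at a predictable byte offset, a reader can fetch just the part it needs, which is the same property that makes chunked formats practical over a network.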
| 59 | + |
| 60 | +### In-memory array representations |
| 61 | + |
| 62 | +In the Python world, [NumPy](https://numpy.org) arrays have been the core way of representing arrays in-memory, but there are many others too, along with an ongoing movement to standardize the array/dataframe API at [https://data-apis.org](https://data-apis.org). |
| 63 | +Geospatial folks would most likely be familiar with vector libraries like [GeoPandas](https://geopandas.org) GeoDataFrames (built on top of [pandas](https://pandas.pydata.org)); or raster libraries like [rioxarray](https://corteva.github.io/rioxarray) and [stackSTAC](https://stackstac.readthedocs.io) that read into [xarray](https://xarray.dev) data structures. |
| 64 | + |
| 65 | +NumPy arrays are CPU-based, but there are also libraries like [CuPy](https://cupy.dev) which can do GPU-accelerated computations. |
| 66 | +Instead of GeoPandas, you could use libraries like [cuSpatial](https://docs.rapids.ai/api/cuspatial) (built on top of [cuDF](https://docs.rapids.ai/api/cudf) and part of [RAPIDS AI](https://rapids.ai)) to run GPU-accelerated algorithms. |
| 67 | +Deep Learning libraries like [PyTorch](https://pytorch.org/docs), [TensorFlow](https://www.tensorflow.org) or [JAX](https://jax.readthedocs.io) tend to be GPU-based as well, but there are also libraries like [Datashader](https://datashader.org) (for visualization) and [Xarray](https://xarray.dev) that are designed to be CPU/GPU agnostic and can hold either. |
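
To make the CPU/GPU-agnostic idea concrete, here is a small sketch of a function that runs on either NumPy or CuPy arrays. It assumes NumPy is installed and treats CuPy as optional (CuPy needs a CUDA-capable GPU); the function name `normalize` is hypothetical. `cupy.get_array_module` returns whichever module owns the input array.

```python
import numpy as np

try:
    import cupy as cp  # optional; only importable on a machine set up for CUDA
except ImportError:
    cp = None


def normalize(x):
    """Scale values to the 0-1 range using whichever library owns `x`."""
    xp = cp.get_array_module(x) if cp is not None else np
    return (x - xp.min(x)) / (xp.max(x) - xp.min(x))


cpu_result = normalize(np.array([2.0, 4.0, 6.0]))
print(type(cpu_result).__name__, cpu_result)
```

Passing a `cupy.ndarray` instead would keep the computation on the GPU without changing the function body.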
| 68 | + |
| 69 | +### High-level Pangeo ML libraries |
| 70 | + |
| 71 | +Finally, to make life simpler, we have high-level convenience libraries wrapping the low-level stuff. |
| 72 | +These are designed to have a nicer user interface to connect the underlying file formats and in-memory array representations. |
| 73 | +The [Pangeo Machine Learning Working Group](https://pangeo.io/meeting-notes.html#working-group-meetings) mostly works on Climate/Weather datasets, so we'll focus on multi-dimensional arrays for now. |
| 74 | + |
| 75 | +Stepping into the GPU-native world, [cupy-xarray](https://cupy-xarray.readthedocs.io) allows users to use GPU-backed CuPy arrays in n-dimensional Xarray data structures (see our previous [blog post](./cupy-tutorial) on this). |
| 76 | +An exciting development on this front is the experimental [kvikIO](https://github.com/rapidsai/kvikio) engine that enables low-latency reading of data from Zarr stores into GPU memory using NVIDIA GPUDirect Storage technology (see this [blog post](./xarray-kvikio)). |
| 77 | +[Preliminary benchmarks](https://github.com/zarr-developers/zarr-benchmark/discussions/14) suggest that the GPU-based `kvikIO` engine can take about 25% less time for data reads compared to the regular CPU-based `zarr` engine! |
| 78 | + |
| 79 | +Once you have tensors loaded (lazily) into an Xarray data structure, [xbatcher](https://xbatcher.readthedocs.io) enables efficient iteration over batches of data in a streaming fashion. |
| 80 | +This library makes it easier to train machine learning models on big datacubes such as time-series datasets or multi-variate ocean/climate model outputs, as users can do on-the-fly slicing using named variables (more readable than numbered indexes). |
| 81 | +There is also an experimental [cache mechanism](https://github.com/xarray-contrib/xbatcher/pull/167) we'd like more people to try and provide feedback on! |
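
As a rough illustration of slicing by named dimensions, the sketch below hand-rolls a patch generator with plain Xarray. It is not xbatcher's implementation or API (xbatcher provides its own `BatchGenerator` for this), and the helper name `iter_patches` and the toy `temperature` cube are assumptions made for the example.

```python
import numpy as np
import xarray as xr

# Toy datacube: 6 time steps on a 4 x 4 grid (values are arbitrary).
da = xr.DataArray(
    np.arange(6 * 4 * 4, dtype="float32").reshape(6, 4, 4),
    dims=("time", "lat", "lon"),
    name="temperature",
)


def iter_patches(array, lat_size, lon_size):
    """Yield non-overlapping spatial patches, keeping every time step."""
    for i in range(0, array.sizes["lat"] - lat_size + 1, lat_size):
        for j in range(0, array.sizes["lon"] - lon_size + 1, lon_size):
            # Named dimensions state the intent, unlike array[:, i:i+2, j:j+2]
            yield array.isel(
                lat=slice(i, i + lat_size), lon=slice(j, j + lon_size)
            )


patches = list(iter_patches(da, lat_size=2, lon_size=2))
print(len(patches), dict(patches[0].sizes))
```

Each patch is produced only when the loop asks for it, which is what allows a training loop to stream through a datacube that does not fit in memory.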
| 82 | + |
| 83 | +To connect all of the pieces, [zen3geo](https://zen3geo.readthedocs.io) implements Composable DataPipes for geospatial. |
| 84 | +It acts as the glue to chain together different building blocks, such as readers for Vector/Raster file formats, converters between different in-memory array representations, and even custom pre-processing functions. |
| 85 | +The composable design pattern makes it well suited for building complex machine learning data pipelines for multi-modal models that can take in different inputs (e.g. Images, Point Clouds, Trajectory, Text/Sound, etc). |
| 86 | +Going forward, there are plans to [refactor the backend to be asynchronous-first](https://github.com/weiji14/zen3geo/discussions/117) to overcome I/O bottlenecks. |
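
The composable pattern itself can be sketched in a few lines of standard-library Python. This is not zen3geo's API: the `Pipe` class, its `map` and `batch` methods, and the placeholder file names are all hypothetical, and the lambdas only stand in for real readers and pre-processing steps.

```python
from typing import Callable, Iterable, Iterator


class Pipe:
    """Hypothetical mini pipeline: each method returns a new, lazy Pipe."""

    def __init__(self, source: Iterable):
        self.source = source

    def __iter__(self) -> Iterator:
        return iter(self.source)

    def map(self, fn: Callable) -> "Pipe":
        # Generator expression: fn runs only when items are requested
        return Pipe(fn(item) for item in self.source)

    def batch(self, size: int) -> "Pipe":
        def batches():
            chunk = []
            for item in self.source:
                chunk.append(item)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:  # emit the final, possibly smaller, batch
                yield chunk

        return Pipe(batches())


filenames = ["scene_a.tif", "scene_b.tif", "scene_c.tif"]  # placeholders, never opened
pipeline = (
    Pipe(filenames)
    .map(lambda name: {"file": name})  # stand-in for a file reader
    .map(lambda rec: {**rec, "scaled": True})  # stand-in for pre-processing
    .batch(2)
)
result = list(pipeline)
print(result)
```

Because every step takes an iterable and returns one, steps can be reordered or swapped without touching the others, which is the property that makes this style attractive for multi-modal pipelines.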
| 87 | + |
| 88 | +## Summary |
| 89 | + |
| 90 | +We've presented a snapshot of the Pangeo Machine Learning ecosystem from 2023. |
| 91 | +The basis of any machine learning project is the data, and we touched on how cloud-native geospatial file formats and in-memory array representations built on open standards act as the foundation for our work. |
| 92 | +Lastly, we highlighted some of the high-level Pangeo ML libraries enabling user friendly access to GPU-native compute, streaming data batches, and composable geospatial data pipelines. |
| 93 | + |
| 94 | +## Where to learn more |
| 95 | + |
| 96 | +- Educational resources: |
| 97 | + |
| 98 | + - [Project Pythia Cookbooks](https://cookbooks.projectpythia.org) |
| 99 | + - [GeoSMART Machine Learning Curriculum](https://geo-smart.github.io/mlgeo-book) |
| 100 | + - [University of Washington Hackweeks as a Service](https://guidebook.hackweek.io) |
| 101 | + |
| 102 | +- Pangeo ML Working Group: |
| 103 | + |
| 104 | + - [Monthly meetings](https://pangeo.io/meeting-notes.html#working-group-meetings) |
| 105 | + - [Discourse Forum](https://discourse.pangeo.io/tag/machine-learning) |
| 106 | + |
| 107 | +## Acknowledgments |
| 108 | + |
| 109 | +The work above is the cumulative effort of folks from the Pangeo, Xarray and RAPIDS AI community, plus more! |
| 110 | +In particular, we'd like to acknowledge [Deepak Cherian](https://github.com/dcherian) at [Earthmover](https://earthmover.io) and [Negin Sobhani](https://github.com/negin513) at [NCAR](https://ncar.ucar.edu) for their work on cupy-xarray/kvikIO, |
| 111 | +[Max Jones](https://github.com/maxrjones) at [Carbonplan](https://carbonplan.org) for recent developments on the xbatcher package, |
| 112 | +and [Wei Ji Leong](https://github.com/weiji14) at [Development Seed](https://developmentseed.org) for the development of zen3geo. Xbatcher development was partly funded by NASA's Advancing Collaborative Connections for Earth System Science (ACCESS) award "Pangeo ML - Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data" (80NSSC21M0065). Cupy-Xarray development was partly funded by NSF EarthCube award ["Jupyter Meets the Earth" (1928374)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1928374); and NASA's Open Source Tools, Frameworks, and Libraries award "Enhancing analysis of NASA data with the open-source Python Xarray Library" (80NSSC22K0345). |
| 113 | + |
| 114 | +## Appendix I: Further Reading |
| 115 | + |
| 116 | +- [The Composable Codex](https://voltrondata.com/codex) |
| 117 | +- [zen3geo 2022 Pangeo ML Working Group presentation](https://discourse.pangeo.io/t/monday-november-07-2022-machine-learning-working-group-presentation-zen3geo-guiding-earth-observation-data-on-its-path-to-enlightenment-by-wei-ji-leong/2883) ([recording](https://www.youtube.com/watch?v=8uhOtQUTuDg)) |
| 118 | +- [Xbatcher 2023 AMS presentation](https://doi.org/10.6084/m9.figshare.22264072.v1) ([recording](https://ams.confex.com/recording/ams/103ANNUAL/mp4/CGNTFL54WCL/67cfb841cba94216ff99f1eb15286ba2/session63444_5.mp4) (starts at 45:30)) |
| 119 | +- [CuPy-Xarray tutorial at SciPy 2023](https://doi.org/10.5281/zenodo.8247471) ([jupyter-book](https://negin513.github.io/cupy-xarray-tutorials/README.html)) |
| 120 | +- [Pangeo ML Ecosystem presentation at FOSS4G SotM Oceania 2023](https://github.com/weiji14/foss4g2023oceania) ([recording](https://www.youtube.com/watch?v=X2LBuUfSo5Q)) |
| 121 | +- [Earthmover blog post on cloud native data loaders for machine learning using xarray and zarr](https://earthmover.io/blog/cloud-native-dataloader) |
| 122 | +- [Development Seed blog post on GPU-native machine learning](https://developmentseed.org/blog/2024-03-19-combining-cloud-gpu-native) |