soildb

PyPI version License: MIT

Python client for the USDA-NRCS Soil Data Access (SDA) web service, Air and Water Database (AWDB), and other National Cooperative Soil Survey data sources.

Overview

soildb provides Python access to the USDA Soil Data Access (SDA) web service https://sdmdataaccess.nrcs.usda.gov/.

Query soil survey data, export to pandas/polars DataFrames, and handle spatial queries.

Installation

pip install soildb

For spatial functionality:

pip install soildb[spatial]

For all optional features support:

pip install soildb[all]

Features

  • Query soil survey data from SDA
  • Export to pandas and polars DataFrames
  • Build custom SQL queries with fluent interface
  • Spatial queries with points, bounding boxes, and polygons
  • Bulk data fetching with automatic pagination
  • Async I/O for high performance and concurrency
  • Access to AWDB (Air and Water Database) for real-time monitoring data
  • SCAN/SNOTEL station data with soil moisture, temperature, snow, and weather data
  • Advanced station filtering with wildcards and sensor discovery
  • Multi-depth soil data queries and forecast data retrieval

Quick Start

Query Builder

This is a basic example of building a custom query and getting the results:

from soildb import Query
    
query = (Query()
        .select("mukey", "muname", "musym")
        .from_("mapunit")
        .inner_join("legend", "mapunit.lkey = legend.lkey")
        .where("areasymbol = 'IA109'")
        .limit(5))
    
# inspect query
print(query.to_sql())

result = await soildb.SDAClient().execute(query)

df = result.to_pandas()
print(df.head())

Async Setup

You may have noticed that we need to await the query execution result.

All soildb functions are async. Here’s how to run them in different environments like Jupyter notebooks, VSCode, or regular Python scripts.

Basic Async Execution

import asyncio
import soildb

async def main():
    # Your async code here
    mapunits = await soildb.get_mapunit_by_areasymbol("IA109")
    df = mapunits.to_pandas()
    return df

# Handle different environments
try:
    # Check if there's already an event loop (Jupyter, etc.)
    loop = asyncio.get_running_loop()
    import nest_asyncio
    nest_asyncio.apply()
    result = loop.run_until_complete(main())
except RuntimeError:
    # No existing loop, use asyncio.run()
    result = asyncio.run(main())

result

For comprehensive async usage, see the Async Programming Guide.

Getting Started

New to soildb? Start with: 1. Quick Start — Choose your pattern (script vs app) 2. Common Workflows — Detailed task-based examples 3. Examples — Runnable code samples

Convenience Functions

soildb provides several high-level functions for common tasks:

async with soildb.SDAClient() as client:
    mapunits = await soildb.get_mapunit_by_areasymbol("IA109", client=client)
    df = mapunits.to_pandas()
    print(f"Found {len(df)} map units")
    df.head()

If you have suggestions for new convenience functions please file a “feature request” on GitHub.

Spatial Queries

soildb also offers support for queries by location via spatial_query(). You can specify arbitrary geometry to target several spatial and tabular types of results.

import asyncio

async def spatial_query_example():
    from soildb import spatial_query
    
    # Point query
    async with soildb.SDAClient() as client:
        response = await spatial_query(
            geometry="POINT (-93.6 42.0)",
            table="mupolygon",
            spatial_relation="intersects"
        )
        df = response.to_pandas()
        print(f"Point query found {len(df)} results")
        return df

# Handle different environments
try:
    # Check if there's already an event loop (Jupyter, etc.)
    loop = asyncio.get_running_loop()
    import nest_asyncio
    nest_asyncio.apply()
    result = loop.run_until_complete(spatial_query_example())
except RuntimeError:
    # No existing loop, use asyncio.run()
    result = asyncio.run(spatial_query_example())

result

Bulk Data Fetching

soildb makes it easy to retrieve large datasets efficiently, using concurrent requests and built-in functions that automatically handle pagination.

import asyncio

async def bulk_fetch_example():
    from soildb import fetch_by_keys, get_mukey_by_areasymbol
    
    # Get mukeys for multiple areas concurrently
    areas = ["IA109", "IA113", "IA117"]
    mukeys_tasks = [
        get_mukey_by_areasymbol([area]) 
        for area in areas
    ]
    
    # Execute all mukey requests concurrently
    mukeys_results = await asyncio.gather(*mukeys_tasks)
    
    # Flatten the results (each task returns a list)
    all_mukeys = []
    for mukeys in mukeys_results:
        all_mukeys.extend(mukeys)
    
    print(f"Found {len(all_mukeys)} mukeys across {len(areas)} areas")
    
    # Fetch data in chunks automatically
    response = await fetch_by_keys(
        all_mukeys, 
        "component", 
        key_column="mukey", 
        chunk_size=100,
        columns=["mukey", "cokey", "compname", "localphase", "comppct_r"]
    )
    df = response.to_pandas()
    print(f"Fetched {len(df)} component records")
    return df

# Handle different environments
try:
    # Check if there's already an event loop (Jupyter, etc.)
    loop = asyncio.get_running_loop()
    import nest_asyncio
    nest_asyncio.apply()
    result = loop.run_until_complete(bulk_fetch_example())
except RuntimeError:
    # No existing loop, use asyncio.run()
    result = asyncio.run(bulk_fetch_example())

result.head(10)

The component table has a hierarchical relationship:

  • mukey (map unit key) is the parent
  • cokey (component key) is the child

So when fetching components, you typically want to filter by mukey to get all components for specific map units.

Use fetch_by_keys() with key_column="mukey" to achieve this and demonstrate pagination over chunks with 100 rows each.

AWDB (Air and Water Database) Access

NEW in soildb: Access real-time monitoring data from SCAN, SNOTEL, and other NRCS networks!

Quick AWDB Start

from soildb.awdb import AWDBClient

# Get soil moisture data from SCAN stations
async def awdb_example():
    async with AWDBClient() as client:
        # Find stations with soil moisture sensors in California
        stations = await client.get_stations(
            elements=['SMS:*'],  # Soil moisture sensors at any depth
            state_codes=['CA']
        )

        # Get multi-depth soil moisture data
        if stations:
            station = stations[0]
            data = await client.get_station_data(
                station.station_triplet,
                'SMS',  # Soil moisture element
                '2024-01-01',
                '2024-01-10'
            )
            print(f"Retrieved {len(data)} soil moisture readings")

# Handle async execution
try:
    loop = asyncio.get_running_loop()
    import nest_asyncio
    nest_asyncio.apply()
    loop.run_until_complete(awdb_example())
except RuntimeError:
    asyncio.run(awdb_example())

AWDB Features

  • Station Discovery: Find stations by location, network, sensors, or name patterns
  • Multi-Depth Soil Data: Query soil moisture and temperature at multiple depths
  • Real-Time Weather: Access precipitation, temperature, wind, and solar radiation
  • Snow Data: SNOTEL snow water equivalent, depth, and snowmelt data
  • Forecast Data: Water supply and drought forecasts
  • Quality Control: Access to quality flags and original values
  • Wildcard Filtering: Advanced station filtering with * wildcards

Advanced Station Filtering

from soildb.awdb.convenience import discover_stations

# Find all SNOTEL stations in Oregon or Washington
snotel_stations = await discover_stations(
    station_triplets=['*:OR:SNTL', '*:WA:SNTL']
)

# Find stations with soil moisture sensors in specific HUC area
soil_stations = await discover_stations(
    elements=['SMS:*'],
    hucs=['170601*']  # Columbia River basin
)

# Find stations by name pattern
lake_stations = await discover_stations(
    station_names=['*Lake*']
)

Multi-Depth Soil Data

from soildb.awdb.convenience import get_soil_moisture_by_depth

# Get soil moisture at multiple depths for a location
soil_data = await get_soil_moisture_by_depth(
    latitude=37.5,
    longitude=-120.5,
    start_date='2024-01-01',
    end_date='2024-01-10',
    depths=[-2, -4, -8, -20, -40]  # Negative = depth below surface
)

# Each depth returns time series data
for depth, readings in soil_data.items():
    print(f"Depth {depth}\": {len(readings)} readings")

Examples

See the examples/ directory and documentation for detailed usage patterns.

License

This project is licensed under the MIT License. See the LICENSE file for details.