Integrate ZFS pool health and alerting into the monitoring stack #2

Open

opened 2026-03-07 06:35:56 +00:00 by adamksmith (Owner) · 0 comments

## Problem

The JBOD monitor currently treats ZFS pool membership as a static label — it shows which pool a drive belongs to, but has zero visibility into actual ZFS health. Pool degradation, scrub errors, capacity warnings, and resilver progress are all invisible until you SSH in and run `zpool status` manually.

Since we already have the drive-level SMART data and enclosure topology, ZFS health is the missing layer that ties physical hardware state to logical storage state.

## Proposed Solution

Add a ZFS health module that polls pool status, scrub history, and per-vdev error counters, then surfaces alerts through both the API and the frontend dashboard.

### Architecture

```
Background Poller (existing smart_poll_loop, extended)
  ├── smartctl per-drive       → Redis jbod:smart:{device}
  ├── zpool status -P          → Redis jbod:zfs:pools
  ├── zpool list -Hp           → Redis jbod:zfs:capacity
  └── zpool events -H (tail)   → Redis jbod:zfs:events (ring buffer)

API
  ├── GET /api/zfs/pools          → All pools with health, vdev tree, errors
  ├── GET /api/zfs/pools/{name}   → Single pool detail
  ├── GET /api/zfs/alerts         → Active alerts (degraded, errors, capacity, scrub overdue)
  └── GET /api/overview           → Extended with zfs_healthy, zfs_alerts[] top-level fields
```

### Implementation Details

#### 1. ZFS Pool Poller (`services/zfs.py`)

Extend the existing `get_zfs_pool_map()` into a full ZFS health service:

```python
import asyncio

async def get_zfs_pool_health() -> list[ZFSPool]:
    """
    Parse `zpool status -P` and `zpool list -Hp` to build the full pool health model.
    """
    # Pool status (state, vdev tree, errors, scrub status)
    proc = await asyncio.create_subprocess_exec(
        "zpool", "status", "-P",
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    pools = parse_zpool_status(stdout.decode())

    # Capacity and fragmentation
    proc2 = await asyncio.create_subprocess_exec(
        "zpool", "list", "-Hp", "-o",
        "name,size,alloc,free,frag,cap,health",
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE,
    )
    stdout2, _ = await proc2.communicate()
    merge_capacity_data(pools, stdout2.decode())

    return pools
```
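A possible shape for the `merge_capacity_data` helper referenced above (a sketch, not the final implementation; it assumes tab-separated `zpool list -Hp` output and pool objects carrying the capacity fields from the schema in section 2):

```python
def merge_capacity_data(pools: list, raw: str) -> None:
    """Merge `zpool list -Hp -o name,size,alloc,free,frag,cap,health`
    output into already-parsed pool objects, matching by pool name.

    With -H the output has no header and columns are tab-separated;
    with -p the numeric columns are exact byte counts / plain numbers.
    """
    by_name = {p.name: p for p in pools}
    for line in raw.strip().splitlines():
        name, size, alloc, free, frag, cap, _health = line.split("\t")
        pool = by_name.get(name)
        if pool is None:
            continue  # pool appeared between the two zpool invocations
        pool.size_bytes = int(size)
        pool.allocated_bytes = int(alloc)
        pool.free_bytes = int(free)
        # Depending on ZFS version, frag/cap may still carry a trailing "%"
        pool.fragmentation_pct = float(frag.rstrip("%"))
        pool.capacity_pct = float(cap.rstrip("%"))
```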

#### 2. Data Model (`models/zfs_schemas.py`)

```python
from datetime import datetime

from pydantic import BaseModel

class ZFSVdev(BaseModel):
    name: str                    # e.g. "raidz2-0", "mirror-1", "/dev/sdX"
    type: str                    # "raidz2", "mirror", "disk", "spare", "cache", "log"
    state: str                   # "ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"
    read_errors: int
    write_errors: int
    checksum_errors: int
    children: list["ZFSVdev"]    # Recursive for vdev tree
    device: str | None = None    # Base device name if leaf (e.g. "sda")
    slow_ios: int | None = None  # If available

class ZFSScrub(BaseModel):
    state: str                   # "scrub repaired", "scrub in progress", "none requested"
    started: datetime | None
    finished: datetime | None
    duration_seconds: int | None
    errors_repaired: int
    bytes_scanned: int | None
    percent_complete: float | None  # For in-progress scrubs

class ZFSPool(BaseModel):
    name: str
    state: str                   # "ONLINE", "DEGRADED", "FAULTED", "SUSPENDED"
    status_message: str | None   # Free-text status line from zpool status
    size_bytes: int
    allocated_bytes: int
    free_bytes: int
    fragmentation_pct: float
    capacity_pct: float
    vdevs: list[ZFSVdev]
    scrub: ZFSScrub | None
    errors: str                  # "No known data errors" or error description
```
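Because `ZFSVdev.children` is recursive, consumers such as the alert engine will need to walk the tree to reach the leaf disks. A minimal traversal (hypothetical helper names, not part of the schema) might look like:

```python
def iter_leaf_vdevs(vdevs):
    """Depth-first walk of a vdev tree, yielding leaf vdevs
    (those with no children, i.e. actual disks)."""
    for vdev in vdevs:
        if vdev.children:
            yield from iter_leaf_vdevs(vdev.children)
        else:
            yield vdev

def devices_with_errors(pool):
    """Return leaf vdevs with any nonzero read/write/checksum counter."""
    return [
        v for v in iter_leaf_vdevs(pool.vdevs)
        if v.read_errors or v.write_errors or v.checksum_errors
    ]
```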

#### 3. Alert Engine (`services/zfs_alerts.py`)

Generate alerts from pool health data. Each alert has a severity and maps back to the physical drive/enclosure when possible.

```python
from typing import Literal

from pydantic import BaseModel

class ZFSAlert(BaseModel):
    severity: Literal["critical", "warning", "info"]
    pool: str
    category: str       # "degraded", "checksum", "capacity", "scrub", "resilver", "faulted"
    message: str
    device: str | None  # Physical device if applicable
    slot: int | None    # Enclosure slot if resolvable
    enclosure_id: str | None

ALERT_RULES = [
    # Critical
    ("pool state FAULTED",           "critical", "faulted"),
    ("pool state SUSPENDED",         "critical", "faulted"),
    ("vdev state FAULTED",           "critical", "faulted"),
    ("vdev state UNAVAIL",           "critical", "faulted"),
    ("capacity_pct >= 90",           "critical", "capacity"),
    ("scrub age > 30 days",          "critical", "scrub"),

    # Warning
    ("pool state DEGRADED",          "warning",  "degraded"),
    ("vdev state DEGRADED",          "warning",  "degraded"),
    ("vdev state REMOVED",           "warning",  "degraded"),
    ("read_errors > 0",              "warning",  "checksum"),
    ("write_errors > 0",             "warning",  "checksum"),
    ("checksum_errors > 0",          "warning",  "checksum"),
    ("capacity_pct >= 80",           "warning",  "capacity"),
    ("scrub age > 14 days",          "warning",  "scrub"),

    # Info
    ("resilver in progress",         "info",     "resilver"),
]
```
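A sketch of how the rule table could be evaluated against one pool (simplified and hypothetical: plain dataclasses stand in for the Pydantic models, and only the pool-state and capacity rules are shown):

```python
from dataclasses import dataclass

@dataclass
class Alert:  # Stand-in for ZFSAlert
    severity: str
    pool: str
    category: str
    message: str

def evaluate_pool(name: str, state: str, capacity_pct: float) -> list[Alert]:
    """Apply a subset of ALERT_RULES to one pool's summary fields."""
    alerts: list[Alert] = []
    if state in ("FAULTED", "SUSPENDED"):
        alerts.append(Alert("critical", name, "faulted", f"Pool {name} is {state}"))
    elif state == "DEGRADED":
        alerts.append(Alert("warning", name, "degraded", f"Pool {name} is DEGRADED"))
    # Capacity thresholds: emit only the most severe matching rule
    if capacity_pct >= 90:
        alerts.append(Alert("critical", name, "capacity",
                            f"Pool {name} at {capacity_pct:.1f}% capacity"))
    elif capacity_pct >= 80:
        alerts.append(Alert("warning", name, "capacity",
                            f"Pool {name} at {capacity_pct:.1f}% capacity"))
    return alerts
```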

**Key feature:** When an alert references a device (e.g., `sda` has checksum errors), cross-reference the enclosure topology to include the physical slot and enclosure ID. This lets the frontend highlight the exact bay in the grid view.

#### 4. Redis Cache Keys

| Key | Value | TTL |
|---|---|---|
| `jbod:zfs:pools` | JSON array of ZFSPool | 120s |
| `jbod:zfs:alerts` | JSON array of ZFSAlert | 120s |
| `jbod:zfs:pool:{name}` | JSON single pool detail | 120s |

Polled in the same background loop as SMART data.
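Writing the poll results under these keys might look like the following (a sketch: it assumes an async Redis client exposing `setex`, e.g. `redis.asyncio.Redis`, and takes pools/alerts already serialized to dicts):

```python
import json

ZFS_TTL_SECONDS = 120  # matches the TTLs in the table above

async def cache_zfs_state(redis, pools: list[dict], alerts: list[dict]) -> None:
    """Store serialized pool and alert state in Redis with a 120s TTL.
    `redis` is any client with an async setex(key, ttl, value) method."""
    await redis.setex("jbod:zfs:pools", ZFS_TTL_SECONDS, json.dumps(pools))
    await redis.setex("jbod:zfs:alerts", ZFS_TTL_SECONDS, json.dumps(alerts))
    for pool in pools:
        await redis.setex(
            f"jbod:zfs:pool:{pool['name']}", ZFS_TTL_SECONDS, json.dumps(pool)
        )
```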

#### 5. API Endpoints

**`GET /api/zfs/pools`**
Returns all pools with full health, vdev tree, scrub status, capacity.

**`GET /api/zfs/pools/{name}`**
Single pool detail with full vdev tree expanded.

**`GET /api/zfs/alerts`**
Active alerts only. Filterable by `?severity=critical&pool=tank`.
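The filtering behind `?severity=` and `?pool=` can live in a small pure helper so the route handler stays thin (hypothetical helper; alerts are treated as objects with `severity` and `pool` attributes):

```python
def filter_alerts(alerts, severity=None, pool=None):
    """Return alerts matching the optional severity and pool filters;
    with no filters, everything is returned."""
    return [
        a for a in alerts
        if (severity is None or a.severity == severity)
        and (pool is None or a.pool == pool)
    ]
```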

**`GET /api/overview`** (extended)
Add top-level fields:

```json
{
  "healthy": true,
  "drive_count": 78,
  "warning_count": 4,
  "error_count": 0,
  "zfs_healthy": true,
  "zfs_alerts": [
    {
      "severity": "warning",
      "pool": "archive",
      "category": "scrub",
      "message": "Last scrub completed 18 days ago",
      "device": null,
      "slot": null,
      "enclosure_id": null
    }
  ],
  "zfs_pools": [
    {"name": "tank", "state": "ONLINE", "capacity_pct": 72.3, "drive_count": 46},
    {"name": "fast", "state": "ONLINE", "capacity_pct": 45.1, "drive_count": 12},
    {"name": "archive", "state": "ONLINE", "capacity_pct": 68.8, "drive_count": 18}
  ],
  "enclosures": [ ... ]
}
```
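One way to derive the new top-level fields (a sketch; the definition of `zfs_healthy` used here — all pools ONLINE and no critical alerts — is an assumption, as is `drive_count` being precomputed on each pool):

```python
def build_zfs_overview(pools, alerts):
    """Summarize ZFS state for the extended /api/overview payload."""
    return {
        "zfs_healthy": (
            all(p.state == "ONLINE" for p in pools)
            and not any(a.severity == "critical" for a in alerts)
        ),
        "zfs_alerts": alerts,
        "zfs_pools": [
            {
                "name": p.name,
                "state": p.state,
                "capacity_pct": p.capacity_pct,
                # assumed to be derived from the pool's leaf vdevs
                "drive_count": getattr(p, "drive_count", None),
            }
            for p in pools
        ],
    }
```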

#### 6. Frontend Integration

Extend the existing dashboard:

- **Alert banner** — new banner at the top of the page (above the stat cards) when `zfs_alerts` is non-empty; color-coded by severity, dismissible per session
- **Pool health stat cards** — a new row below the existing drive stats showing each pool's state, capacity bar, and last scrub age
- **Grid view overlay** — when a ZFS alert references a specific device, pulse/highlight that slot in the enclosure grid (reuse the warning amber or add a new "zfs-error" red ring)
- **Drive detail modal** — a "ZFS Status" section below the existing ZFS Membership card showing the vdev path and error counts (read/write/checksum) for that drive within its pool

### Acceptance Criteria

- [ ] `services/zfs.py` parses `zpool status -P` and `zpool list -Hp` into typed models
- [ ] Alert engine generates alerts from pool health with configurable thresholds
- [ ] Alerts cross-reference enclosure topology to map device → slot → enclosure
- [ ] Background poller includes ZFS data in the same poll loop
- [ ] ZFS data cached in Redis with appropriate TTLs
- [ ] `/api/zfs/pools` and `/api/zfs/alerts` endpoints working
- [ ] `/api/overview` extended with `zfs_healthy`, `zfs_alerts`, `zfs_pools` fields
- [ ] Frontend: alert banner, pool stat cards, drive-level error counts in detail modal
- [ ] Graceful handling when ZFS is not installed or no pools exist
Reference: adamksmith/jbod-monitor#2