Intro
Two months ago I posted a demo where Claude triaged a broken SIEM feed end to end. It worked. Then I posted about how risky that same shape is when you wire it into production. Both were true. Both bothered me.
So I went quiet for a while and built the version I’d actually run.
This post is that version. A small Python service that watches every feed in your Google SecOps tenant, runs three independent checks per feed, tries to fix what it can, files one Jira ticket per outage with a short Gemini summary, and emails the feed owner before anyone has to triage it.
The LLM is in the system. It is not driving the system. That distinction is the entire point.
Why Feed Health Is Still A Problem In 2026
Every SIEM has the same blind spot. You know the moment a rule fires. You usually do not know the moment a feed stops sending logs. The dashboard still looks green. The query returns zero results because there is no data, not because nothing is happening.
This used to be a slow problem. It is not anymore. Modern intrusions move in minutes, not hours. Initial access to lateral movement to data exfil can happen inside one shift. A feed that is broken for six hours is six hours of detection rules that cannot fire. A feed broken for two days is a serious incident waiting to be missed entirely.
Reviewing feeds by hand does not scale. You cannot click into every feed every morning and expect to catch the one that quietly stopped at 3am. The math does not work the moment you have more than a handful of sources.
So feeds need to be watched continuously, and when one breaks, the right human needs to know fast.
What Google SecOps Already Gives You
I want to be honest about prior art before talking about what this app adds. There is real work already published here, and you should know about it.
Google SecOps natively gives you three things relevant to feed health.
The Feeds page in the SecOps UI shows each feed's state and the timestamp of the last successful transfer. If a feed's authentication fails or its source is unreachable, you can see it. The error messages are sometimes helpful and sometimes not. There is no alerting attached to this view by default. You have to look.
Silent host monitoring flags individual hosts that stop sending logs after a period of activity. Useful for endpoint coverage. It is host level, not feed level, and it does not address feed configuration problems or per source ingestion drops.
The Data Ingestion and Health dashboard (the Health Hub) gives you a visual on volumes, parser errors, and feed status. Excellent at a glance. Not an alerting system on its own.
On top of that, Cloud Monitoring exposes the Chronicle ingestion metrics, so you can build standard Cloud Monitoring alert policies on bytes ingested, record counts, parser error counts, and quota-rejected bytes. This is the most powerful native option, and it is what the community has been building on for the last two years.
David French published a great two-part series in the Google Cloud Security community on monitoring your security data pipeline. Chris Martin (thatsiemguy on Medium) has a deep walkthrough on building Cloud Monitoring alerts from forwarder telemetry. The community thread on feed-level monitoring has good back-and-forth on what works in practice, including the limitations.
If you are running SecOps and you are not already reading those, do that first. They are the right starting point.
Where The Gaps Are
After reading all of that, and after running this stuff in production myself, here is what I kept hitting.
The native feed status only tells you if SecOps thinks the feed is broken. SecOps does not always think a feed is broken when it is. The classic case is a storage bucket that SecOps can read, list, and authenticate against, but no new files are landing in it. Feed state says ACTIVE. Logs are zero. Detection is silent. No one knows.
Cloud Monitoring alert policies catch ingestion drops, but the rolling window is capped at about a day. Week over week comparisons against a real baseline are not something you can express in the standard alert builder. So you end up with static thresholds that are either too noisy on bursty feeds or too loose on quiet ones.
Cloud Monitoring also does not know about your feed configuration, so the alert tells you “ingestion dropped” but not “this is the AzureAD_SignIns feed, owned by Alex, here is the last successful transfer time, here are the source settings.” Building that context per alert is on you.
Restarting a feed when it breaks is also on you. Most of the time the fix is the same five clicks (disable, wait, re-enable). That is fine for one feed once a quarter. It is not fine for thirty feeds.
Filing a clean ticket with all the context, looking up the owner, deciding whether a ticket already exists, deduping email noise, all of that is also on you.
So the gap I kept seeing was not “Cloud Monitoring cannot detect this.” Cloud Monitoring can detect a lot of it. The gap was that detection alone is not the job. The job is detection plus context plus a first attempt at remediation plus a clean handoff to the right human.
That is what I built.
What The App Does
It runs on a schedule as a Cloud Run Job. Every run, for every feed you have flagged as enabled, it does six things (sketched in code right after this list):
- Pulls a 30 day baseline of ingestion volume and flags drops with a modified Z score (median plus MAD, so one bad day cannot poison the baseline).
- Checks the feed state in Chronicle and runs a UDM search to confirm events are actually landing.
- If something is broken, it tries to restart the feed itself (disable, wait, enable, wait, recheck).
- If the feed is still broken after the restart, it files one Jira ticket. Deduped, so a stuck broken feed does not generate 48 tickets a day.
- Attaches a short PROBLEM and FIX summary written by Gemini to that ticket, so whoever picks it up does not start from zero.
- Emails the feed owner directly, so the right person knows before anyone has to assign the ticket.
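In code, the control flow per run is roughly the sketch below. The helper names (load_feeds, run_checks, restart_feed, file_jira_ticket, notify_owner) are illustrative stand-ins for the real modules, not the app's actual API:

def monitor_once(config):
    for feed in load_feeds(config):               # only feeds flagged enabled: true
        results = run_checks(feed)                # feed_state, gcp_metrics, udm_search
        if all(r.passed for r in results):
            continue
        if feed.auto_restart:
            restart_feed(feed)                    # disable, wait, re-enable, wait
            results = run_checks(feed)            # recheck before bothering anyone
            if all(r.passed for r in results):
                continue
        ticket = file_jira_ticket(feed, results)  # deduped; may find an existing open ticket
        if ticket.newly_created:                  # suppress the email if one already existed
            notify_owner(feed, ticket)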
There is no agent framework. There is no MCP. There is no model clicking buttons in a browser. A YAML file drives the whole thing. You define what gets scanned and what happens when something breaks. No code changes required.
Three Independent Checks Per Feed
This is the part I think actually matters compared to the native tools. Each check looks at a different layer, and the feed only passes if all configured checks pass. You can mix and match per feed.
feed_state asks Chronicle directly. If the platform reports anything other than ACTIVE or SUCCEEDED, the feed fails this check. This is the same signal the SecOps UI shows you, just programmatic.
gcp_metrics is the interesting one. It pulls the Cloud Monitoring ingestion metric for the last 30 days, buckets it by the hour, and compares the most recent bucket against the baseline. The baseline prefers same-time-of-day matches, so today’s 9am bucket gets compared to prior 9am buckets, not to 3am. That respects the natural shape of your traffic.
The math is a modified Z score using median and MAD instead of mean and standard deviation. That means one freak quiet day in your history cannot drag the baseline down and hide a real outage tomorrow. This is the thing Cloud Monitoring’s standard alert builder cannot do because of the rolling window cap.
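The core of that math fits in a few lines. A minimal sketch, assuming history holds hourly record counts from prior days at the same hour of day, with the standard 0.6745 constant that makes a MAD-based score read like an ordinary Z score:

import statistics

def modified_z_score(history, latest):
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        # perfectly flat history: any deviation at all is maximally suspicious
        return 0.0 if latest == med else float("-inf") if latest < med else float("inf")
    return 0.6745 * (latest - med) / mad

# a drop is flagged when the latest bucket sits far below the baseline, e.g.
# broken = modified_z_score(same_hour_history, latest_bucket) < -4.0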
udm_search runs an actual UDM query in Chronicle and confirms events exist. This catches the “everything looks fine in metrics but the events are not parseable, or not landing in the right log type” case. Volume in metrics is not the same thing as searchable events in your SIEM. Both can break independently, and you want to know which one.
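In code that check is just a narrow query with an existence test on the end. A sketch assuming a hypothetical thin chronicle client wrapper, which will differ from the app's real one; the query shape itself is plain UDM:

from datetime import datetime, timedelta, timezone

def udm_events_exist(chronicle, feed, lookback_hours=4):
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=lookback_hours)
    # one matching event is enough to prove searchable data is landing
    query = f'metadata.log_type = "{feed.data_type}"'
    events = chronicle.udm_search(query=query, start_time=start, end_time=end, limit=1)
    return len(events) > 0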
If any one of the configured checks fails, the feed is unhealthy and the actions fire.
What Happens When A Feed Fails
If auto-restart is enabled, the app tries to fix it before bothering anyone. Disable the feed, wait, re-enable it, wait, run all the checks again. A surprising number of feed problems are just transient and clear themselves with a kick.
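The restart itself is deliberately boring. A sketch, again with hypothetical wrapper calls, because the shape is the point:

import time

def try_restart(chronicle, feed, settle_seconds=60):
    chronicle.disable_feed(feed.chronicle_feed_id)
    time.sleep(settle_seconds)   # let the platform register the state change
    chronicle.enable_feed(feed.chronicle_feed_id)
    time.sleep(settle_seconds)   # give the feed a chance to transfer
    return run_checks(feed)      # healthy again only if every configured check passes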
If the feed is still broken after the restart, the failure actions run.
The Jira ticket has structure to it. A red banner at the top with the feed name, when it failed, the last time it ingested, and the failure message. A yellow panel if a restart was attempted. A green panel with the Gemini PROBLEM and FIX summary. Then collapsed sections with the full failure detail, the feed metadata, and a per check pass or fail breakdown.
Dedup is on by default. Before creating a ticket, the app searches for an unresolved issue with the same summary. If one exists, it skips. So a feed that has been broken for three days produces one ticket, not seventy-two.
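The dedup itself is one JQL query against Jira's standard search endpoint, roughly like this (exact-phrase quoting simplified; the real app may handle it differently):

import requests

def open_ticket_exists(base_url, auth, project_key, summary):
    jql = (f'project = "{project_key}" AND resolution = Unresolved '
           f'AND summary ~ "\\"{summary}\\""')
    resp = requests.get(f"{base_url}/rest/api/2/search",
                        params={"jql": jql, "maxResults": 1, "fields": "key"},
                        auth=auth, timeout=30)
    resp.raise_for_status()
    return resp.json()["total"] > 0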
The email is plain. Subject, situation, link to the ticket. It goes to whoever is listed as the owner for that feed. If a Jira ticket already exists, the email is suppressed by default so the owner is not getting paged twice.
You can wire this to whoever you want. The feed owner. A SOC distribution list. A team alias. Per feed or per group. The notification target lives in the YAML, not in code.
What The LLM Actually Does Here
This is the part I want to be careful about, because it is where the last post went wrong on purpose.
The LLM does one thing. It reads the failure context (feed name, data type, source type, state, the per check failure messages, low-signal source settings) and writes a short PROBLEM and FIX summary that gets pasted into the Jira ticket. That is the whole job.
It does not call APIs. It does not decide whether to restart the feed. It does not pick who gets the ticket. It does not edit the YAML. It writes a paragraph.
Sensitive identifiers (project ID, customer ID, feed UUIDs, hostnames, S3 URIs, Event Hub endpoints) are redacted before anything is sent to Vertex. Credentials, tokens, and the actual UDM event content never leave your project at all.
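Redaction is plain regex substitution before the prompt is assembled. The patterns below are illustrative, not the app's actual list:

import re

REDACTIONS = [
    # feed UUIDs
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<uuid>"),
    # S3 source URIs
    (re.compile(r"s3://[^\s\"']+"), "<s3-uri>"),
    # Azure Event Hub endpoints
    (re.compile(r"[\w.-]+\.servicebus\.windows\.net[^\s\"']*", re.I), "<event-hub>"),
]

def redact(text):
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text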
If you do not want the LLM in the loop, set investigation.llm.enabled: false. The app still works. The Jira ticket still gets filed. You just lose the friendly summary at the top.
Agentic AI is great. In a lot of cases it is also not needed. This is one of those cases.
The Config Looks Like This
Two YAML files. One for the tunables (committed to git), one for the feeds and secrets (gitignored).
A feed entry looks roughly like this:
- enabled: true
  name: AzureAD_SignIns
  chronicle_feed_id: <uuid>
  dataType: AZURE_AD
  checks:
    - feed_state
    - gcp_metrics
    - udm_search
  actions_on_failure:
    - jira
    - llm
    - email
If a feed is bursty and the default Z score is too sensitive, you override it on that feed alone:
  gcp_metrics_anomaly_threshold: 4.0
  min_expected_records: 100
That is the whole knob you need to turn. No code, no redeploy. Update the YAML, push it back to the GCS bucket, and the next run picks it up.
Every threshold has a global default and a per feed override. So you set sensible defaults once, then loosen or tighten individual feeds as they earn it.
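Resolution order is the obvious one: the per feed value wins, otherwise the global default applies. Roughly (the default values here are illustrative, not the shipped ones):

GLOBAL_DEFAULTS = {
    "gcp_metrics_anomaly_threshold": 3.5,
    "min_expected_records": 0,
}

def threshold(feed_config, key):
    # feed_config is the dict parsed from that feed's YAML entry
    return feed_config.get(key, GLOBAL_DEFAULTS[key])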
Running It
Running it locally takes two commands once your GCP credentials are sorted:
python -m app.sync_feeds # discover feeds from Chronicle
python -m app.main # one monitoring pass
In production it runs as a Cloud Run Job triggered by Cloud Scheduler. The repo has the full deploy walkthrough (service account, IAM roles, Secret Manager bindings, the GCS bucket for the feeds list, the schedule). The first run ships with all outbound actions disabled on purpose. You enable Jira, email, and the global ingestion guardrail one by one once you are happy with what you are seeing.
The default schedule I run is every 30 minutes. That is enough to catch a real outage fast without hammering the Chronicle API.
A Bonus: Daily Volume Guardrail
There is a second thing in the app that is not per feed. It is a project-wide ingestion volume monitor. One Cloud Monitoring call per run that sums the last 24 hours of ingestion bytes and files a ticket when you cross a threshold you set.
This is for the contract conversation. “Alert me when we hit 1 TB a day so we do not blow through the license.” It uses the same Jira and email plumbing, with the same dedup. One ticket per day per breach, not one per run.
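Under the hood that is a single aggregated list_time_series call. A sketch with the google-cloud-monitoring client; the metric type string is an assumption here, so confirm the exact name in Metrics Explorer:

import time
from google.cloud import monitoring_v3

def ingestion_bytes_last_24h(project_id, metric_type):
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 86400}}
    )
    aggregation = monitoring_v3.Aggregation({
        "alignment_period": {"seconds": 86400},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
    })
    series = client.list_time_series(request={
        "name": f"projects/{project_id}",
        "filter": f'metric.type = "{metric_type}"',
        "interval": interval,
        "aggregation": aggregation,
    })
    return sum(p.value.int64_value for ts in series for p in ts.points)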
You can also point it at the quota-rejected bytes metric and set the threshold to 1 MB. That gives you an alert the moment Chronicle drops a single byte for being over quota.
This Is V1. Help Make It Better.
I want to be straight: this is version 1 of something I plan to keep working on. It does the job for me today. There are obvious next steps, and there are probably non-obvious ones I will only learn about from people running it in environments I do not have.
A few things I know are missing:
- Slack notifications. Cut for v1 because the implementation was not finished and I did not want to ship half a feature. Open issue in the repo.
- A per feed restart cooldown, so a permanently broken feed is not disabled and re-enabled every 30 minutes. Right now a sparse schedule papers over this.
- A SOAR integration path for shops that already use Chronicle SOAR for ticketing, instead of going direct to Jira.
- A small Looker Studio dashboard wired to the same checks, so you can see feed health at a glance without opening Jira.
- A Splunk variant. The pattern is the same. Only the data source layer changes.
If you run SecOps and you have an opinion on any of this, open an issue. Push a PR. Tell me what is wrong with the math. I will read it.
I also want this to be a small contribution back to the Google SecOps team and the wider community around it. The native tooling has come a long way, and the building blocks are all there: ingestion notifications, the Health Hub, Cloud Monitoring metrics. What is still on the customer is stitching them into a real “watch every feed, retry, ticket, notify” workflow. If this app helps surface what that workflow looks like in practice, and the SecOps team picks up any of it natively, that is the best possible outcome. I would happily make this obsolete the day SecOps ships the same thing built in.
Get The Code
The repo is here:
github.com/blueaisecurity/secops-feed-health
It is MIT licensed. Clone it, point it at your tenant, run the connection test, and start with one feed. Open an issue if something breaks or feels weird; that is exactly the kind of feedback that makes this better.
If you run Google SecOps and you have ever had a feed go silent on you for longer than you would like to admit, this is for you.
Further Reading
If you have not read them yet, these are the references I leaned on while building this:
- Practical Techniques for Monitoring Your Security Data Pipeline by David French
- Chronicle Forwarder Telemetry via Google Cloud Monitoring by Chris Martin
- Google docs on the Health Hub, Cloud Monitoring ingestion notifications, and silent host monitoring
- The community thread on feed level monitoring and alerting
This app stands on top of all of that. Read those, then come back and decide if this is worth running.
If you build something on top of this, I want to see it.