
Monitoring Xferity — Prometheus Metrics, Health Endpoints, Logs, and Alert Rules

Monitoring in Xferity means tracking service health, transfer activity, and evidence signals across the runtime. The goal is to detect operational problems before a partner escalation is the first sign something failed.

| Layer | Endpoint / mechanism | Auth |
| --- | --- | --- |
| Worker readiness | `GET /health/worker` | none |
| Service health | `GET /health` | required |
| Secret provider health | `GET /health/secrets` | required |
| Certificate health | `GET /health/certificates` | required |
| Prometheus metrics | `GET /metrics` | bearer token |
| Application logs | STDOUT / file (JSON) | file system |
| Per-flow logs | `<log_path>/<flow>.log` | file system |
| Audit events | JSONL file | file system |

/health/worker (unauthenticated)

Returns HTTP 200 when the worker is ready, and HTTP 503 when it is not ready or not running.

Use this endpoint for Docker health checks, Kubernetes readiness probes, and load balancer health checks:

```yaml
# Docker Compose
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health/worker"]
  interval: 15s
  timeout: 5s
  retries: 3
```

```yaml
# Kubernetes
readinessProbe:
  httpGet:
    path: /health/worker
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
```

/health, /health/secrets, /health/certificates (authenticated)


These endpoints require a valid session token or bearer token. Use them for administrative monitoring dashboards.

| Endpoint | Checks |
| --- | --- |
| `/health` | state backend writability, audit path reachability |
| `/health/secrets` | each configured secret provider resolves |
| `/health/certificates` | certificate expiry windows for all active certs |

The metrics endpoint is enabled when the UI is running (xferity ui or xferity run-service with ui.enabled: true).

Authentication uses the same bearer token as the API. Configure a Prometheus scrape job with the admin token:

prometheus.yml:

```yaml
scrape_configs:
  - job_name: "xferity"
    static_configs:
      - targets: ["xferity.internal:8080"]
    bearer_token: "<admin-api-token>"
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/xferity-ca.pem
```

For environments without TLS on the internal scrape path (behind a reverse proxy), use scheme: http and restrict the metrics path to internal networks via firewall rules.
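As a sketch, the same scrape job without TLS might look like the following; the job name, token placeholder, and target are carried over from the example above, and the firewall restriction itself happens outside Prometheus:

```yaml
# Plain-HTTP scrape job for an internal-only metrics path.
# Pair this with firewall rules that block /metrics from outside.
scrape_configs:
  - job_name: "xferity"
    scheme: http
    bearer_token: "<admin-api-token>"
    static_configs:
      - targets: ["xferity.internal:8080"]
```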

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `flow_run_total` | counter | `flow`, `status` | Total flow execution attempts. `status` is `success` or `failed`. |
| `flow_run_duration_seconds` | histogram | `flow` | Flow execution duration in seconds. |
| `file_transfer_total` | counter | `flow`, `outcome` | Files processed. `outcome` is `success`, `failed`, or `skipped`. |
| `file_transfer_bytes_total` | counter | `flow`, `direction` | Bytes transferred. `direction` is `upload` or `download`. |
| `sftp_transfer_failures_total` | counter | `flow` | SFTP-specific transfer failure count. |
| `job_enqueue_total` | counter | `flow` | Jobs enqueued in the PostgreSQL job queue. |
| `job_complete_total` | counter | `flow`, `status` | Jobs completed. `status` is `success`, `failed`, or `retried`. |
| `job_queue_depth` | gauge | (none) | Current number of pending/running jobs in the queue. |
| `lock_wait_seconds` | histogram | `flow` | Time spent waiting for a flow lock. |
| `mft_auth_failures_total` | counter | `channel` | Authentication failure count. `channel` is `local` or `oidc`. |
| `mft_rate_limit_denied_total` | counter | `scope` | Requests denied by the rate limiter. `scope` is `global` or `as2_partner`. |
| `mft_oidc_login_states_current` | gauge | (none) | Current number of pending OIDC login state objects (cap: 5000). |
| `mft_certificates_expired_current` | gauge | (none) | Number of currently expired active certificates. |
| `mft_certificates_expiring_soon_current` | gauge | (none) | Number of active certificates within the expiry warning window. |
```promql
# Flow failure rate over 10 minutes (per flow)
sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
/
clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)

# Is the job queue growing?
delta(job_queue_depth[5m])

# Transfer throughput (bytes/sec)
sum by (flow) (rate(file_transfer_bytes_total[5m]))

# Auth failures (all channels, last 5m)
sum(increase(mft_auth_failures_total[5m]))
```
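If the failure-ratio query feeds dashboards as well as alerts, it can be precomputed as a Prometheus recording rule. A sketch, where the rule name is a suggestion rather than an Xferity convention:

```yaml
groups:
  - name: xferity.recording
    rules:
      # Precomputed per-flow failure ratio over 10 minutes
      - record: xferity:file_transfer_failure_ratio:10m
        expr: |
          sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
          /
          clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)
```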

Save to prometheus/rules/xferity.yml and reference it from your Prometheus config with rule_files.
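Assuming the path above, the reference in the Prometheus config would look like:

```yaml
# prometheus.yml: load the saved alert rules
rule_files:
  - prometheus/rules/xferity.yml
```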

```yaml
groups:
  - name: xferity.rules
    rules:
      - alert: XferityHighTransferFailureRate
        expr: increase(sftp_transfer_failures_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High transfer failures in last 5 minutes"
          description: "SFTP transfer failures exceeded threshold for flow {{ $labels.flow }}"

      - alert: XferityFileTransferFailureRatioHigh
        expr: |
          (
            sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
            /
            clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)
          ) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "File transfer failure ratio is high"
          description: "Failure ratio exceeded 20% for flow {{ $labels.flow }} in the last 10 minutes"

      - alert: XferityLockWaitHigh
        expr: |
          (
            increase(lock_wait_seconds_sum[10m])
            /
            clamp_min(increase(lock_wait_seconds_count[10m]), 1)
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average lock wait is high"
          description: "Average lock wait exceeded 30s in the last 10 minutes for flow {{ $labels.flow }}"

      - alert: XferityNoSuccessfulRuns
        expr: increase(flow_run_total{status="success"}[30m]) == 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "No successful flow runs in the last 30 minutes"
          description: "Flow {{ $labels.flow }} has had no successful runs in the last 30 minutes"

      - alert: XferityAuthFailuresHigh
        expr: increase(mft_auth_failures_total[5m]) > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Authentication failures are elevated"
          description: "Authentication failures exceeded 20 over 5 minutes (channel={{ $labels.channel }})"

      - alert: XferityRateLimitDenialsHigh
        expr: increase(mft_rate_limit_denied_total[5m]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate-limit denials are elevated"
          description: "Rate-limit denials exceeded 50 over 5 minutes (scope={{ $labels.scope }})"

      - alert: XferityOIDCLoginStatesNearCapacity
        expr: mft_oidc_login_states_current > 4000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "OIDC pending login states nearing cap"
          description: "Pending OIDC login states are above 4000 and approaching the 5000 capacity limit"

      - alert: XferityCertificatesExpired
        expr: mft_certificates_expired_current > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Expired certificates detected"
          description: "There are {{ $value }} expired active certificate(s). Transfer flows that depend on them will fail."

      - alert: XferityCertificatesExpiringSoon
        expr: mft_certificates_expiring_soon_current > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Certificates expiring soon"
          description: "There are {{ $value }} active certificate(s) expiring within the configured warning window"
```
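These rules label alerts with `severity: warning` or `severity: critical`, so an Alertmanager route can fan them out to different receivers. A minimal sketch, where the receiver names are placeholders for your own integrations:

```yaml
# alertmanager.yml sketch: receiver names are placeholders
route:
  receiver: default
  routes:
    - matchers: ['severity = "critical"']
      receiver: oncall-pager
    - matchers: ['severity = "warning"']
      receiver: ops-chat
receivers:
  - name: default
  - name: oncall-pager
  - name: ops-chat
```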

```yaml
logging:
  level: info                          # debug, info, warn, error
  format: json                         # json or console
  output: stdout                       # stdout, file, or both
  path: /var/log/xferity/xferity.log   # when output is file or both
  max_size_mb: 100                     # rotate when log exceeds this size (MB)
  max_backups: 7                       # number of rotated log files to keep
  max_age_days: 30                     # delete rotated logs older than this
  compress: true                       # gzip rotated log files
```

When per_flow: true, Xferity writes a separate log file per flow in addition to the main log:

```yaml
logging:
  output: both
  path: /var/log/xferity/xferity.log
  per_flow: true
  per_flow_path: /var/log/xferity/flows   # directory for per-flow logs
```

Per-flow log files are named <flow-name>.log inside per_flow_path. They contain only log lines attributed to that flow. This makes per-flow investigation much faster when the main log is high volume.

Log rotation is built in using max_size_mb, max_backups, max_age_days, and compress. When the active log file exceeds max_size_mb, it is rotated synchronously before the next write. Old files are pruned when they exceed max_backups count or max_age_days age.

For systems using logrotate or external rotation, set output: stdout and let the container/OS handle rotation.
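When the container runtime owns rotation, the built-in limits can be mirrored in the log driver instead. A Docker Compose sketch using standard `json-file` driver options (these are Docker settings, not Xferity settings):

```yaml
services:
  xferity:
    logging:
      driver: json-file
      options:
        max-size: "100m"   # roughly mirrors max_size_mb: 100
        max-file: "7"      # roughly mirrors max_backups: 7
```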

With format: json (recommended for production), each log line is a JSON object:

```json
{"ts":"2026-03-18T10:00:00.123Z","level":"info","msg":"file uploaded","flow":"payroll-upload","run_id":"run-abc123","file":"payroll-2026-03.xml","outcome":"success","bytes":451234,"duration_ms":342}
```

With format: console, logs are human-readable:

```text
2026-03-18T10:00:00.123Z INFO file uploaded flow=payroll-upload file=payroll-2026-03.xml outcome=success
```

Use format: console in development. Use format: json in production for log aggregation pipelines (Loki, Elasticsearch, Splunk, CloudWatch).
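As one example of such a pipeline, a Promtail scrape job can parse the JSON fields into Loki labels. The label choices here are illustrative; high-cardinality fields like `run_id` are deliberately left unlabeled:

```yaml
# Promtail sketch for shipping Xferity JSON logs to Loki
scrape_configs:
  - job_name: xferity
    static_configs:
      - targets: [localhost]
        labels:
          job: xferity
          __path__: /var/log/xferity/xferity.log
    pipeline_stages:
      # Extract fields from each JSON log line
      - json:
          expressions:
            level: level
            flow: flow
      # Promote the extracted fields to Loki labels
      - labels:
          level:
          flow:
```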


For a production deployment:

| Component | Purpose | How |
| --- | --- | --- |
| Process liveness | Detect crashes | `/health/worker` probe every 15s |
| Prometheus scrape | Metrics trends and alerting | Scrape `/metrics` every 30s |
| Alert rules | Automated failure detection | Deploy the alert rule YAML above |
| Log aggregation | Root-cause investigation | Ship JSON logs to Loki/Elasticsearch |
| Audit export | Compliance evidence | Export `audit.jsonl` to cold storage on schedule |
| Certificate watch | Expiry prevention | `XferityCertificatesExpiringSoon` alert |