# Monitoring Xferity — Prometheus Metrics, Health Endpoints, Logs, and Alert Rules

Monitoring in Xferity means tracking service health, transfer activity, and evidence signals across the runtime. The goal is to detect operational problems before a partner escalation is the first sign that something failed.
## Monitoring at a glance

| Layer | Endpoint / mechanism | Auth |
|---|---|---|
| Worker readiness | GET /health/worker | none |
| Service health | GET /health | required |
| Secret provider health | GET /health/secrets | required |
| Certificate health | GET /health/certificates | required |
| Prometheus metrics | GET /metrics | bearer token |
| Application logs | STDOUT / file (JSON) | file system |
| Per-flow logs | <log_path>/<flow>.log | file system |
| Audit events | JSONL file | file system |
## Health check endpoints

### /health/worker (unauthenticated)

Returns HTTP 200 when the worker is ready, and HTTP 503 when it is not ready or not running.
Use for Docker health checks, Kubernetes readiness probes, and load balancer health checks:
```yaml
# Docker Compose
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health/worker"]
  interval: 15s
  timeout: 5s
  retries: 3
```

```yaml
# Kubernetes
readinessProbe:
  httpGet:
    path: /health/worker
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
```

### /health, /health/secrets, /health/certificates (authenticated)
These endpoints require a valid session token or bearer token. Use them for administrative monitoring dashboards.
| Endpoint | Checks |
|---|---|
| /health | state backend writability, audit path reachability |
| /health/secrets | each configured secret provider resolves |
| /health/certificates | certificate expiry windows for all active certs |
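These endpoint checks can feed a simple dashboard-side poller. A minimal Python sketch, assuming a bearer token is already at hand — the `overall_status()` policy (the core `/health` check outranks the secrets and certificates checks) is an illustrative assumption, not documented Xferity behavior:

```python
# Sketch of a dashboard-side poller for the authenticated health endpoints.
# The endpoint paths come from the table above; the token handling and the
# overall_status() policy are illustrative assumptions, not Xferity APIs.
import urllib.error
import urllib.request

ENDPOINTS = ["/health", "/health/secrets", "/health/certificates"]

def overall_status(results: dict) -> str:
    """Collapse per-endpoint HTTP status codes into one dashboard state."""
    if all(code == 200 for code in results.values()):
        return "healthy"
    if results.get("/health", 0) != 200:
        return "down"        # core service check is failing
    return "degraded"        # secrets or certificates check is failing

def poll(base_url: str, token: str) -> dict:
    """Hit each endpoint with a bearer token; 0 means unreachable."""
    results = {}
    for path in ENDPOINTS:
        req = urllib.request.Request(
            base_url + path,
            headers={"Authorization": f"Bearer {token}"},
        )
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as e:
            results[path] = e.code
        except OSError:
            results[path] = 0
    return results
```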
## Prometheus metrics

### Enabling the /metrics endpoint

The metrics endpoint is enabled when the UI is running (`xferity ui`, or `xferity run-service` with `ui.enabled: true`).
Authentication uses the same bearer token as the API. Configure a Prometheus scrape job with the admin token:
```yaml
scrape_configs:
  - job_name: "xferity"
    static_configs:
      - targets: ["xferity.internal:8080"]
    bearer_token: "<admin-api-token>"
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/xferity-ca.pem
```

For environments without TLS on the internal scrape path (for example, behind a reverse proxy), use `scheme: http` and restrict the metrics path to internal networks via firewall rules.
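What the scrape job actually receives is Prometheus's text exposition format. A minimal parsing sketch — the sample payload below is invented for illustration, and the parser deliberately skips edge cases (escaped quotes, commas inside label values) that a real client library handles:

```python
# Sketch: parsing the Prometheus text exposition format returned by /metrics.
# The SAMPLE payload is invented; real metric names and label sets are listed
# in the metric reference. A production client should use an official parser.
def parse_exposition(text: str) -> list:
    """Return (name, labels, value) samples; ignores HELP/TYPE comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        metric, value = line.rsplit(" ", 1)
        labels = {}
        if "{" in metric:
            name, raw = metric[:-1].split("{", 1)   # drop trailing "}"
            for pair in raw.split(","):
                k, v = pair.split("=", 1)
                labels[k] = v.strip('"')
        else:
            name = metric
        samples.append((name, labels, float(value)))
    return samples

SAMPLE = """\
# TYPE flow_run_total counter
flow_run_total{flow="payroll-upload",status="success"} 42
flow_run_total{flow="payroll-upload",status="failed"} 3
job_queue_depth 7
"""
```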
### Metric reference

| Metric | Type | Labels | Description |
|---|---|---|---|
| flow_run_total | counter | flow, status | Total flow execution attempts. status is success or failed. |
| flow_run_duration_seconds | histogram | flow | Flow execution duration in seconds. |
| file_transfer_total | counter | flow, outcome | Files processed. outcome is success, failed, or skipped. |
| file_transfer_bytes_total | counter | flow, direction | Bytes transferred. direction is upload or download. |
| sftp_transfer_failures_total | counter | flow | SFTP-specific transfer failure count. |
| job_enqueue_total | counter | flow | Jobs enqueued in the PostgreSQL job queue. |
| job_complete_total | counter | flow, status | Jobs completed. status is success, failed, or retried. |
| job_queue_depth | gauge | — | Current number of pending/running jobs in the queue. |
| lock_wait_seconds | histogram | flow | Time spent waiting for a flow lock. |
| mft_auth_failures_total | counter | channel | Authentication failure count. channel is local or oidc. |
| mft_rate_limit_denied_total | counter | scope | Requests denied by the rate limiter. scope is global or as2_partner. |
| mft_oidc_login_states_current | gauge | — | Current number of pending OIDC login state objects (cap: 5000). |
| mft_certificates_expired_current | gauge | — | Number of currently expired active certificates. |
| mft_certificates_expiring_soon_current | gauge | — | Number of active certificates within the expiry warning window. |
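As a sanity check on how failure alerting reads these counters, here is the failure-ratio computation in plain Python. The window values are invented; in production the equivalent is a PromQL expression over increases of `file_transfer_total`:

```python
# Sketch of the failure-ratio math these counters support, in plain Python.
# The window values are invented sample data, not real measurements.
def failure_ratio(increases: dict) -> float:
    """increases maps outcome -> counter increase over the query window."""
    total = sum(increases.values())
    failed = increases.get("failed", 0.0)
    return failed / max(total, 1.0)   # clamp_min-style guard against /0

# Example window: 42 files processed, 3 of them failed.
window = {"success": 37.0, "failed": 3.0, "skipped": 2.0}
```

The `max(total, 1.0)` guard mirrors the `clamp_min(..., 1)` idiom used in PromQL to avoid division by zero when no files moved during the window.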
### Useful PromQL queries

```promql
# Flow failure rate over 10 minutes (per flow)
sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
/
clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)

# Job queue — is it growing?
delta(job_queue_depth[5m])

# Transfer throughput (bytes/sec)
sum by (flow) (rate(file_transfer_bytes_total[5m]))

# Auth failures (all channels, last 5m)
sum(increase(mft_auth_failures_total[5m]))
```

## Prometheus alert rules
Save the rules below to prometheus/rules/xferity.yml and reference the file from your Prometheus config with rule_files.
```yaml
groups:
  - name: xferity.rules
    rules:
      - alert: XferityHighTransferFailureRate
        expr: increase(sftp_transfer_failures_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High transfer failures in last 5 minutes"
          description: "SFTP transfer failures exceeded threshold for flow {{ $labels.flow }}"

      - alert: XferityFileTransferFailureRatioHigh
        expr: |
          (
            sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
            /
            clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)
          ) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "File transfer failure ratio is high"
          description: "Failure ratio exceeded 20% for flow {{ $labels.flow }} in the last 10 minutes"

      - alert: XferityLockWaitHigh
        expr: |
          (
            increase(lock_wait_seconds_sum[10m])
            /
            clamp_min(increase(lock_wait_seconds_count[10m]), 1)
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average lock wait is high"
          description: "Average lock wait exceeded 30s in the last 10 minutes for flow {{ $labels.flow }}"

      - alert: XferityNoSuccessfulRuns
        expr: increase(flow_run_total{status="success"}[30m]) == 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "No successful flow runs in the last 30 minutes"
          description: "Flow {{ $labels.flow }} has had no successful runs in the last 30 minutes"

      - alert: XferityAuthFailuresHigh
        expr: increase(mft_auth_failures_total[5m]) > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Authentication failures are elevated"
          description: "Authentication failures exceeded 20 over 5 minutes (channel={{ $labels.channel }})"

      - alert: XferityRateLimitDenialsHigh
        expr: increase(mft_rate_limit_denied_total[5m]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate-limit denials are elevated"
          description: "Rate-limit denials exceeded 50 over 5 minutes (scope={{ $labels.scope }})"

      - alert: XferityOIDCLoginStatesNearCapacity
        expr: mft_oidc_login_states_current > 4000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "OIDC pending login states nearing cap"
          description: "Pending OIDC login states are above 4000 and approaching the 5000 capacity limit"

      - alert: XferityCertificatesExpired
        expr: mft_certificates_expired_current > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Expired certificates detected"
          description: "There are {{ $value }} expired active certificate(s). Transfer flows that depend on them will fail."

      - alert: XferityCertificatesExpiringSoon
        expr: mft_certificates_expiring_soon_current > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Certificates expiring soon"
          description: "There are {{ $value }} active certificate(s) expiring within the configured warning window"
```

## Logging
### Log output configuration

```yaml
logging:
  level: info                          # debug, info, warn, error
  format: json                         # json or console
  output: stdout                       # stdout, file, or both
  path: /var/log/xferity/xferity.log   # when output is file or both
  max_size_mb: 100                     # rotate when log exceeds this size (MB)
  max_backups: 7                       # number of rotated log files to keep
  max_age_days: 30                     # delete rotated logs older than this
  compress: true                       # gzip rotated log files
```

### Per-flow log files
When `per_flow: true` is set, Xferity writes a separate log file per flow in addition to the main log:
```yaml
logging:
  output: both
  path: /var/log/xferity/xferity.log
  per_flow: true
  per_flow_path: /var/log/xferity/flows   # directory for per-flow logs
```

Per-flow log files are named `<flow-name>.log` inside `per_flow_path`. They contain only log lines attributed to that flow, which makes per-flow investigation much faster when the main log is high volume.
### Log rotation

Log rotation is built in, driven by `max_size_mb`, `max_backups`, `max_age_days`, and `compress`. When the active log file exceeds `max_size_mb`, it is rotated synchronously before the next write. Old files are pruned once they exceed the `max_backups` count or the `max_age_days` age.
For systems using logrotate or external rotation, set output: stdout and let the container/OS handle rotation.
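The pruning rule can be sketched as a pure function. File names, the `(path, mtime)` tuple shape, and the exact evaluation order are assumptions for illustration, not Xferity's actual implementation:

```python
# Sketch of the pruning rule described above: keep at most max_backups
# rotated files and drop anything older than max_age_days. Inputs and
# shapes here are illustrative, not Xferity internals.
def prune(rotated: list, max_backups: int, max_age_days: int, now: float) -> list:
    """rotated holds (path, mtime_epoch_seconds) pairs; returns paths to delete."""
    cutoff = now - max_age_days * 86400          # age limit as an epoch timestamp
    newest_first = sorted(rotated, key=lambda f: f[1], reverse=True)
    delete = []
    for i, (path, mtime) in enumerate(newest_first):
        # Delete when the file falls outside the retained count OR is too old.
        if i >= max_backups or mtime < cutoff:
            delete.append(path)
    return delete
```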
### Log format

With `format: json` (recommended for production), each log line is a JSON object:

```json
{"ts":"2026-03-18T10:00:00.123Z","level":"info","msg":"file uploaded","flow":"payroll-upload","run_id":"run-abc123","file":"payroll-2026-03.xml","outcome":"success","bytes":451234,"duration_ms":342}
```

With `format: console`, logs are human-readable:
```text
2026-03-18T10:00:00.123Z INFO file uploaded flow=payroll-upload file=payroll-2026-03.xml outcome=success
```

Use `format: console` in development and `format: json` in production for log aggregation pipelines (Loki, Elasticsearch, Splunk, CloudWatch).
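Because each JSON-format line is a standalone object, ad-hoc investigation scripts stay short. A sketch that counts failed transfers per flow — the field names mirror the JSON log example; the helper itself and its sample lines are illustrative:

```python
# Sketch: counting failed transfers per flow from JSON log lines. Field
# names mirror the JSON log example; the sample LOGS are invented.
import json
from collections import Counter

def failed_by_flow(lines: list) -> Counter:
    counts = Counter()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue   # tolerate non-JSON noise in mixed streams
        if rec.get("outcome") == "failed":
            counts[rec.get("flow", "unknown")] += 1
    return counts

LOGS = [
    '{"ts":"2026-03-18T10:00:00Z","flow":"payroll-upload","outcome":"success"}',
    '{"ts":"2026-03-18T10:01:00Z","flow":"payroll-upload","outcome":"failed"}',
    '{"ts":"2026-03-18T10:02:00Z","flow":"invoices-pull","outcome":"failed"}',
]
```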
## Recommended minimum monitoring stack

For a production deployment:

| Component | Purpose | How |
|---|---|---|
| Process liveness | Detect crashes | /health/worker probe every 15s |
| Prometheus scrape | Metrics trends and alerting | Scrape /metrics every 30s |
| Alert rules | Automated failure detection | Deploy the alert rule YAML above |
| Log aggregation | Root-cause investigation | Ship JSON logs to Loki/Elasticsearch |
| Audit export | Compliance evidence | Export audit.jsonl to cold storage on schedule |
| Certificate watch | Expiry prevention | XferityCertificatesExpiringSoon alert |