
Monitoring Xferity — Prometheus Metrics, Health Endpoints, Logs, and Alert Rules

Monitoring in Xferity means tracking service health, transfer activity, and evidence signals across the runtime. The goal is to detect operational problems before a partner escalation is the first sign something failed.

| Layer | Endpoint / mechanism | Auth |
| --- | --- | --- |
| Worker readiness | `GET /health/worker` | none |
| Service health | `GET /health` | required |
| Secret provider health | `GET /health/secrets` | required |
| Certificate health | `GET /health/certificates` | required |
| Prometheus metrics | `GET /metrics` | bearer token |
| Application logs | STDOUT / file (JSON) | file system |
| Per-flow logs | `<log_path>/<flow>.log` | file system |
| Audit events | JSONL file | file system |

/health/worker (unauthenticated)

Returns HTTP 200 when the worker is ready, and HTTP 503 when it is not ready or not running.

Use this endpoint for Docker health checks, Kubernetes readiness probes, and load balancer health checks:

```yaml
# Docker Compose
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health/worker"]
  interval: 15s
  timeout: 5s
  retries: 3
```

```yaml
# Kubernetes
readinessProbe:
  httpGet:
    path: /health/worker
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
```

/health, /health/secrets, /health/certificates (authenticated)


These endpoints require a valid session token or bearer token. Use them for administrative monitoring dashboards.

| Endpoint | Checks |
| --- | --- |
| `/health` | state backend writability, audit path reachability |
| `/health/secrets` | each configured secret provider resolves |
| `/health/certificates` | certificate expiry windows for all active certs |

The metrics endpoint is enabled when the UI is running (xferity ui or xferity run-service with ui.enabled: true).

Authentication uses the same bearer token as the API. Configure a Prometheus scrape job with the admin token:

prometheus.yml:

```yaml
scrape_configs:
  - job_name: "xferity"
    static_configs:
      - targets: ["xferity.internal:8080"]
    bearer_token: "<admin-api-token>"
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/xferity-ca.pem
```

For environments without TLS on the internal scrape path (behind a reverse proxy), use scheme: http and restrict the metrics path to internal networks via firewall rules.
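As a sketch, the same scrape job without TLS might look like the following; the job name, token placeholder, and target are carried over from the example above, and the firewall restriction itself happens outside Prometheus:

```yaml
# Plain-HTTP scrape job for an internal-only metrics path.
# Pair this with firewall rules that block /metrics from outside.
scrape_configs:
  - job_name: "xferity"
    scheme: http
    bearer_token: "<admin-api-token>"
    static_configs:
      - targets: ["xferity.internal:8080"]
```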

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `flow_run_total` | counter | `flow`, `status` | Total flow execution attempts. `status` is `success` or `failed`. |
| `flow_run_duration_seconds` | histogram | `flow` | Flow execution duration in seconds. |
| `file_transfer_total` | counter | `flow`, `outcome` | Files processed. `outcome` is `success`, `failed`, or `skipped`. |
| `file_transfer_bytes_total` | counter | `flow`, `direction` | Bytes transferred. `direction` is `upload` or `download`. |
| `sftp_transfer_failures_total` | counter | `flow` | SFTP-specific transfer failure count. |
| `job_enqueue_total` | counter | `flow` | Jobs enqueued in the PostgreSQL job queue. |
| `job_complete_total` | counter | `flow`, `status` | Jobs completed. `status` is `success`, `failed`, or `retried`. |
| `job_queue_depth` | gauge | (none) | Current number of pending/running jobs in the queue. |
| `lock_wait_seconds` | histogram | `flow` | Time spent waiting for a flow lock. |
| `mft_auth_failures_total` | counter | `channel` | Authentication failure count. `channel` is `local` or `oidc`. |
| `mft_rate_limit_denied_total` | counter | `scope` | Requests denied by the rate limiter. `scope` is `global` or `as2_partner`. |
| `mft_oidc_login_states_current` | gauge | (none) | Current number of pending OIDC login state objects (cap: 5000). |
| `mft_certificates_expired_current` | gauge | (none) | Number of currently expired active certificates. |
| `mft_certificates_expiring_soon_current` | gauge | (none) | Number of active certificates within the expiry warning window. |
```promql
# Flow failure rate over 10 minutes (per flow)
sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
/
clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)

# Is the job queue growing?
delta(job_queue_depth[5m])

# Transfer throughput (bytes/sec)
sum by (flow) (rate(file_transfer_bytes_total[5m]))

# Auth failures (all channels, last 5m)
sum(increase(mft_auth_failures_total[5m]))
```
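If the failure-ratio query feeds dashboards as well as alerts, it can be precomputed as a Prometheus recording rule. A sketch, where the rule name is a suggestion rather than an Xferity convention:

```yaml
groups:
  - name: xferity.recording
    rules:
      # Precomputed per-flow failure ratio over 10 minutes
      - record: xferity:file_transfer_failure_ratio:10m
        expr: |
          sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
          /
          clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)
```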

Save to prometheus/rules/xferity.yml and reference it from your Prometheus config with rule_files.
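Assuming the path above, the reference in the Prometheus config would look like:

```yaml
# prometheus.yml: load the saved alert rules
rule_files:
  - prometheus/rules/xferity.yml
```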

```yaml
groups:
  - name: xferity.rules
    rules:
      - alert: XferityHighTransferFailureRate
        expr: increase(sftp_transfer_failures_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High transfer failures in last 5 minutes"
          description: "SFTP transfer failures exceeded threshold for flow {{ $labels.flow }}"

      - alert: XferityFileTransferFailureRatioHigh
        expr: |
          (
            sum by (flow) (increase(file_transfer_total{outcome="failed"}[10m]))
            /
            clamp_min(sum by (flow) (increase(file_transfer_total[10m])), 1)
          ) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "File transfer failure ratio is high"
          description: "Failure ratio exceeded 20% for flow {{ $labels.flow }} in the last 10 minutes"

      - alert: XferityLockWaitHigh
        expr: |
          (
            increase(lock_wait_seconds_sum[10m])
            /
            clamp_min(increase(lock_wait_seconds_count[10m]), 1)
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average lock wait is high"
          description: "Average lock wait exceeded 30s in the last 10 minutes for flow {{ $labels.flow }}"

      - alert: XferityNoSuccessfulRuns
        expr: increase(flow_run_total{status="success"}[30m]) == 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "No successful flow runs in the last 30 minutes"
          description: "Flow {{ $labels.flow }} has had no successful runs in the last 30 minutes"

      - alert: XferityAuthFailuresHigh
        expr: increase(mft_auth_failures_total[5m]) > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Authentication failures are elevated"
          description: "Authentication failures exceeded 20 over 5 minutes (channel={{ $labels.channel }})"

      - alert: XferityRateLimitDenialsHigh
        expr: increase(mft_rate_limit_denied_total[5m]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate-limit denials are elevated"
          description: "Rate-limit denials exceeded 50 over 5 minutes (scope={{ $labels.scope }})"

      - alert: XferityOIDCLoginStatesNearCapacity
        expr: mft_oidc_login_states_current > 4000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "OIDC pending login states nearing cap"
          description: "Pending OIDC login states are above 4000 and approaching the 5000 capacity limit"

      - alert: XferityCertificatesExpired
        expr: mft_certificates_expired_current > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Expired certificates detected"
          description: "There are {{ $value }} expired active certificate(s). Transfer flows that depend on them will fail."

      - alert: XferityCertificatesExpiringSoon
        expr: mft_certificates_expiring_soon_current > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Certificates expiring soon"
          description: "There are {{ $value }} active certificate(s) expiring within the configured warning window"
```
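These rules label alerts with `severity: warning` or `severity: critical`, so an Alertmanager route can fan them out to different receivers. A minimal sketch, where the receiver names are placeholders for your own integrations:

```yaml
# alertmanager.yml sketch: receiver names are placeholders
route:
  receiver: default
  routes:
    - matchers: ['severity = "critical"']
      receiver: oncall-pager
    - matchers: ['severity = "warning"']
      receiver: ops-chat
receivers:
  - name: default
  - name: oncall-pager
  - name: ops-chat
```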

```yaml
logging:
  level: info                          # debug, info, warn, error
  format: json                         # json or console
  output: stdout                       # stdout, file, or both
  path: /var/log/xferity/xferity.log   # when output is file or both
  max_size_mb: 100                     # rotate when log exceeds this size (MB)
  max_backups: 7                       # number of rotated log files to keep
  max_age_days: 30                     # delete rotated logs older than this
  compress: true                       # gzip rotated log files
```

When per_flow: true, Xferity writes a separate log file per flow in addition to the main log:

```yaml
logging:
  output: both
  path: /var/log/xferity/xferity.log
  per_flow: true
  per_flow_path: /var/log/xferity/flows   # directory for per-flow logs
```

Per-flow log files are named <flow-name>.log inside per_flow_path. They contain only log lines attributed to that flow. This makes per-flow investigation much faster when the main log is high volume.

Log rotation is built in using max_size_mb, max_backups, max_age_days, and compress. When the active log file exceeds max_size_mb, it is rotated synchronously before the next write. Old files are pruned when they exceed max_backups count or max_age_days age.

For systems using logrotate or external rotation, set output: stdout and let the container/OS handle rotation.
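When the container runtime owns rotation, the built-in limits can be mirrored in the log driver instead. A Docker Compose sketch using standard `json-file` driver options (these are Docker settings, not Xferity settings):

```yaml
services:
  xferity:
    logging:
      driver: json-file
      options:
        max-size: "100m"   # roughly mirrors max_size_mb: 100
        max-file: "7"      # roughly mirrors max_backups: 7
```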

With format: json (recommended for production), each log line is a JSON object:

```json
{"ts":"2026-03-18T10:00:00.123Z","level":"info","msg":"file uploaded","flow":"payroll-upload","run_id":"run-abc123","file":"payroll-2026-03.xml","outcome":"success","bytes":451234,"duration_ms":342}
```

With format: console, logs are human-readable:

```text
2026-03-18T10:00:00.123Z INFO file uploaded flow=payroll-upload file=payroll-2026-03.xml outcome=success
```

Use format: console in development. Use format: json in production for log aggregation pipelines (Loki, Elasticsearch, Splunk, CloudWatch).
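As one example of such a pipeline, a Promtail scrape job can parse the JSON fields into Loki labels. The label choices here are illustrative; high-cardinality fields like `run_id` are deliberately left unlabeled:

```yaml
# Promtail sketch for shipping Xferity JSON logs to Loki
scrape_configs:
  - job_name: xferity
    static_configs:
      - targets: [localhost]
        labels:
          job: xferity
          __path__: /var/log/xferity/xferity.log
    pipeline_stages:
      # Extract fields from each JSON log line
      - json:
          expressions:
            level: level
            flow: flow
      # Promote the extracted fields to Loki labels
      - labels:
          level:
          flow:
```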


For a production deployment:

| Component | Purpose | How |
| --- | --- | --- |
| Process liveness | Detect crashes | `/health/worker` probe every 15s |
| Prometheus scrape | Metrics trends and alerting | Scrape `/metrics` every 30s |
| Alert rules | Automated failure detection | Deploy the alert rule YAML above |
| Log aggregation | Root-cause investigation | Ship JSON logs to Loki/Elasticsearch |
| Audit export | Compliance evidence | Export `audit.jsonl` to cold storage on schedule |
| Certificate watch | Expiry prevention | `XferityCertificatesExpiringSoon` alert |