Retry and Recovery

Xferity executes transfers as durable jobs. Failed jobs can be recovered without reprocessing completed steps.

There are two distinct recovery mechanisms:

  1. Automatic retry — transient failures are retried automatically with exponential backoff
  2. Manual resume — interrupted flows can be re-entered from their last committed state via xferity resume

Xferity distinguishes between transient errors (retryable) and permanent errors (not retried).

Transient errors — retried automatically

These error conditions trigger a retry:

  • Connection reset by peer
  • Broken pipe
  • Connection refused
  • Connection aborted / closed
  • Timeout / timed out
  • I/O timeout
  • Handshake timeout
  • Temporary failure
  • Unexpected EOF

Network errors (net.Error with Timeout()) are always classified as transient.

Permanent errors — never retried

These error conditions are never retried, because retrying would not help:

  • Permission denied
  • No such file or directory
  • File not found
  • Host key fingerprint mismatch
  • Host key verification failed
  • Unable to authenticate
  • Authentication failed
  • Invalid credentials
  • Unsupported operation

If a permanent error occurs, the job fails immediately without exhausting its retry attempts.
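The classification above can be sketched as message matching. This is a minimal illustration built from the error strings listed in this section, not Xferity's actual implementation; the marker lists and the `is_transient` function are hypothetical names:

```python
# Illustrative sketch only — marker lists are taken from the docs above,
# not from Xferity's source code.
TRANSIENT_MARKERS = [
    "connection reset by peer", "broken pipe", "connection refused",
    "connection aborted", "connection closed", "timeout", "timed out",
    "i/o timeout", "handshake timeout", "temporary failure", "unexpected eof",
]
PERMANENT_MARKERS = [
    "permission denied", "no such file or directory", "file not found",
    "host key fingerprint mismatch", "host key verification failed",
    "unable to authenticate", "authentication failed", "invalid credentials",
    "unsupported operation",
]

def is_transient(message: str) -> bool:
    msg = message.lower()
    if any(marker in msg for marker in PERMANENT_MARKERS):
        return False  # permanent — fail immediately, don't retry
    return any(marker in msg for marker in TRANSIENT_MARKERS)
```

Note that permanent markers win over transient ones, matching the rule that a permanent error fails the job immediately.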

Retry behavior is configured at the worker level (applies to all jobs) and per-flow for SFTP:

Worker-level retry (PostgreSQL job queue):

worker:
  max_attempts: 3            # default: 3
  retry_backoff_base: 2s     # default: 2s — first retry delay
  retry_backoff_cap: 60s     # default: 60s — maximum delay between retries
  job_execution_timeout: 5m  # default: 5m — kill stalled jobs

SFTP per-flow retry:

sftp:
  retry:
    max_attempts: 5        # default: 5
    initial_delay_ms: 500  # default: 500ms
    max_delay_ms: 5000     # default: 5000ms

Retries use exponential backoff with jitter:

  1. Start with initial_delay_ms (or retry_backoff_base)
  2. Each retry doubles the delay
  3. Add a random jitter of up to 50% of the current delay
  4. Cap at max_delay_ms (or retry_backoff_cap)

Example with 3 attempts, 2s base, 60s cap:

  • Attempt 1 fails → wait ~2s ± jitter
  • Attempt 2 fails → wait ~4s ± jitter
  • Attempt 3 fails → job marked failed

The jitter prevents retry storms when multiple jobs fail simultaneously.
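The four backoff steps above can be modeled in a few lines. This is a sketch of the schedule as documented (doubling, up to 50% jitter, then cap), using the worker-level defaults; the function name is hypothetical:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry `attempt` (1-based) — illustrative only."""
    delay = base * (2 ** (attempt - 1))      # start at base, double each retry
    delay += random.uniform(0, delay * 0.5)  # add up to 50% jitter
    return min(delay, cap)                   # never exceed the cap
```

With the defaults, the first delay lands between 2s and 3s, the second between 4s and 6s, and large attempt numbers are clamped to the 60s cap.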

Manual resume

xferity resume re-enters a flow from its last committed state. It is safe to run at any time — it will not double-process files that have already been transferred.

# Resume all flows with incomplete state
xferity resume
# Resume a specific flow
xferity resume payroll-sftp-upload
# Dry-run — show what would be resumed without transferring
xferity resume payroll-sftp-upload --dry-run

Xferity tracks every successfully processed file by its SHA-256 content hash (when idempotency_mode: hash, the default). On resume:

  1. The flow is re-entered from the beginning
  2. Files whose SHA-256 hash is already in the processed set are skipped
  3. Only files not yet successfully transferred are processed
  4. Normal retry and commit behavior applies to the remaining files

This means resume is safe even if the failure happened in the middle of a directory with hundreds of files — only the unprocessed files are transferred.
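The resume-time skip can be pictured as filtering a directory against the set of committed hashes. This is an illustrative sketch of the behavior described in steps 1–3, not Xferity's code; `files_to_process` is a hypothetical name:

```python
import hashlib
from pathlib import Path

def files_to_process(directory: Path, processed_hashes: set) -> list:
    """Illustrative resume filter: skip files whose content hash is committed."""
    pending = []
    for path in sorted(p for p in directory.iterdir() if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in processed_hashes:
            pending.append(path)  # not yet transferred — process it
    return pending
```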

| Situation | Action |
| --- | --- |
| Network failure mid-transfer | xferity resume |
| Process killed unexpectedly | xferity resume |
| Flow timed out during a large batch | xferity resume |
| Infrastructure maintenance window | xferity resume after maintenance |
| Investigating what would be re-transferred | xferity resume --dry-run |

| | xferity resume | xferity run |
| --- | --- | --- |
| Skips already-processed files | ✅ Yes | ✅ Yes (idempotency) |
| Starts from last committed state | ✅ Yes | ❌ Starts fresh |
| Safe after partial failure | ✅ Yes | ⚠️ Re-evaluates all files |

Use xferity resume after an interruption. Use xferity run to intentionally re-run from scratch.

Idempotency

Xferity uses SHA-256 content hashing to prevent duplicate file processing. This is configured per-flow:

idempotency_mode: hash # default — SHA-256 content hash

When a file is successfully transferred:

  1. Its SHA-256 hash is stored in the state backend
  2. On the next run (or resume), files matching a stored hash are skipped
  3. This works across restarts, retries, and even if the same filename reappears with different content
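The key point in step 3 is that the idempotency key is the content digest, not the filename. A small sketch (the `content_key` helper is hypothetical):

```python
import hashlib

def content_key(data: bytes) -> str:
    # Under hash mode, the key is the SHA-256 of the content, not the name.
    return hashlib.sha256(data).hexdigest()

processed = {content_key(b"payroll,100\n")}  # previously transferred content

# Same filename, different content -> different key -> reprocessed
assert content_key(b"payroll,200\n") not in processed
# Different filename, same content -> same key -> skipped
assert content_key(b"payroll,100\n") in processed
```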

To force reprocessing of a specific file, the idempotency record must be cleared manually (via the API or database).

Flow locking

Xferity prevents concurrent execution of the same flow using distributed locks. If a flow is already running and a second execution is attempted, the second execution waits up to max_lock_wait_seconds (default: 300) before failing.

security:
  max_lock_wait_seconds: 300
  lock_stale_after_seconds: 600

Stale locks (from crashed processes) are automatically released after lock_stale_after_seconds.
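The stale-lock rule can be sketched as follows. This models the documented behavior with an in-process dict; the real mechanism is a distributed lock, and `try_acquire` is a hypothetical name:

```python
def try_acquire(locks: dict, flow: str, now: float,
                stale_after: int = 600) -> bool:
    """Illustrative only: a lock older than stale_after is treated as stale."""
    held_since = locks.get(flow)
    if held_since is not None and now - held_since <= stale_after:
        return False  # healthy lock held by another execution
    locks[flow] = now  # free, or stale from a crashed process — take it
    return True
```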

Monitoring flow status

Via the web UI:

  • Dashboard — shows per-flow status and last execution result
  • Flow history — shows all execution attempts, timestamps, and outcomes

Via the CLI:

xferity flow status payroll-sftp-upload
xferity flow history payroll-sftp-upload

Via the API:

GET /api/flows/payroll-sftp-upload/status
GET /api/flows/payroll-sftp-upload/history

Dead letter handling

Files that fail to process after all retry attempts are moved to the dead letter directory (if configured):

local:
  dead_letter_dir: /var/lib/xferity/dead-letter
  dead_letter_max_bytes: 104857600  # 100MB — optional size limit

Files in the dead letter directory require manual intervention. Review the audit log for the error details before reprocessing.
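The dead-letter move with an optional size cap could look like this. A minimal sketch under the assumption that the cap applies to the directory's total size; `move_to_dead_letter` is a hypothetical name, not Xferity's API:

```python
import shutil
from pathlib import Path

def move_to_dead_letter(path: Path, dead_dir: Path, max_bytes=None) -> bool:
    """Illustrative: move a permanently failed file aside, honoring a size cap."""
    dead_dir.mkdir(parents=True, exist_ok=True)
    if max_bytes is not None:
        used = sum(p.stat().st_size for p in dead_dir.iterdir() if p.is_file())
        if used + path.stat().st_size > max_bytes:
            return False  # over the cap — leave the file for manual handling
    shutil.move(str(path), str(dead_dir / path.name))
    return True
```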