Retry and Recovery

Xferity executes transfers as durable jobs. Failed jobs can be recovered without reprocessing completed steps.

There are two distinct recovery mechanisms:

  1. Automatic retry — transient failures are retried automatically with exponential backoff
  2. Manual resume — interrupted flows can be re-entered from their last committed state via xferity resume

Xferity distinguishes between transient errors (retryable) and permanent errors (not retried).

Transient errors — retried automatically

These error conditions trigger a retry:

  • Connection reset by peer
  • Broken pipe
  • Connection refused
  • Connection aborted / closed
  • Timeout / timed out
  • I/O timeout
  • Handshake timeout
  • Temporary failure
  • Unexpected EOF

Network errors (net.Error with Timeout()) are always classified as transient.

Permanent errors — never retried

These error conditions are never retried, because retrying would not help:

  • Permission denied
  • No such file or directory
  • File not found
  • Host key fingerprint mismatch
  • Host key verification failed
  • Unable to authenticate
  • Authentication failed
  • Invalid credentials
  • Unsupported operation

If a permanent error occurs, the job fails immediately without exhausting its retry attempts.
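The classification above can be sketched as message matching. This is a minimal illustration built from the error strings listed in this section, not Xferity's actual implementation; the marker lists and the `is_transient` function are hypothetical names:

```python
# Illustrative sketch only — marker lists are taken from the docs above,
# not from Xferity's source code.
TRANSIENT_MARKERS = [
    "connection reset by peer", "broken pipe", "connection refused",
    "connection aborted", "connection closed", "timeout", "timed out",
    "i/o timeout", "handshake timeout", "temporary failure", "unexpected eof",
]
PERMANENT_MARKERS = [
    "permission denied", "no such file or directory", "file not found",
    "host key fingerprint mismatch", "host key verification failed",
    "unable to authenticate", "authentication failed", "invalid credentials",
    "unsupported operation",
]

def is_transient(message: str) -> bool:
    msg = message.lower()
    if any(marker in msg for marker in PERMANENT_MARKERS):
        return False  # permanent — fail immediately, don't retry
    return any(marker in msg for marker in TRANSIENT_MARKERS)
```

Note that permanent markers win over transient ones, matching the rule that a permanent error fails the job immediately.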

Retry behavior is configured at the worker level (applies to all jobs) and per-flow for SFTP:

Worker-level retry (PostgreSQL job queue):

worker:
  max_attempts: 3            # default: 3
  retry_backoff_base: 2s     # default: 2s — first retry delay
  retry_backoff_cap: 60s     # default: 60s — maximum delay between retries
  job_execution_timeout: 5m  # default: 5m — kill stalled jobs

SFTP per-flow retry:

sftp:
  retry:
    max_attempts: 5        # default: 5
    initial_delay_ms: 500  # default: 500ms
    max_delay_ms: 5000     # default: 5000ms

Retries use exponential backoff with jitter:

  1. Start with initial_delay_ms (or retry_backoff_base)
  2. Each retry doubles the delay
  3. Add a random jitter of up to 50% of the current delay
  4. Cap at max_delay_ms (or retry_backoff_cap)

Example with 3 attempts, 2s base, 60s cap:

  • Attempt 1 fails → wait ~2s ± jitter
  • Attempt 2 fails → wait ~4s ± jitter
  • Attempt 3 fails → job marked failed

The jitter prevents retry storms when multiple jobs fail simultaneously.
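The four backoff steps above can be modeled in a few lines. This is a sketch of the schedule as documented (doubling, up to 50% jitter, then cap), using the worker-level defaults; the function name is hypothetical:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry `attempt` (1-based) — illustrative only."""
    delay = base * (2 ** (attempt - 1))      # start at base, double each retry
    delay += random.uniform(0, delay * 0.5)  # add up to 50% jitter
    return min(delay, cap)                   # never exceed the cap
```

With the defaults, the first delay lands between 2s and 3s, the second between 4s and 6s, and large attempt numbers are clamped to the 60s cap.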

Manual resume

xferity resume re-enters a flow from its last committed state. It is safe to run at any time — it will not double-process files that have already been transferred.

# Resume all flows with incomplete state
xferity resume
# Resume a specific flow
xferity resume payroll-sftp-upload
# Dry-run — show what would be resumed without transferring
xferity resume payroll-sftp-upload --dry-run

Xferity tracks every successfully processed file by its SHA-256 content hash (when idempotency_mode: hash, the default). On resume:

  1. The flow is re-entered from the beginning
  2. Files whose SHA-256 hash is already in the processed set are skipped
  3. Only files not yet successfully transferred are processed
  4. Normal retry and commit behavior applies to the remaining files

This means resume is safe even if the failure happened in the middle of a directory with hundreds of files — only the unprocessed files are transferred.
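The resume-time skip can be pictured as filtering a directory against the set of committed hashes. This is an illustrative sketch of the behavior described in steps 1–3, not Xferity's code; `files_to_process` is a hypothetical name:

```python
import hashlib
from pathlib import Path

def files_to_process(directory: Path, processed_hashes: set) -> list:
    """Illustrative resume filter: skip files whose content hash is committed."""
    pending = []
    for path in sorted(p for p in directory.iterdir() if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in processed_hashes:
            pending.append(path)  # not yet transferred — process it
    return pending
```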

| Situation | Action |
| --- | --- |
| Network failure mid-transfer | xferity resume |
| Process killed unexpectedly | xferity resume |
| Flow timed out during a large batch | xferity resume |
| Infrastructure maintenance window | xferity resume after maintenance |
| Investigating what would be re-transferred | xferity resume --dry-run |

| | xferity resume | xferity run |
| --- | --- | --- |
| Skips already-processed files | ✅ Yes | ✅ Yes (idempotency) |
| Starts from last committed state | ✅ Yes | ❌ Starts fresh |
| Safe after partial failure | ✅ Yes | ⚠️ Re-evaluates all files |

Use xferity resume after an interruption. Use xferity run to intentionally re-run from scratch.

Idempotency

Xferity uses SHA-256 content hashing to prevent duplicate file processing. This is configured per-flow:

idempotency_mode: hash # default — SHA-256 content hash

When a file is successfully transferred:

  1. Its SHA-256 hash is stored in the state backend
  2. On the next run (or resume), files matching a stored hash are skipped
  3. This works across restarts, retries, and even if the same filename reappears with different content
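The key point in step 3 is that the idempotency key is the content digest, not the filename. A small sketch (the `content_key` helper is hypothetical):

```python
import hashlib

def content_key(data: bytes) -> str:
    # Under hash mode, the key is the SHA-256 of the content, not the name.
    return hashlib.sha256(data).hexdigest()

processed = {content_key(b"payroll,100\n")}  # previously transferred content

# Same filename, different content -> different key -> reprocessed
assert content_key(b"payroll,200\n") not in processed
# Different filename, same content -> same key -> skipped
assert content_key(b"payroll,100\n") in processed
```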

To force reprocessing of a specific file, the idempotency record must be cleared manually (via the API or database).

Flow locking

Xferity prevents concurrent execution of the same flow using distributed locks. If a flow is already running and a second execution is attempted, the second execution waits up to max_lock_wait_seconds (default: 300) before failing.

security:
  max_lock_wait_seconds: 300
  lock_stale_after_seconds: 600

Stale locks (from crashed processes) are automatically released after lock_stale_after_seconds.
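The stale-lock rule can be sketched as follows. This models the documented behavior with an in-process dict; the real mechanism is a distributed lock, and `try_acquire` is a hypothetical name:

```python
def try_acquire(locks: dict, flow: str, now: float,
                stale_after: int = 600) -> bool:
    """Illustrative only: a lock older than stale_after is treated as stale."""
    held_since = locks.get(flow)
    if held_since is not None and now - held_since <= stale_after:
        return False  # healthy lock held by another execution
    locks[flow] = now  # free, or stale from a crashed process — take it
    return True
```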

Monitoring flow status

Via the web UI:

  • Dashboard — shows per-flow status and last execution result
  • Flow history — shows all execution attempts, timestamps, and outcomes

Via the CLI:

xferity flow status payroll-sftp-upload
xferity flow history payroll-sftp-upload

Via the API:

GET /api/flows/payroll-sftp-upload/status
GET /api/flows/payroll-sftp-upload/history

Dead letter handling

Files that fail to process after all retry attempts are moved to the dead letter directory (if configured):

local:
  dead_letter_dir: /var/lib/xferity/dead-letter
  dead_letter_max_bytes: 104857600  # 100MB — optional size limit

Files in the dead letter directory require manual intervention. Review the audit log for the error details before reprocessing.
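The dead-letter move with an optional size cap could look like this. A minimal sketch under the assumption that the cap applies to the directory's total size; `move_to_dead_letter` is a hypothetical name, not Xferity's API:

```python
import shutil
from pathlib import Path

def move_to_dead_letter(path: Path, dead_dir: Path, max_bytes=None) -> bool:
    """Illustrative: move a permanently failed file aside, honoring a size cap."""
    dead_dir.mkdir(parents=True, exist_ok=True)
    if max_bytes is not None:
        used = sum(p.stat().st_size for p in dead_dir.iterdir() if p.is_file())
        if used + path.stat().st_size > max_bytes:
            return False  # over the cap — leave the file for manual handling
    shutil.move(str(path), str(dead_dir / path.name))
    return True
```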