Retry and Recovery
Xferity executes transfers as durable jobs. Failed jobs can be recovered without reprocessing completed steps.
There are two distinct recovery mechanisms:
- Automatic retry — transient failures are retried automatically with exponential backoff
- Manual resume — interrupted flows can be re-entered from their last committed state via `xferity resume`
Automatic retry
Xferity distinguishes between transient errors (retryable) and permanent errors (not retried).
Transient errors — retried automatically
These error conditions trigger a retry:
- Connection reset by peer
- Broken pipe
- Connection refused
- Connection aborted / closed
- Timeout / timed out
- I/O timeout
- Handshake timeout
- Temporary failure
- Unexpected EOF
Network errors (`net.Error` values whose `Timeout()` returns true) are always classified as transient.
Permanent errors — not retried
These error conditions are never retried because retrying would not help:
- Permission denied
- No such file or directory
- File not found
- Host key fingerprint mismatch
- Host key verification failed
- Unable to authenticate
- Authentication failed
- Invalid credentials
- Unsupported operation
If a permanent error occurs, the job fails immediately without exhausting its retry attempts.
Retry configuration
Retry behavior is configured at the worker level (applies to all jobs) and per-flow for SFTP:
Worker-level retry (PostgreSQL job queue):
```yaml
worker:
  max_attempts: 3           # default: 3
  retry_backoff_base: 2s    # default: 2s — first retry delay
  retry_backoff_cap: 60s    # default: 60s — maximum delay between retries
  job_execution_timeout: 5m # default: 5m — kill stalled jobs
```

SFTP per-flow retry:
```yaml
sftp:
  retry:
    max_attempts: 5       # default: 5
    initial_delay_ms: 500 # default: 500ms
    max_delay_ms: 5000    # default: 5000ms
```

Backoff algorithm
Retries use exponential backoff with jitter:
- Start with `initial_delay_ms` (or `retry_backoff_base`)
- Each retry doubles the delay
- Add a random jitter of up to 50% of the current delay
- Cap at `max_delay_ms` (or `retry_backoff_cap`)
Example with 3 attempts, 2s base, 60s cap:
- Attempt 1 fails → wait ~2s ± jitter
- Attempt 2 fails → wait ~4s ± jitter
- Attempt 3 fails → job marked failed
The jitter prevents retry storms when multiple jobs fail simultaneously.
Manual resume with xferity resume
`xferity resume` re-enters a flow from its last committed state. It is safe to run at any time — it will not double-process files that have already been transferred.
```sh
# Resume all flows with incomplete state
xferity resume

# Resume a specific flow
xferity resume payroll-sftp-upload

# Dry-run — show what would be resumed without transferring
xferity resume payroll-sftp-upload --dry-run
```

How resume works
Xferity tracks every successfully processed file by its SHA-256 content hash (when `idempotency_mode: hash`, the default). On resume:
- The flow is re-entered from the beginning
- Files whose SHA-256 hash is already in the processed set are skipped
- Only files not yet successfully transferred are processed
- Normal retry and commit behavior applies to the remaining files
This means resume is safe even if the failure happened in the middle of a directory with hundreds of files — only the unprocessed files are transferred.
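A minimal sketch of that skip logic, assuming the processed set is an in-memory map rather than Xferity's real state backend (file names and contents are made up for illustration):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fileHash returns the SHA-256 content hash used as the idempotency key.
func fileHash(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Hashes committed before the interruption.
	processed := map[string]bool{
		fileHash([]byte("january payroll rows")): true,
	}

	files := map[string][]byte{
		"jan.csv": []byte("january payroll rows"),  // already transferred
		"feb.csv": []byte("february payroll rows"), // still pending
	}
	for name, content := range files {
		if processed[fileHash(content)] {
			fmt.Println("skip:", name) // hash already in the processed set
			continue
		}
		fmt.Println("transfer:", name)
	}
}
```

Because the key is the content hash rather than the file name, a renamed copy of an already-transferred file is still skipped, while a same-named file with new content is transferred.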
When to use resume
| Situation | Action |
|---|---|
| Network failure mid-transfer | `xferity resume` |
| Process killed unexpectedly | `xferity resume` |
| Flow timed out during a large batch | `xferity resume` |
| Infrastructure maintenance window | `xferity resume` after maintenance |
| Investigating what would be re-transferred | `xferity resume --dry-run` |
Resume vs. re-run
| | `xferity resume` | `xferity run` |
|---|---|---|
| Skips already-processed files | ✅ Yes | ✅ Yes (idempotency) |
| Starts from last committed state | ✅ Yes | ❌ Starts fresh |
| Safe after partial failure | ✅ Yes | ⚠️ Re-evaluates all files |
Use `xferity resume` after an interruption. Use `xferity run` to intentionally re-run from scratch.
Idempotency and duplicate prevention
Xferity uses SHA-256 content hashing to prevent duplicate file processing. This is configured per-flow:

```yaml
idempotency_mode: hash # default — SHA-256 content hash
```

When a file is successfully transferred:
- Its SHA-256 hash is stored in the state backend
- On the next run (or resume), files matching a stored hash are skipped
- This works across restarts, retries, and even if the same filename reappears with different content
To force reprocessing of a specific file, the idempotency record must be cleared manually (via the API or database).
Flow locking
Xferity prevents concurrent execution of the same flow using distributed locks. If a flow is already running and a second execution is attempted, the second execution waits up to `max_lock_wait_seconds` (default: 300) before failing.
```yaml
security:
  max_lock_wait_seconds: 300
  lock_stale_after_seconds: 600
```

Stale locks (from crashed processes) are automatically released after `lock_stale_after_seconds`.
Checking job status
Via the web UI:
- Dashboard — shows per-flow status and last execution result
- Flow history — shows all execution attempts, timestamps, and outcomes
Via the CLI:
```sh
xferity flow status payroll-sftp-upload
xferity flow history payroll-sftp-upload
```

Via the API:

```
GET /api/flows/payroll-sftp-upload/status
GET /api/flows/payroll-sftp-upload/history
```

Dead letter handling
Files that fail to process after all retry attempts are moved to the dead letter directory (if configured):

```yaml
local:
  dead_letter_dir: /var/lib/xferity/dead-letter
  dead_letter_max_bytes: 104857600 # 100MB — optional size limit
```

Files in the dead letter directory require manual intervention. Review the audit log for the error details before reprocessing.