Investigating Failures
When a flow fails, the goal is to isolate what failed, why it failed, and whether it is safe to retry.
This page walks through the investigation workflow step by step.
Investigation order
Start with the broadest scope and narrow down. Do not assume the problem is a network issue or a partner issue until you have read the logs.
1. check `xferity flow status` for current state across all flows
2. check `xferity flow history <flow>` for the specific run outcome
3. read `xferity logs <flow>` for the failure detail
4. run `xferity diag <flow>` to check current endpoint and trust state
5. run `xferity trace <filename>` if you have a specific file to trace
6. check the posture page for any related security findings
7. escalate to partner if you have confirmed the problem is on their side
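
The read-only commands in this sequence can be collected in one pass, which is convenient when you need to attach evidence to a ticket. This is a minimal sketch that uses only subcommands named on this page; the function name `investigate_flow` and the section headers it prints are illustrative and not part of Xferity, and no output format is assumed.

```shell
# Hypothetical helper: run the read-only investigation commands in order.
# Only subcommands named on this page are used.
investigate_flow() {
  flow="$1"
  if [ -z "$flow" ]; then
    echo "usage: investigate_flow <flow>" >&2
    return 2
  fi

  echo "== status (all flows) =="
  xferity flow status

  echo "== history: $flow =="
  xferity flow history "$flow"

  echo "== logs: $flow =="
  xferity logs "$flow"

  echo "== diagnostics: $flow =="
  xferity diag "$flow"
}
```

For example, `investigate_flow supplier-invoice-pickup > investigation.txt 2>&1` keeps the whole pass in one file.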
CLI tools for investigation
```shell
# Check current status across all flows
xferity flow status

# View run history for a flow
xferity flow history supplier-invoice-pickup

# Read logs (most recent)
xferity logs supplier-invoice-pickup

# Run diagnostics
xferity diag supplier-invoice-pickup

# Trace a specific file in audit records
xferity trace invoice-2026-03-15.xml

# Validate configuration
xferity validate
```

Failure scope
Identify the scope first. It determines whether the fix belongs in global config, partner config, flow config, or an external dependency.
| Symptom | Probable scope |
|---|---|
| Multiple flows failing at startup | global config error |
| One partner failing across multiple flows | partner definition or endpoint issue |
| One flow failing, others fine | flow config, PGP material, or flow-specific endpoint |
| Intermittent failure matching partner schedule | endpoint intermittency or remote file availability |
| Failure after a config change | the changed config |
| Failure after a cert rotation | certificate re-binding or trust verification |
| Failure after a secret rotation | secret not resolving to new value |
Config and validation failures
Symptoms: startup fails, flow does not load, YAML parse error in logs.
Check:
- `xferity validate` output
- YAML field names for typos (the strict parser rejects unknown fields)
- referenced paths and keys exist
- hardened mode constraints are satisfied
Common config mistakes:
- `sftp.known_hosts` path missing the `file:` prefix
- `sftp.host_key_fingerprint` not starting with `SHA256:`
- `schedule_cron` using a five-field expression instead of a six-field one
- partner `id` not matching the filename
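
Three of these mistakes can be caught with plain string checks before a restart. The helpers below are an illustrative sketch and not part of Xferity: the function names are made up, and you pass in values copied from your own config. `xferity validate` remains the authoritative check.

```shell
# Hypothetical pre-checks for values copied out of a config file.

# Print the number of whitespace-separated fields in a cron expression.
cron_field_count() {
  # awk splits on whitespace and never glob-expands "*".
  printf '%s\n' "$1" | awk '{ print NF }'
}

# Succeed if the value in $1 starts with the literal prefix in $2.
has_prefix() {
  case "$1" in
    "$2"*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Report the string-level mistakes for the given values.
check_common_mistakes() {
  known_hosts="$1"; fingerprint="$2"; cron="$3"
  rc=0
  has_prefix "$known_hosts" "file:" \
    || { echo "known_hosts path is missing the file: prefix"; rc=1; }
  has_prefix "$fingerprint" "SHA256:" \
    || { echo "host_key_fingerprint does not start with SHA256:"; rc=1; }
  [ "$(cron_field_count "$cron")" -eq 6 ] \
    || { echo "schedule_cron does not have six fields"; rc=1; }
  return "$rc"
}
```

The function prints one line per problem found and returns non-zero if any check fails, so it can gate a deploy script.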
Secret resolution failures
Symptoms: flow fails before any network action, error mentions credential or secret.
Check:
- env variable is set in the running process environment
- file path is mounted and readable
- vault or AWS Secrets Manager is reachable and the token/role has access
- secret reference uses the correct prefix (`env:`, `file:`, `vault:`, etc.)
- in hardened mode, plaintext values in sensitive fields are rejected
Use `xferity diag <flow>`; its diagnostics include a credential resolution check.
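
For `env:` and `file:` references, the first two checks can also be reproduced by hand. This is a minimal sketch under stated assumptions: it assumes a reference is written as the prefix followed directly by the variable name or path (for example `env:SFTP_PASSWORD`, a made-up name), and it only sees the environment of the shell it runs in, so run it in the same container or service context as Xferity. `check_secret_ref` is a hypothetical helper.

```shell
# Hypothetical helper: check that an env: or file: secret reference resolves
# in the *current* shell environment. It never prints the secret value.
check_secret_ref() {
  ref="$1"
  case "$ref" in
    env:*)
      name="${ref#env:}"
      # printenv exits non-zero when the variable is not set.
      if printenv "$name" >/dev/null 2>&1; then
        echo "ok: environment variable $name is set"
      else
        echo "missing: environment variable $name is not set"
        return 1
      fi
      ;;
    file:*)
      secret_path="${ref#file:}"
      if [ -r "$secret_path" ]; then
        echo "ok: $secret_path is readable"
      else
        echo "missing: $secret_path does not exist or is not readable"
        return 1
      fi
      ;;
    *)
      # Covers vault: and other backends, and plaintext values.
      echo "skipped: not an env: or file: reference"
      return 2
      ;;
  esac
}
```

A return code of 2 means the helper had nothing to check locally; vault and AWS Secrets Manager reachability still need to be verified separately.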
SSH host verification failures (SFTP)
Symptoms: SFTP connection error mentioning host key.
Check:
- `known_hosts` file exists and is readable
- the host key in the file is current (partner may have rotated keys)
- `host_key_fingerprint` matches the actual server fingerprint
- if the partner changed SSH host keys, update the `known_hosts` entry

Do not set `allow_insecure_host_key=true` as a fix unless you also track it as an accepted finding.
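
To see what the server actually presents, standard OpenSSH tools are enough. The sketch below assumes OpenSSH's `ssh-keyscan` and `ssh-keygen` are installed and that the server offers an Ed25519 host key (change `-t` otherwise); the function names are illustrative. A scan over the network is not authenticated, so confirm a changed fingerprint with the partner out of band before updating anything.

```shell
# Print the SHA256 fingerprint the server presents right now.
# Assumes an Ed25519 host key; adjust -t for rsa or ecdsa.
server_fingerprint() {
  host="$1"; port="${2:-22}"
  ssh-keyscan -p "$port" -t ed25519 "$host" 2>/dev/null \
    | ssh-keygen -l -E sha256 -f - \
    | awk '{ print $2 }'
}

# Compare the configured fingerprint with the one observed on the wire.
compare_fingerprints() {
  configured="$1"; observed="$2"
  case "$configured" in
    SHA256:*) ;;
    *) echo "configured fingerprint does not start with SHA256:"; return 2 ;;
  esac
  if [ "$configured" = "$observed" ]; then
    echo "match"
  else
    echo "mismatch: configured $configured, server presents $observed"
    return 1
  fi
}
```

A typical call would be `compare_fingerprints "$configured_fp" "$(server_fingerprint sftp.example.com)"`, where `configured_fp` holds the value from your config and the hostname is a placeholder.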
FTPS TLS failures
Symptoms: TLS handshake failure, certificate validation error.
Check:
- CA certificate chains are correct and complete
- server certificate is not expired
- `tls.mode` is `explicit` (implicit mode is not supported)
- `connection.passive=true` is set
- if using `server_cert_fingerprint`, it matches the current server cert
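
The certificate checks can be run from the Xferity host with the OpenSSL command line. This is a sketch, assuming `openssl` is installed and the server accepts explicit FTPS on the given port. The function names are illustrative, and comparing fingerprints as normalised lowercase hex is an assumption, since this page does not specify the format Xferity expects for `server_cert_fingerprint`.

```shell
# Fetch the server certificate over explicit FTPS (AUTH TLS) and print its
# expiry date and SHA-256 fingerprint.
ftps_cert_summary() {
  host="$1"; port="${2:-21}"
  openssl s_client -connect "$host:$port" -starttls ftp </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate -fingerprint -sha256
}

# Reduce a hex fingerprint to lowercase hex without separators, so that two
# spellings of the same value compare equal.
# Strips an optional "... Fingerprint=" label and ":" separators.
normalize_fingerprint() {
  printf '%s\n' "$1" | sed 's/^.*=//' | tr -d ':' | tr 'A-F' 'a-f'
}
```

Run `ftps_cert_summary` against the partner host, then pass both the fingerprint line it prints and the configured value through `normalize_fingerprint` before comparing them.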
AS2 failures
Symptoms: AS2 message rejected, MDN error, signing or encryption failure.
Check:
- partner AS2 ID in the config matches what the partner expects
- certificate roles are correctly bound in the Certificate inventory
- MDN signing is expected if `expect_signed_mdn=true`
- the receiving endpoint URL is reachable from Xferity
- HTTPS trust for the AS2 endpoint is configured
PGP decryption or encryption failures
Symptoms: crypto error in logs, `compat_enterprise_key_structure` mentioned.
Check:
- the key material is present and readable
- the passphrase resolves correctly
- the key has not expired
- if using `provider=auto`, check whether fallback to GnuPG was attempted
- if fallback occurred: confirm GnuPG is installed and the `gnupg_binary` path is correct

For `compat_enterprise_key_structure`:
- this is a named compatibility case, not a bad key
- the native provider could not handle the key layout
- GnuPG fallback should handle it if configured correctly
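
If the logs show that GnuPG fallback was attempted, a quick way to rule out a bad `gnupg_binary` value is to run the binary at that path yourself. This is a minimal sketch; `gnupg_binary_ok` is a hypothetical helper, and it only proves the binary exists and runs for the current user, which may differ from the account Xferity runs as.

```shell
# Hypothetical helper: confirm the configured GnuPG binary exists and runs.
gnupg_binary_ok() {
  bin="$1"
  if [ ! -x "$bin" ]; then
    echo "not executable: $bin"
    return 1
  fi
  # --version is a standard GnuPG option; keep only its first line.
  version="$("$bin" --version 2>/dev/null | head -n 1)"
  if [ -n "$version" ]; then
    echo "ok: $version"
  else
    echo "failed to run: $bin --version"
    return 1
  fi
}
```

To check key expiry with the same binary, recent GnuPG 2.x releases can print a key file's details, including any expiry date, with `gpg --show-keys <keyfile>` without importing it.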
Flow locking issues
Symptoms: flow fails with lock error, run does not start.
Check:
- whether a previous run is still holding the lock
- `lock_stale_after_seconds`: if the previous run died mid-execution, the lock may still exist
- whether `lock_wait=true` is configured and whether max wait was exceeded
A stale lock means the previous run did not complete cleanly. Investigate why before clearing the lock.
Dead-letter artifacts
When a file ends up in the dead-letter path, the flow could not process it after exhausting retries.
Check:
- filename and timestamp to correlate with log entries
- log entries at the time of the failure
- whether the underlying issue is transient (e.g., partner downtime) or permanent (e.g., corrupt file)
Do not delete dead-letter files without understanding the failure first.
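
Correlating a dead-letter file with its log entries is mostly a filename search. This is a sketch, assuming `xferity logs <flow>` writes plain text to standard output and that log lines include the filename; neither is confirmed on this page, and `log_lines_for_file` is a hypothetical helper. `xferity trace <filename>` remains the direct route to the audit records.

```shell
# Hypothetical helper: show log lines for a flow that mention a given file.
log_lines_for_file() {
  flow="$1"; file_path="$2"
  # Match on the base name only, as a fixed string rather than a regex.
  name="$(basename "$file_path")"
  xferity logs "$flow" | grep -F -- "$name"
}
```

The function returns non-zero when no log line mentions the file, which is itself a useful signal that the logs you are reading do not cover the failed run.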
Narrowing and escalation
After steps 1–6 above:
- if the fault is in Xferity config: fix, validate, and rerun
- if the fault is in the partner endpoint: confirm from logs, then contact the partner with evidence
- if the fault is transient and the cause is resolved: check whether rerun is safe with idempotency in mind