Building Resumable Uploads That Survive AI Processing Outages


2026-03-01
11 min read

A developer's guide to building resumable uploads and durable queues so AI/CDN outages (Grok, Cloudflare) don't break file processing.

Stop losing files and time when AI or CDN providers go down

Facing frustrated users, audit headaches, and manual re-uploads after a vendor outage? In 2025–2026 we've seen large-scale incidents — from Cloudflare interruptions that knocked services offline to AI provider outages and moderation controversies around Grok — that make one thing obvious: file upload pipelines that assume always-on third-party processing will fail users and compliance checks.

This guide shows how to design and implement resumable uploads and durable queues so uploads complete reliably, processing can be deferred and retried safely, and your file upload API keeps predictable semantics even during third-party outages.

Executive summary — what to build

  • Client-side resumable uploads with chunking, checksums, and resume manifests so uploads finish even across network interruptions.
  • Server-side resumable endpoints that persist state and stitch chunks into durable objects (S3, object store) atomically.
  • Durable processing queues (SQS, Redis Streams, Kafka) to decouple upload completion from AI processing.
  • Strong retry semantics and idempotency so repeated processing attempts don't produce duplicate side effects.
  • Circuit breakers, backoffs, and alternate paths when third-party AI/CDN providers are degraded.
  • Access controls and staging — do not serve files publicly until moderation succeeds or is deferred with safeguards.

Late 2025 and early 2026 saw multiple high-profile outages and moderation incidents. Cloudflare's incidents cascaded into platform downtime; xAI's Grok was involved in high-profile content moderation lawsuits and availability concerns. These events pushed engineering teams to assume external AI and CDN services will be unavailable at times, and to design for graceful degradation.

"X Is Down: Problems stemmed from the cybersecurity services provider Cloudflare" — January 2026 news

The practical result: more customers demand systems that complete uploads locally, stash the asset safely, and defer processing until external services recover — without leaking content or requiring manual intervention.

Core architecture

Design a pipeline with clear separation of concerns:

  1. Ingest — resumable upload API that stores chunks and persists upload metadata.
  2. Store — durable object storage (S3, GCS, Ceph) with immutability options and server-side encryption.
  3. Queue — durable message queue for processing jobs, with retry and dead-letter handling.
  4. Process — worker pool that calls AI services (virus scan, moderation). Workers implement idempotency and circuit-breakers.
  5. Serve — only expose files publicly if they pass checks, otherwise restrict and log.

Data model (simple)

Persist a small upload record per file:

  • upload_id (UUID)
  • uploader_id
  • state: uploading | uploaded | processing | approved | rejected | deferred
  • chunks: map of offsets -> checksum/status
  • file_sha256 and size
  • processing_attempts and last_error
  • created_at, updated_at

Client-side resumable uploads — practical recipe

Goal: let clients upload large files across flaky networks and resume without restarting. Use chunked uploads with checksums and a small session manifest.

Session lifecycle

  1. Client requests a session: POST /uploads → returns upload_id, chunk_size, upload_url.
  2. Client uploads numbered chunks (PATCH /uploads/{id}/chunks or PUT to presigned URLs).
  3. Client verifies server-reported offsets and checksums; resumes missing chunks.
  4. Client finalizes: POST /uploads/{id}/complete.

Why not only TUS?

TUS is a well-established standard for resumable uploads. It works, but you may need custom fields: hashing, idempotency keys, access controls, or integration with your queue. Use TUS if it fits; otherwise implement the simple pattern below.

Client example (fetch + exponential retry)

// simplified browser pseudo-code; sha256, sleep, and expBackoffMs are helpers
async function uploadChunk(uploadUrl, uploadId, chunk, offset, maxRetries = 5) {
  const checksum = await sha256(chunk);
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(`${uploadUrl}/chunks`, {
        method: 'PATCH',
        headers: {
          'Content-Type': 'application/octet-stream',
          'Upload-Offset': String(offset),
          'Upload-Checksum': checksum,
          'Idempotency-Key': `${uploadId}:${offset}`
        },
        body: chunk
      });
      if (res.ok) return await res.json();
      if (res.status >= 400 && res.status < 500) {
        // client errors are permanent — retrying won't help
        throw Object.assign(new Error(`Client error ${res.status}`), { permanent: true });
      }
      // 5xx: fall through and retry with backoff
    } catch (err) {
      if (err.permanent) throw err; // don't retry permanent failures
    }
    await sleep(expBackoffMs(attempt));
  }
  throw new Error('Max retries exhausted');
}

Client resume strategy

  • Persist session metadata (upload_id, uploaded offsets, file_sha256) in local storage.
  • On restart, query GET /uploads/{id}/status to learn which chunks are present.
  • Only re-upload missing chunks; use Idempotency-Key per chunk so retries are safe.
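The gap computation behind "only re-upload missing chunks" can be a pure function; a minimal sketch, assuming fixed-size chunks addressed by byte offset:

```javascript
// Given the server-reported offsets and the file dimensions, compute which
// chunk offsets still need uploading. Assumes fixed-size chunks except
// possibly the last one.
function missingChunks(fileSize, chunkSize, serverOffsets) {
  const have = new Set(serverOffsets);
  const missing = [];
  for (let offset = 0; offset < fileSize; offset += chunkSize) {
    if (!have.has(offset)) missing.push(offset);
  }
  return missing;
}
```

On restart, feed the offsets from GET /uploads/{id}/status into this function and upload only what it returns.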

Server-side resumable endpoints — durable and consistent

Server endpoints must persist partial state in a durable store (RDS, DynamoDB, or Postgres) and avoid relying on in-memory state that disappears on pod restarts.

Chunk ingestion patterns

  • Store chunk metadata (offset, length, checksum) in DB; keep the chunk in object storage (S3 multipart or temp-prefixed objects).
  • Use object storage multipart upload for large files to avoid assembling bytes in the app node.
  • Commit the multipart only after all chunks verified and client finalizes.

Example server flow (outline)

// Pseudocode for PATCH /uploads/{uploadId}/chunks
validateAuth();
const upload = await db.getUpload(uploadId);
assert(upload.state === 'uploading');
const offset = Number(req.headers['upload-offset']);
const checksum = req.headers['upload-checksum'];
if (!validateOffset(upload, offset)) return 409;
// Save chunk to temp storage (S3 part or object with key uploadId/offset)
await s3.putChunk(uploadId, offset, req.body);
await db.insertChunkRecord(uploadId, offset, checksum);
return 200;

Finalize and enqueue

On client finalize (POST /uploads/{id}/complete):

  1. Verify all chunk checksums and compute final file SHA256.
  2. Complete S3 multipart commit or assemble object atomically.
  3. Set upload.state = uploaded and enqueue a processing job with file_sha256 and idempotency key.

Durable queues and processing — resilient to AI downtime

Decouple processing from upload. Even if an AI vendor is down, the upload pipeline should finish and processing should resume later without lost state.

Queue choice

  • Cloud: AWS SQS (visibility timeout, DLQ), GCP Pub/Sub, Azure Service Bus.
  • Self-hosted: Kafka (partitioning & retention), Redis Streams (consumer groups), RabbitMQ.

Design requirements

  • Durability: messages survive broker restarts.
  • Visibility & leasing: worker has time to process before the message returns.
  • Dead-letter queue: messages that repeatedly fail go to DLQ for manual inspection.
  • Idempotent processing: processing is safe to repeat.
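A toy in-memory queue makes the leasing and dead-letter contract concrete (illustration only — use SQS, Kafka, or similar in production):

```javascript
// Toy in-memory queue illustrating lease/ack/nack and dead-lettering.
// Real brokers add durability, delayed redelivery, and visibility timeouts.
class LeaseQueue {
  constructor(maxAttempts = 3) {
    this.ready = [];
    this.dlq = [];
    this.maxAttempts = maxAttempts;
  }
  enqueue(job) { this.ready.push({ ...job, attempts: 0 }); }
  lease() { return this.ready.shift() ?? null; } // caller owns the job until ack/nack
  ack(job) { /* job done; nothing to clean up in this toy model */ }
  nack(job) {
    job.attempts += 1;
    if (job.attempts >= this.maxAttempts) this.dlq.push(job); // give up: dead-letter
    else this.ready.push(job); // requeue; real brokers delay redelivery
  }
}
```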

Processing worker responsibilities

  1. Lease the job, record the attempt, call the AI service, then classify any failure as transient or permanent.
  2. If AI service returns 5xx or network error, requeue with exponential backoff and increment attempt count.
  3. If AI returns deterministic rejection (policy fail), mark file rejected or flagged.
  4. If processing succeeds, mark approved and publish events to downstream (CDN, notifications).

Backoff and retry semantics

Use exponential backoff with jitter. Example schedule:

  • Attempt 1: immediate
  • Attempt 2: 30s + jitter
  • Attempt 3: 2m + jitter
  • Attempt 4: 10m + jitter
  • Attempt N: escalate to DLQ after configurable max_attempts (e.g., 10)

Differentiate permanent errors (invalid file, 4xx) vs transient (timeouts, 5xx). Only transient errors merit requeueing.
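Both pieces fit in a few lines; a minimal sketch with full-jitter backoff and a transient classifier (base and cap values are illustrative, not prescriptive):

```javascript
// Exponential backoff with "full jitter": delay is uniform in [0, cappedExp).
function computeBackoffMs(attempt, baseMs = 30_000, capMs = 600_000) {
  const exp = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * exp);
}

// 4xx (except 429) is a permanent client error; 5xx, 429, and network
// failures (no status at all) are worth retrying.
function isTransient(status) {
  if (status == null) return true;  // network error / timeout
  if (status === 429) return true;  // rate limited
  return status >= 500;
}
```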

Sample retry pseudocode

async function processJob(job) {
  try {
    const res = await callAiService(job.fileUrl);
    if (isSuccess(res)) return markApproved(job);
    if (isPermanentError(res)) return markRejected(job, res.error);
    // transient service error: requeue with the attempt count bumped
    await queue.requeue({ ...job, attempt: job.attempt + 1 }, computeBackoff(job.attempt));
  } catch (err) {
    // network/transient failure
    await queue.requeue({ ...job, attempt: job.attempt + 1 }, computeBackoff(job.attempt));
  }
}

Idempotency: the key to safe retries

Idempotency prevents duplicate side effects when processing or uploading retries occur. Use a deterministic idempotency key for each logical operation.

  • For uploads: idempotency-key = upload_id + chunk_offset.
  • For processing jobs: idempotency-key = file_sha256 or upload_id.
  • Store processed results keyed by the idempotency key; if seen, return cached result.

Persist the idempotency store in the same durable DB used for metadata. Keep keys with TTL long enough for retries and audits.
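A small wrapper shows the pattern; a `Map` stands in here for the durable table described above:

```javascript
// Idempotency wrapper: the first call with a key runs the work and caches
// the result; replays return the cached result without re-running side effects.
class IdempotencyStore {
  constructor() { this.results = new Map(); }
  async run(key, work) {
    if (this.results.has(key)) return this.results.get(key);
    const result = await work();
    this.results.set(key, result); // in production: write-once DB row with TTL
    return result;
  }
}
```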

Handling third-party AI outages (Grok downtime and similar)

When the AI vendor is down, your system should:

  • Finish the file upload and set state = uploaded.
  • Enqueue processing and mark job as pending; do not make the file public.
  • Expose a clear status to users: "Processing delayed due to vendor outage" and estimated retry.
  • Retain the file in restricted storage and audit access attempts.

Circuit breaker pattern

Implement a circuit breaker around each external AI provider. If error rates exceed a threshold, open the circuit and stop calling that provider for a cooling period:

  • Failure threshold: e.g., 5 errors within 60s.
  • Open duration: e.g., 3 minutes + exponential growth if failures persist.
  • Once open, route to alternative provider, enqueue for retry, or escalate to human review.
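A minimal count-based breaker illustrates the state machine (a simplification of the sliding 60-second window above; the clock is injectable for testing):

```javascript
// Count-based circuit breaker: closed -> open after N consecutive failures,
// open -> half-open after the cooling period, half-open -> closed on success.
class CircuitBreaker {
  constructor({ failureThreshold = 5, openMs = 180_000 } = {}, now = Date.now) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }
  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.openMs ? 'half-open' : 'open';
  }
  allowRequest() { return this.state !== 'open'; }
  recordSuccess() { this.failures = 0; this.openedAt = null; }
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}
```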

Alternate strategies when AI is down

  • Failover to another AI vendor if available (multi-provider strategy).
  • Run minimal on-prem or edge models for coarse checks (face-detect, nudity heuristics) to allow low-risk serving.
  • Queue for deferred deep inspection; mark content access as restricted until cleared.

CDN outages and serving strategy (Cloudflare example)

Cloudflare and other CDNs can introduce global outages. Protect against this by:

  • Using origin fallback — keep signed URLs that bypass CDN if CDN health check fails.
  • Serving critical assets from multiple regions or providers to avoid single CDN dependency.
  • Keeping signed-URL lifetimes short so URLs can be reissued quickly if the CDN must be bypassed during an outage.

Access control and safety during downtime

Never make user-uploaded content publicly accessible before it passes required checks. Recommended steps:

  • Store uploads in private buckets; use short-lived signed URLs for previews if needed.
  • Show a clear UI state: "Upload complete — pending moderation".
  • Log all access attempts and provide an audit trail for compliance (GDPR/HIPAA where applicable).

Observability and SLAs

Instrument these signals:

  • Upload success rate and median time to complete upload.
  • Queue depth and oldest message age — key indicator of downstream outages.
  • AI call error rate, latency, and circuit-breaker state per provider.
  • Number of messages in DLQ and reasons.

Create alerts on queue depth or oldest-age thresholds to avoid silent backlogs. Define SLOs for end-to-end time: upload-to-moderation-success and upload-to-public-serve.
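A simple alert predicate over those signals might look like this (threshold values are placeholders; tune them to your traffic):

```javascript
// Evaluate queue health against thresholds and return the alert names to fire.
function queueAlerts({ depth, oldestAgeSec, dlqCount },
                     { maxDepth = 1000, maxAgeSec = 900, maxDlq = 0 } = {}) {
  const alerts = [];
  if (depth > maxDepth) alerts.push('queue_depth');
  if (oldestAgeSec > maxAgeSec) alerts.push('oldest_message_age');
  if (dlqCount > maxDlq) alerts.push('dlq_nonempty');
  return alerts;
}
```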

Security, privacy, and compliance

When processing can be deferred, retain encryption and access controls at every step. Practical controls:

  • Server-side encryption for object storage and encrypted transport (TLS 1.3).
  • Role-based access for workers that call third-party AI services; log minimal PII to providers.
  • Data retention policy and purge for files deferred too long (e.g., policy: defer > 30 days → review & delete).
  • Audit logs to support legal requests and incident investigations (example: Grok-related lawsuits require auditability of prompts/outputs).

Real-world scenario: 2 GB video, AI vendor outage

User uploads a 2 GB event video across a flaky mobile network. They use your client which implements chunked resumable uploads. Network drops; client resumes — upload completes in background.

On finalize, server commits the multipart object and enqueues a job: content-moderation(file_sha256, upload_id). The moderation worker calls Grok-style moderation and sees 503 errors. The circuit breaker opens and the job is requeued with exponential backoff.

Meanwhile, the user's UI shows "Processing delayed due to moderation vendor outage" and retains the file in a private bucket. If the outage persists beyond your SLA, the file goes to a human review queue or failover provider. No user-facing public link is created until approval.

Advanced strategies & 2026 predictions

  • Multi-provider orchestration will be mainstream in 2026: route low-latency checks to edge models and heavy checks to vendors.
  • Standardized retry semantics and SLA metadata from AI vendors will emerge, reducing custom circuit-breaker work.
  • On-device and edge AI for pre-filtering will reduce latency and vendor dependence for sensitive content.
  • Legal pressure after incidents (e.g., deepfake lawsuits) will force stricter audit requirements for moderation calls and storage of prompts/outputs.

Checklist: Deployable steps you can start with today

  1. Enable resumable uploads for files > 10 MB. Use chunking + checksum + Idempotency-Key.
  2. Persist upload state in durable DB and use S3 multipart to avoid in-memory assembly.
  3. Decouple processing using a durable queue with DLQ and visibility timeouts.
  4. Implement idempotency for processing jobs keyed by upload_id/file_sha256.
  5. Add circuit breakers and backoff with jitter for each external AI provider.
  6. Keep uploads private until moderation passes; provide clear UI status for users.
  7. Monitor queue depth, job age, AI error rates, and set alerts for SLA breaches.

Quick reference: headers and API patterns

  • Start session: POST /uploads — returns upload_id, recommended chunk_size.
  • Upload chunk: PATCH /uploads/{id}/chunks with headers Upload-Offset, Upload-Checksum, Idempotency-Key.
  • Query status: GET /uploads/{id}/status — returns uploaded offsets and checksums.
  • Finalize: POST /uploads/{id}/complete — server verifies chunks, commits object, enqueues processing job.
  • Processing job: message { upload_id, file_sha256, idempotency_key, attempts } in queue store.

Closing — actionable takeaways

Start by decoupling: ensure uploads succeed independently of processing. Use resumable client uploads plus durable server-side storage. Enqueue processing jobs and implement robust retry and idempotency logic so third-party AI or CDN outages (like recent Cloudflare or Grok interruptions) do not lose data or create inconsistent states. Protect access to files until checks pass, and instrument queue depth and oldest-job-age as your primary outage indicators.

If you take one action today: implement server-persisted upload sessions and a durable queue with DLQ. You will stop user-facing failures and gain the time to run remediation and auditing workflows for AI vendor incidents.

Call to action

Ready to implement resumable uploads and durable queues in your stack? Start a proof-of-concept: add chunked resumable endpoints, persist upload sessions, and wire a durable queue for processing. If you want sample code tailored to your stack (Node, Go, Python) or an architecture review for handling AI/CDN outage scenarios, contact our engineering team or try a free trial of our upload API to test resumable flows end-to-end.
