Architecture

This document describes the current package shape and the main runtime boundaries in the implementation that ships today.

Status: reflects the current public-package implementation.

Package shape

batchor/
  .agents/
  docs/
  plugins/
    batchor/
    batchor-agent-tools/
  src/batchor/
    artifacts/
    cli.py
    core/
    providers/
    runtime/
    sources/
    storage/
  tests/

Agent tooling lives outside the shipped Python package so Python installation does not mutate or assume a particular coding-agent environment:

.agents/skills/batchor-dev/ contains the repo skill for AI-agent onboarding
.agents/plugins/marketplace.json registers repo-local plugins for Codex-style discovery
plugins/batchor-agent-tools/ contains contributor-only repo guidance and validation tools
plugins/batchor/ contains the downstream user skill and repo-independent workflow MCP for researchers and agents integrating Batchor into other projects

These directories are not part of the published src/batchor package. PyPI distributes the runtime and CLI; the agent marketplace distributes agent-specific skills and MCP configuration.

The package is organized around one core concern: durable batch execution. Most modules exist to support one of five responsibilities:

domain models
request/provider adaptation
execution orchestration
input streaming
durable state and artifacts

High-Level Architecture

graph TB
    User["User / CLI"]

    subgraph runtime["runtime/"]
        BatchRunner
        Run["Run handle"]
    end

    subgraph core["core/"]
        BatchJob
        BatchItem
        Models["Models, Enums, Exceptions"]
    end

    subgraph providers["providers/"]
        BatchProvider["BatchProvider (ABC)"]
        OpenAIProvider["OpenAIBatchProvider"]
        AnthropicProvider["AnthropicBatchProvider"]
        GeminiProvider["GeminiBatchProvider"]
        ProviderRegistry
    end

    subgraph sources["sources/"]
        ItemSource["ItemSource (ABC)"]
        FileAdapters["CSV / JSONL / Parquet"]
    end

    subgraph storage["storage/"]
        StateStore["StateStore (ABC)"]
        SQLiteStorage
        PostgresStorage
        MemoryStateStore
        StorageRegistry
    end

    subgraph artifacts["artifacts/"]
        ArtifactStore["ArtifactStore (ABC)"]
        LocalArtifactStore
    end

    User -->|"start() / run_and_wait()"| BatchRunner
    BatchRunner --> Run
    BatchRunner -->|reads| BatchJob
    BatchJob -->|contains| BatchItem
    BatchJob -->|uses| ItemSource
    FileAdapters -.->|implements| ItemSource
    BatchRunner -->|persists state| StateStore
    SQLiteStorage -.->|implements| StateStore
    PostgresStorage -.->|implements| StateStore
    MemoryStateStore -.->|implements| StateStore
    BatchRunner -->|submits/polls| BatchProvider
    OpenAIProvider -.->|implements| BatchProvider
    AnthropicProvider -.->|implements| BatchProvider
    GeminiProvider -.->|implements| BatchProvider
    BatchRunner -->|stores artifacts| ArtifactStore
    LocalArtifactStore -.->|implements| ArtifactStore
    BatchRunner -->|creates via| ProviderRegistry
    BatchRunner -->|creates via| StorageRegistry

This is the canonical diagram page for the current package shape. Keep the README diagrams compact and reader-facing; put the detailed runtime and module boundary view here.

Core runtime concepts

The public runtime model centers on four types:

BatchItem: one logical unit of work with stable item_id, application payload, and optional metadata.
BatchJob: the declarative execution plan that bundles items or an ItemSource, prompt-building logic, provider config, and retry/artifact policy.
BatchRunner: the orchestrator that resolves implementations, persists run state, builds or replays request artifacts, submits provider batches, polls remote status, and writes terminal results back into durable storage.
Run: the durable handle returned by start() or get_run() for refresh, wait, pause/resume/cancel, terminal result reads, and artifact export/prune.

CompositeItemSource keeps the runner contract narrow: the runner still sees one logical source, while callers remain responsible for selecting and ordering the child sources up front.

Provider adaptation is intentionally concentrated behind BatchProvider. The runtime stores one durable internal custom identifier per item attempt, while each provider maps that identifier to its own request shape. OpenAI and Anthropic use custom_id; the Gemini Developer API uses key; Vertex AI uses a request label that is returned with the original request in GCS output. Provider hooks also own response-text extraction so structured-output parsing can stay generic across provider payload shapes.
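
The identifier mapping described above can be pictured with a small sketch. It is not batchor's provider code: the provider keys, the `ID_FIELD_BY_PROVIDER` table, and `attach_item_identifier` are hypothetical names, and only the field names `custom_id` and `key` come from the paragraph above. Vertex AI is left out because the shape of its request label is not spelled out here.

```python
from typing import Any

# Hypothetical lookup table. The field names are taken from the text above;
# the provider keys are illustrative only.
ID_FIELD_BY_PROVIDER = {
    "openai": "custom_id",
    "anthropic": "custom_id",
    "gemini_developer": "key",
}


def attach_item_identifier(
    provider: str, internal_id: str, row: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of a request row that carries the internal identifier."""
    try:
        field = ID_FIELD_BY_PROVIDER[provider]
    except KeyError:
        raise ValueError(f"no identifier mapping for provider {provider!r}") from None
    return {field: internal_id, **row}


gemini_row = attach_item_identifier(
    "gemini_developer", "item-7:attempt-1", {"request": {}}
)
openai_row = attach_item_identifier("openai", "item-7:attempt-1", {"body": {}})
```

Keeping the mapping in one place is what lets the rest of the runtime treat the identifier as opaque.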

Main user-facing flow

The normal public flow is:

Construct a BatchJob.
Create a BatchRunner.
Call start() or run_and_wait().
Work with the returned Run.

Internally that expands to:

Resolve provider and storage implementations.
Persist run config and ingest items into durable state.
On resume, poll any already-active provider batches before ingesting or submitting new work.
Persist source items in durable work slices; keep the 1,000-item fast path, but flush a smaller atomic slice when the time/control budget requires it.
Claim a bounded submission window from pending items.
Build or replay request JSONL rows.
Persist request artifacts before upload.
Atomically reserve quota-scoped capacity and persist a submission intent before remote creation, then atomically register the returned batch with its item linkage.
Poll active batches.
Download output/error files.
Parse terminal item results back into the state store.
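
The intent-before-create step in the list above might be sketched as follows. This is a minimal illustration under stated assumptions, not the runtime's implementation: `IntentStore`, `submit`, and the state strings are hypothetical, an in-memory dictionary stands in for durable storage, and a `TimeoutError` stands in for any ambiguous create outcome.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IntentStore:
    """In-memory stand-in for a durable state store (hypothetical)."""

    intents: dict[str, str] = field(default_factory=dict)  # intent_id -> state
    batches: dict[str, list[str]] = field(default_factory=dict)  # batch_id -> item ids

    def record_intent(self, intent_id: str) -> None:
        self.intents[intent_id] = "pending"

    def register_batch(self, intent_id: str, batch_id: str, item_ids: list[str]) -> None:
        # Link the batch to its items and close the intent in one step.
        self.batches[batch_id] = list(item_ids)
        self.intents[intent_id] = "registered"

    def mark_ambiguous(self, intent_id: str) -> None:
        self.intents[intent_id] = "ambiguous"


def submit(
    store: IntentStore,
    intent_id: str,
    item_ids: list[str],
    create_remote: Callable[[], str],
) -> str | None:
    store.record_intent(intent_id)  # written before any remote call
    try:
        batch_id = create_remote()
    except TimeoutError:
        # Outcome unknown: do not resubmit; leave the intent for resolution.
        store.mark_ambiguous(intent_id)
        return None
    store.register_batch(intent_id, batch_id, item_ids)
    return batch_id


def timed_out() -> str:
    raise TimeoutError("create call did not return")


store = IntentStore()
first = submit(store, "intent-1", ["a", "b"], lambda: "batch_1")
second = submit(store, "intent-2", ["c"], timed_out)
```

Because the intent exists before the remote call, a crash or timeout leaves a record that something may have been created, rather than nothing at all.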

Ingestion polling stays inside the ingestion module behind its existing poll callback. Provider cadence and work-slice clocks are separate: fast sources retain normal batch granularity, while slow sources reach durable cooperative boundaries on a time budget. After each ingestion poll, control state and retry backoff are re-read before submission or further checkpointed-source materialization. Submission backoff blocks materialization/submission but never active-batch reconciliation.

When a caller uses Run.wait(), the runtime repeats that poll-and-submit pass until the run is terminal. A pass that changes durable work state, such as consuming completed batches or submitting more items, immediately triggers the next pass rather than sleeping for the configured poll interval.
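
The wait behavior above can be sketched as a loop that sleeps only after a pass that made no durable progress. The function and parameter names here are hypothetical, and the callbacks stand in for the real poll-and-submit pass.

```python
from typing import Callable


def wait_until_terminal(
    advance: Callable[[], bool],
    is_terminal: Callable[[], bool],
    sleep: Callable[[float], None],
    poll_interval: float,
) -> int:
    """Run passes until the run is terminal; return how often the loop slept."""
    sleeps = 0
    while not is_terminal():
        made_progress = advance()
        if made_progress:
            continue  # durable state changed: start the next pass at once
        sleep(poll_interval)
        sleeps += 1
    return sleeps


# Simulated run: three passes, of which only the second makes no progress.
outcomes = iter([True, False, True])
state = {"passes": 0}
naps: list[float] = []


def advance() -> bool:
    state["passes"] += 1
    return next(outcomes)


def is_terminal() -> bool:
    return state["passes"] >= 3


sleep_count = wait_until_terminal(advance, is_terminal, naps.append, 30.0)
```

Injecting `sleep` keeps the sketch testable without real delays.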

Execution Sequence

sequenceDiagram
    participant User
    participant BatchRunner
    participant StateStore
    participant ItemSource
    participant ArtifactStore
    participant Provider as BatchProvider

    User->>BatchRunner: start(job) / run_and_wait(job)
    BatchRunner->>StateStore: create_run(run_id, config)
    BatchRunner->>ItemSource: iterate items
    ItemSource-->>BatchRunner: BatchItem stream
    BatchRunner->>StateStore: append_items(materialized_items)
    BatchRunner->>StateStore: set_ingest_checkpoint()

    opt active batch exists and monotonic poll cadence is due
        BatchRunner->>Provider: get_batch(batch_id)
        Provider-->>BatchRunner: BatchRemoteRecord
        BatchRunner->>StateStore: consume terminal batch state
    end

    BatchRunner->>StateStore: claim pending items within freed capacity

    loop refresh() cycle (polling + submission)
        BatchRunner->>StateStore: get_active_batches()
        StateStore-->>BatchRunner: [ActiveBatchRecord...]

        opt active batches exist
            BatchRunner->>Provider: get_batch(batch_id)
            Provider-->>BatchRunner: BatchRemoteRecord

            alt status == "completed"
                BatchRunner->>Provider: download_file_content(output_file_id, error_file_id)
                Provider-->>BatchRunner: output/error JSONL
                BatchRunner->>ArtifactStore: write_text(output/error artifacts)
                BatchRunner->>Provider: parse_batch_output()
                Provider-->>BatchRunner: successes + errors
                BatchRunner->>StateStore: mark_items_completed(successes)
                BatchRunner->>StateStore: mark_items_failed(errors)
                opt row-level insufficient quota
                    BatchRunner->>StateStore: record_batch_retry_failure(row_insufficient_quota)
                end
            else status in {failed, cancelled, expired}
                BatchRunner->>Provider: download_file_content(output_file_id, error_file_id)
                Provider-->>BatchRunner: output/error JSONL
                BatchRunner->>ArtifactStore: write_text(output/error artifacts)
                BatchRunner->>StateStore: record_batch_retry_failure()
                BatchRunner->>StateStore: reset_batch_items_to_pending()
            end

            opt control-plane or batch-level OpenAI insufficient quota detected
                BatchRunner->>StateStore: set_run_control_state(paused, control_reason)
                BatchRunner-->>User: RunPausedError / run_auto_paused event
            end
        end

        BatchRunner->>StateStore: claim_items_for_submission(limit)
        StateStore-->>BatchRunner: [ClaimedItem...]
        BatchRunner->>ArtifactStore: write_text(request JSONL artifact)
        BatchRunner->>Provider: upload_input_file(local_path)
        Provider-->>BatchRunner: remote_file_id
        BatchRunner->>Provider: create_batch(remote_file_id)
        Provider-->>BatchRunner: BatchRemoteRecord (status=submitted/validating)
        BatchRunner->>StateStore: register_batch()
        BatchRunner->>StateStore: mark_items_submitted()
    end

    BatchRunner-->>User: Run (terminal)
    User->>Run: results()
    Run->>StateStore: get_item_records()
    StateStore-->>Run: [PersistedItemRecord...]
    Run-->>User: [BatchResultItem...]

Batch Lifecycle

Item state machine

Each item in a run transitions through the following statuses:

stateDiagram-v2
    [*] --> PENDING : item ingested from source

    PENDING --> QUEUED_LOCAL : claimed for submission cycle
    QUEUED_LOCAL --> PENDING : released — batch submission failed or cycle ended early
    QUEUED_LOCAL --> FAILED_PERMANENT : rejected pre-submission (e.g. token budget exceeded, max attempts)
    QUEUED_LOCAL --> SUBMITTED : batch created and registered with provider

    SUBMITTED --> COMPLETED : batch completed, result parsed OK — attempt consumed
    SUBMITTED --> FAILED_RETRYABLE : batch error / item error — attempt count below max
    SUBMITTED --> FAILED_PERMANENT : batch error / item error — attempt count reached max

    FAILED_RETRYABLE --> PENDING : re-queued after backoff delay

    COMPLETED --> [*]
    FAILED_PERMANENT --> [*]
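
The item state machine above can also be written as a transition table. The sketch below mirrors the diagram's edges; the enum's string values and the `transition` helper are assumptions for illustration and may not match the package's own definitions.

```python
from enum import Enum


class ItemStatus(Enum):
    # String values are assumed for illustration.
    PENDING = "pending"
    QUEUED_LOCAL = "queued_local"
    SUBMITTED = "submitted"
    COMPLETED = "completed"
    FAILED_RETRYABLE = "failed_retryable"
    FAILED_PERMANENT = "failed_permanent"


# One entry per state; the targets are the arrows in the diagram.
ALLOWED_TRANSITIONS: dict[ItemStatus, set[ItemStatus]] = {
    ItemStatus.PENDING: {ItemStatus.QUEUED_LOCAL},
    ItemStatus.QUEUED_LOCAL: {
        ItemStatus.PENDING,
        ItemStatus.FAILED_PERMANENT,
        ItemStatus.SUBMITTED,
    },
    ItemStatus.SUBMITTED: {
        ItemStatus.COMPLETED,
        ItemStatus.FAILED_RETRYABLE,
        ItemStatus.FAILED_PERMANENT,
    },
    ItemStatus.FAILED_RETRYABLE: {ItemStatus.PENDING},
    ItemStatus.COMPLETED: set(),
    ItemStatus.FAILED_PERMANENT: set(),
}


def transition(current: ItemStatus, target: ItemStatus) -> ItemStatus:
    """Return the target status, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def is_terminal(status: ItemStatus) -> bool:
    # Terminal states are the ones with no outgoing edges.
    return not ALLOWED_TRANSITIONS[status]
```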

Run lifecycle

A run has two orthogonal state axes: lifecycle status (progress toward completion) and control state (operator signal).

stateDiagram-v2
    direction LR

    state "Lifecycle Status" as ls {
        [*] --> RUNNING : run created
        RUNNING --> COMPLETED : all items completed successfully
        RUNNING --> COMPLETED_WITH_FAILURES : run finished with ≥1 FAILED_PERMANENT item
    }

    state "Control State" as cs {
        [*] --> RUNNING2 : run created
        RUNNING2 --> PAUSED : pause_run() or quota auto-pause
        PAUSED --> RUNNING2 : resume_run() called
        RUNNING2 --> CANCEL_REQUESTED : cancel_run() called

        note right of CANCEL_REQUESTED
            Not reversible.
            Runner drains in-flight batches;
            lifecycle then becomes terminal.
        end note
    }

Detailed storage, resume, and artifact-retention semantics live in STORAGE_AND_RUNS.md.

Paused runs may include a durable control_reason. Manual pauses use "manual"; control-plane or batch-level OpenAI quota/billing exhaustion uses "openai_insufficient_quota" and preserves retryable work for later resume. Row-level quota records inside completed batch output remain item-level retryable failures and use retry backoff without pausing the run. Quota auto-pause never replaces cancel_requested, and non-checkpointed finite iterables continue materializing after auto-pause so unpersisted input rows are not lost.
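
The control-state rules above can be sketched as three small functions. `RunControl` and the helpers are hypothetical; only the state names and the two `control_reason` values come from the text. Whether a paused run can move straight to `cancel_requested`, and what happens to the reason on cancel, are assumptions made here for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RunControl:
    state: str = "running"  # "running", "paused", or "cancel_requested"
    reason: str | None = None


def pause(control: RunControl, reason: str = "manual") -> RunControl:
    if control.state == "cancel_requested":
        # Cancellation is not reversible, and a pause never replaces it.
        return control
    return RunControl("paused", reason)


def resume(control: RunControl) -> RunControl:
    if control.state == "paused":
        return RunControl("running")
    return control


def request_cancel(control: RunControl) -> RunControl:
    # Assumption: cancel is accepted from any state and clears the reason.
    return RunControl("cancel_requested")


auto_paused = pause(RunControl(), "openai_insufficient_quota")
resumed = resume(auto_paused)
still_cancelling = pause(request_cancel(RunControl()), "openai_insufficient_quota")
```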

Module boundaries

`core/`

Owns domain types and public configuration models:

BatchItem
BatchJob
PromptParts
RunSummary
RunSnapshot
provider and storage enums
provider config types such as OpenAIProviderConfig
retry, chunk, artifact, and terminal-result models

core/ should stay mostly declarative. It describes what a run is, not how the runtime executes it.

`providers/`

Owns provider-facing abstractions and implementations:

base provider contract
provider registry
OpenAI Batch implementation
Anthropic Message Batches implementation
Gemini Batch implementation (text-only)

The provider layer is responsible for:

building provider request rows
uploading input files
creating batches
polling batches
downloading provider files
normalizing provider output records
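
As an illustration of the last responsibility, normalizing output records, the sketch below splits a batch output file into successes and errors keyed by the per-item identifier. The row shape is a simplified assumption modeled on OpenAI-style batch output (`custom_id`, `response.status_code`, `response.body`, `error`), and `partition_output` is a hypothetical name, not a provider method.

```python
import json
from typing import Any


def partition_output(
    jsonl_text: str,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split output rows into successes and errors, keyed by custom_id."""
    successes: dict[str, Any] = {}
    errors: dict[str, Any] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        error = record.get("error")
        response = record.get("response") or {}
        if error is None and response.get("status_code") == 200:
            successes[custom_id] = response.get("body", {})
        else:
            errors[custom_id] = error or {"status_code": response.get("status_code")}
    return successes, errors


sample = "\n".join(
    [
        json.dumps(
            {
                "custom_id": "a",
                "response": {"status_code": 200, "body": {"ok": True}},
                "error": None,
            }
        ),
        json.dumps(
            {
                "custom_id": "b",
                "response": None,
                "error": {"code": "insufficient_quota"},
            }
        ),
    ]
)
ok_rows, failed_rows = partition_output(sample)
```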

Durable artifact writing is not owned by the provider layer. The runner persists artifacts and hands staged local files to the provider.

`runtime/`

Owns execution behavior:

BatchRunner
Run
Typer CLI entrypoint for operator workflows
persisted run control state with pause, resume, and drain-style cancel
automatic OpenAI control-plane/batch-level insufficient-quota pause with durable control_reason
item-level insufficient-quota retries with backoff and no attempt consumption
optional observer callback for provider lifecycle events
token estimation and request chunking
bounded pending-item claim windows before submission
durable request-artifact replay for retry/resume
per-refresh request-artifact file caching during replay
artifact-store staging/export/delete orchestration
incremental terminal-result paging/export for already-terminal items
resumable deterministic-source checkpoints
fresh-process recovery of queued_local items back to pending submission
bounded concurrent provider polling and file download handling
explicit terminal-run artifact pruning
explicit raw-artifact export before raw-artifact pruning
retry helpers
response validation and structured-output parsing
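
Token estimation and request chunking from the list above could look roughly like the following. Both functions are hypothetical: the four-characters-per-token heuristic is an assumption rather than the estimator the runtime uses, and over-budget requests are returned as rejected to mirror the pre-submission rejection shown in the item state machine.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about four characters per token.
    return max(1, len(text) // 4)


def chunk_requests(
    prompts: list[str], max_tokens: int, max_items: int
) -> tuple[list[list[str]], list[str]]:
    """Group prompts into chunks within both limits; return (chunks, rejected)."""
    chunks: list[list[str]] = []
    rejected: list[str] = []
    current: list[str] = []
    used = 0
    for prompt in prompts:
        cost = estimate_tokens(prompt)
        if cost > max_tokens:
            rejected.append(prompt)  # can never fit in any chunk
            continue
        if current and (used + cost > max_tokens or len(current) >= max_items):
            chunks.append(current)
            current, used = [], 0
        current.append(prompt)
        used += cost
    if current:
        chunks.append(current)
    return chunks, rejected


prompts = ["a" * 40, "b" * 40, "c" * 40, "d" * 400]
by_tokens, too_large = chunk_requests(prompts, max_tokens=20, max_items=10)
by_count, _ = chunk_requests(prompts[:3], max_tokens=1000, max_items=1)
```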

This is where the durable lifecycle lives. It bridges the domain models, providers, storage, and artifact store.

The public facade is still BatchRunner plus Run. A deep internal RunExecutor module owns execution attachment and the ordered advance cycle, returning a structured outcome that reports durable progress. The supporting implementation under runtime/ remains split by concern:

execution.py: claim recovery on execution attachment, source-availability checks, and ordered ingest/poll/cancel/submit advancement
context.py: persisted config building, resume config comparison, output-model resolution, and provider-backed RunContext creation
ingestion.py: checkpoint-aware item materialization, ingest checkpoint updates, and resume-aware ingestion flow
submission.py: pending-item claiming, request replay/build, token-budget gating, request artifact persistence, and batch submission
polling.py: refresh orchestration, active-batch polling, terminal batch consumption, and batch-failure reset/backoff handling
artifacts.py and results.py: request replay helpers, raw artifact writes/deletes, public result mapping, and JSONL result serialization

`sources/`

Owns streaming input adapters:

ItemSource
CheckpointedItemSource
CompositeItemSource
CsvItemSource
JsonlItemSource
ParquetItemSource

The built-in file sources support durable resume through a source fingerprint plus an ingest checkpoint stored in the control plane. CompositeItemSource composes explicit checkpointed sources into one logical source without moving source discovery or partition ordering into the runner.
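
The fingerprint-plus-checkpoint idea can be sketched for a JSONL file as follows. This is only an illustration: hashing the whole file with SHA-256 and storing a line index are assumptions, and `fingerprint` and `iter_from_checkpoint` are hypothetical names rather than the built-in sources' methods.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def fingerprint(path: Path) -> str:
    """Hash the file contents; a stand-in for a stable source fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def iter_from_checkpoint(
    path: Path, expected_fingerprint: str, next_index: int
) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (line_index, row) pairs starting at the checkpointed index."""
    if fingerprint(path) != expected_fingerprint:
        raise ValueError("source changed since the checkpoint was written")
    with path.open(encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index < next_index or not line.strip():
                continue  # already ingested, or a blank line
            yield index, json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "items.jsonl"
    source.write_text('{"id": 1}\n{"id": 2}\n{"id": 3}\n', encoding="utf-8")
    saved_fingerprint = fingerprint(source)
    # Pretend a previous process ingested the first two rows before stopping.
    resumed_rows = list(iter_from_checkpoint(source, saved_fingerprint, next_index=2))
```

Checking the fingerprint first is what makes the stored index meaningful: an index into a file that has since changed would silently skip or repeat rows.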

`storage/`

Owns the durable and ephemeral state backends:

StateStore
SQLite implementation
Postgres implementation
in-memory implementation
storage registry

The storage layer persists:

run config
run control state
run control reason
item state and attempts
active batch metadata
ingest checkpoints
parsed terminal outputs
terminal result sequence metadata
pointers to durable artifacts

`artifacts/`

Owns the payload plane for large durable files:

request JSONL used for submission/replay
raw output JSONL downloaded from the provider
raw error JSONL downloaded from the provider

The built-in implementation is LocalArtifactStore.

`cli.py`

Owns a deliberately narrow operator interface over the runtime:

one or more explicit CSV and JSONL inputs only
SQLite only
JSON summaries
explicit status, wait, results, export, and prune commands

New orchestration behavior should not be introduced in the CLI first; the CLI stays a thin layer over the runtime by design.

Why storage and artifacts are split

This is one of the most important design choices in the repo.

The state store holds indexed, queryable control-plane data. The artifact store holds larger opaque files that must survive retry/resume/export workflows.

That split gives batchor:

resumable request replay
smaller durable control-plane rows
clearer retention rules
backend flexibility for future non-local artifact stores
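
A small sketch of the split: the control-plane row holds only an artifact key, and the artifact store resolves that key against its own root. `SketchArtifactStore` and the row's field names are hypothetical and are not the `LocalArtifactStore` API.

```python
import tempfile
from pathlib import Path, PurePosixPath


class SketchArtifactStore:
    """Hypothetical key-addressed store rooted at a local directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _resolve(self, key: str) -> Path:
        relative = PurePosixPath(key)
        if relative.is_absolute() or ".." in relative.parts:
            raise ValueError(f"artifact key must be a relative path: {key!r}")
        return self.root.joinpath(*relative.parts)

    def write_text(self, key: str, text: str) -> None:
        path = self._resolve(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    def read_text(self, key: str) -> str:
        return self._resolve(key).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    artifacts = SketchArtifactStore(Path(tmp))
    # The state-store row keeps a small key, never the payload or a full path.
    item_row = {"item_id": "a", "request_artifact_key": "runs/r1/requests/0001.jsonl"}
    artifacts.write_text(item_row["request_artifact_key"], '{"custom_id": "a"}\n')
    replayed = artifacts.read_text(item_row["request_artifact_key"])
```

Because the row stores a key rather than a path, the same row stays valid if the artifact root moves or a non-local backend is swapped in.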

Current invariants

Public execution is run-oriented: start(), get_run(), run_and_wait().
OpenAI Batch, Anthropic Message Batches, and text-only Gemini Batch are built-in Python and CLI providers; OpenAI remains the CLI default.
SQLite is the default durable backend.
Postgres is an opt-in durable backend for shared control-plane state.
Structured outputs require a module-level Pydantic v2 model for rehydration.
Run.results() is terminal-only.
Run.refresh() is explicit; summary properties do not implicitly hit the provider.
Durable artifacts flow through the ArtifactStore contract; the built-in implementation is LocalArtifactStore.
Stored item rows keep artifact keys, not absolute filesystem paths.
Fresh-process resume requeues queued_local items before attempting submission again.
Run.prune_artifacts() is explicit and terminal-only; it is not automatic garbage collection.
File-backed source resume requires a caller-supplied run_id plus a stable source fingerprint.
Raw output/error artifacts persist by default and require export before raw-artifact pruning.
A terminal run may be either completed or completed_with_failures; both statuses allow artifact export/prune and final result access.
Provider secrets may exist in in-memory config objects, but durable storage persists public provider config only.
CLI .env loading is a CLI-only convenience and not part of library runtime behavior.
Run lifecycle status and run control state are separate; pause/cancel do not redefine terminal lifecycle semantics.
Incremental terminal-result reads are sequence-based and only return items that have already reached a terminal item state.
Built-in deterministic-source resume currently covers CSV, JSONL, and Parquet; arbitrary iterables are not made durable automatically.
An incomplete ingest checkpoint prevents terminal lifecycle status. A fresh process must reattach the source with start(job, run_id=...) before execution can advance, unless cancellation intentionally abandons the unmaterialized source tail and finalizes the checkpoint.
BatchItem.metadata["batchor_lineage"] is reserved for lightweight source/join metadata when provided by built-in adapters or callers.
Provider-terminal status and local result consumption are distinct: a terminal batch remains locally active while submitted items or an unreleased capacity reservation still require replay.
Remote batch creation is preceded by a durable submission intent. An ambiguous create outcome requires explicit operator resolution as either the known remote batch or verified-not-created; the runtime never guesses and resubmits.
OpenAI enqueue capacity is reserved atomically across runs by a stable provider/model/quota scope without persisting credential-derived identities.
wait(timeout=...) is cooperative: the deadline bounds sleeps and prevents new library-controlled work, while an already-running user callback or provider SDK call is governed by its own timeout.
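
The sequence-based incremental reads named in the invariants above can be pictured as cursor paging over terminal records. The record shape, the `terminal_sequence` field, and `read_terminal_page` are assumptions for illustration and do not describe the stored schema.

```python
from __future__ import annotations

from typing import Any


def read_terminal_page(
    records: list[dict[str, Any]], after_sequence: int, limit: int
) -> tuple[list[dict[str, Any]], int]:
    """Return up to `limit` terminal records past the cursor, plus the new cursor."""
    eligible = [
        record
        for record in records
        # Non-terminal items have no sequence yet and are never returned.
        if record["terminal_sequence"] is not None
        and record["terminal_sequence"] > after_sequence
    ]
    page = sorted(eligible, key=lambda record: record["terminal_sequence"])[:limit]
    cursor = page[-1]["terminal_sequence"] if page else after_sequence
    return page, cursor


records = [
    {"item_id": "a", "terminal_sequence": 1},
    {"item_id": "b", "terminal_sequence": None},  # still in flight
    {"item_id": "c", "terminal_sequence": 2},
]
first_page, cursor = read_terminal_page(records, after_sequence=0, limit=10)
second_page, cursor_after = read_terminal_page(records, after_sequence=cursor, limit=10)
```

A caller that stores the cursor can poll repeatedly and see each terminal item once, in the order it became terminal.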

Extension seams

The code is shaped for future providers and backends, but within explicit boundaries:

provider config serialization goes through the provider registry
storage creation goes through the storage registry
runtime code works in terms of provider/store contracts instead of direct OpenAI/SQLite branches
durable request replay is provider-agnostic at the runner/store boundary and materializes through the artifact-store contract
deterministic-source resume uses source-specific checkpoints and currently supports the built-in CSV, JSONL, and Parquet sources
provider observability hooks are callback-based and currently emit coarse lifecycle events from the runner
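
The registry seams listed above follow a familiar pattern: callers ask a registry for an implementation by kind and work against the abstract contract. The sketch below shows that pattern in general terms; `Store`, `MemoryStore`, and `Registry` are hypothetical and are not the package's `StateStore` or `StorageRegistry`.

```python
from abc import ABC, abstractmethod
from typing import Callable


class Store(ABC):
    """Hypothetical minimal contract that calling code depends on."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> str: ...


class MemoryStore(Store):
    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


class Registry:
    """Maps a kind name to a factory so callers never branch on backend type."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Store]] = {}

    def register(self, kind: str, factory: Callable[[], Store]) -> None:
        if kind in self._factories:
            raise ValueError(f"store kind already registered: {kind!r}")
        self._factories[kind] = factory

    def create(self, kind: str) -> Store:
        try:
            factory = self._factories[kind]
        except KeyError:
            raise LookupError(f"unknown store kind: {kind!r}") from None
        return factory()


registry = Registry()
registry.register("memory", MemoryStore)
store_instance = registry.create("memory")
store_instance.put("run-1", "running")
```

Adding a backend then means registering one more factory, with no change to the code that calls `create`.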

Current gaps

Gemini request construction is text-only
the only built-in artifact backend is local filesystem storage
arbitrary non-checkpointable iterables do not support mid-ingest crash recovery
the CLI does not expose the full Python API surface

TBD

multi-provider capability matrix doc
remote/object-store artifact backend
provider-side remote cancellation