Boundary And Philosophy¶

Status: current implementation direction as of 2026-04-03.

This document defines the concrete boundary between batchor, its durable storage, and the user's pipeline.

Core Philosophy¶

batchor is a durable batch execution library.

It is not:

a full workflow orchestrator
a general-purpose data warehouse
the user's source-of-truth dataset store
the user's downstream business-side effect engine

The library should stay narrow and reliable: accept work items, persist enough state to resume safely, execute provider batches, and surface typed results plus lineage back to the caller.

Boundary¶

What `batchor` owns¶

durable run and item lifecycle state
provider request construction and batch submission
retry classification and attempt tracking
durable replayable request persistence once a request artifact has been prepared
explicit terminal cleanup of replay artifacts
result parsing, validation, and run rehydration
lightweight lineage metadata needed to join results back to user systems
deterministic mid-ingest resume for sources that provide a stable identity and durable checkpoint

What the user pipeline owns¶

discovering or partitioning source data
selecting and ordering explicit files or partitions before handing them to batchor
transforming domain rows into BatchItems
business-specific decisions about what to run and when
applying terminal results back into application tables, files, or APIs
long-term storage of business data outside batchor's execution window

Durability Boundary¶

Durability begins when an item is accepted into a Run.

In the current implementation, batchor may still hold inline prompt data before the first request artifact is written. Once a request artifact exists, that artifact becomes the replay source for later retries and resumes.

For SQLite-backed runs:

SQLite is the control-plane ledger
request/output JSONL files are durable artifacts beside the database
item rows store artifact pointers, not the full long-term replay payload

This means a fresh worker can resume a run from the same SQLite database plus sibling artifact directory without rebuilding prompts from the original input source after the request artifact has been created.

Non-Goals¶

requiring users to adopt Temporal, Airflow, or any specific orchestrator
rebuilding the entire upstream source-generation workflow during resume
persisting arbitrarily large source payloads inline in SQLite forever
hiding all pipeline boundaries behind one monolithic library API

Current Implementation Notes¶

SQLite-backed runs persist per-item request artifact path, line number, and request hash.
Once that pointer exists, batchor can prune large inline request-building fields from SQLite.
Built-in CSV and JSONL sources persist ingest checkpoints so start(job, run_id=...) can resume from the last durable source position before request-artifact replay takes over.
Built-in deterministic sources now include CSV, JSONL, Parquet, and CompositeItemSource for explicit ordered composition of checkpointed sources.
batchor still does not discover files or partitions on the caller's behalf; users choose the inputs and their order, and batchor treats that ordered list as one logical source identity.
CheckpointedItemSource is the extension seam for custom deterministic adapters; batchor persists opaque checkpoints but does not make arbitrary iterables durable by itself.
Once the run is terminal, batchor exposes explicit artifact pruning so users can reclaim replay storage without losing terminal results.
Raw output/error artifacts are retained until users explicitly export them, then prune them.
Runs may opt out of raw output/error artifact persistence through ArtifactPolicy, while request artifacts remain mandatory replay state.
Artifact storage is local filesystem only today.
Mid-ingest crash recovery before the first request artifact exists is implemented only for deterministic checkpointable sources. Other item iterables are still TBD.