LLMs Are Cheap Enough to Change How You Work With Data
We talk a lot about whether AI is smart enough to take our jobs. But I think we’re missing half the equation: it’s not just about being smart, it’s also about being cheap. Last month I backfilled roughly 60,000 historical job postings on selectfrom.work using an LLM. Each request extracted a summary, salary information, technologies mentioned, and job type. Total cost? $10.

Sure, there are some mistakes here and there. But for me this enables a type of analysis at a scale that was completely out of reach before. I don’t think an actual job was taken here, but there’s definitely a competitive advantage to be had when you can process tens of thousands of documents for the price of two coffees.
In this post I’ll walk you through how to set up batch processing with OpenAI’s API, the orchestration tricks that keep costs down, and some of the gotchas I ran into along the way. We’ll cover:
- OpenAI Batch vs Flex processing
- Why batch mode is your best friend for price efficiency
- How to structure your batch requests
- Orchestrating the process without losing your mind
- Tracking batches in a database
OpenAI Batch vs Flex processing
OpenAI offers two discounted processing modes: Batch and Flex. Batch processing runs requests within a 24-hour window (often much faster), making it ideal for bulk workloads where timing isn’t critical. Flex applies similar cost savings to individual requests, but they may complete more slowly than standard real-time calls. Flex is still in beta, and I’ve experienced day-long outages.
In practice, I’ve found Batch more predictable for backfills: you can submit small batches (often faster to validate/process), monitor status, and retry failures without blocking the rest of your pipeline.
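For reference, a one-off Flex request with the Python SDK looks roughly like the sketch below. The model name and prompt are placeholders, and Flex is selected via the service_tier parameter at the time of writing, which may change while the feature is in beta.

from openai import OpenAI

client = OpenAI()

# One-off request at the Flex (discounted, slower) tier.
# Assumes a Flex-eligible model; swap in whatever you actually use.
response = client.responses.create(
    model="gpt-5-nano",
    service_tier="flex",
    input="Summarize this job posting in one sentence: ...",
    max_output_tokens=200,
)

print(response.output_text)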
Batch Mode: Half the Price, Most of the Speed
If you want price-efficient LLMs with convenience and scale, batch mode is where it’s at. OpenAI’s batch API processes your requests at half the price within a 24-hour window—but in practice, batches often complete in 10-15 minutes. A batch can be as small as you want, though there’s a limit on how many tokens you can queue at once. There’s a pretty good guide from OpenAI that I recommend you look at, but I’ll share some additional thoughts and tricks.
When it comes to cost, there are two levers: input tokens and output tokens. For long documents (like job postings) you often can’t shrink inputs much beyond keeping the prompt concise. Output tokens are where costs can explode—especially if you enable heavy “reasoning/thinking” on models that support it. For extraction and summarization, set reasoning effort to minimal (or disable it) and cap output tokens aggressively.
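To make that concrete, here’s a back-of-the-envelope estimate you can run before submitting anything. The per-token prices below are placeholders, not current OpenAI pricing; plug in the numbers from the pricing page and token counts measured on a handful of your own documents.

# Rough cost estimate for a backfill. Prices are HYPOTHETICAL placeholders
# (USD per 1M tokens); check the official pricing page for real numbers.
INPUT_PRICE_PER_1M = 0.05
OUTPUT_PRICE_PER_1M = 0.40
BATCH_DISCOUNT = 0.5  # Batch API is half price

def estimate_cost(num_docs: int, input_tokens_per_doc: int, output_tokens_per_doc: int) -> float:
    input_cost = num_docs * input_tokens_per_doc / 1_000_000 * INPUT_PRICE_PER_1M
    output_cost = num_docs * output_tokens_per_doc / 1_000_000 * OUTPUT_PRICE_PER_1M
    return (input_cost + output_cost) * BATCH_DISCOUNT

# e.g. 60,000 postings at ~1,000 input and ~250 output tokens each
print(f"${estimate_cost(60_000, 1_000, 250):.2f}")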
Here’s what a basic prompt template looks like for extracting structured data from job descriptions:
def get_prompt_template() -> str:
    # Keep this short: it is repeated for every request.
    return "\n".join([
        "Task: Produce a JSON object from the job description below.",
        "description_summary: <=100 words, one paragraph, include 1 differentiator and 1 risk",
        "Do not invent missing data. Extract salary exactly if present; else leave defaults.",
        "tools_and_technologies_list: only literal mentions.",
        "job_type: one of analyst, data_scientist, data_engineer, analytics_engineer, "
        "machine_learning_engineer, chief, other.",
        "is_intermediary: true only if an agency is indicated.",
        "Return ONLY this JSON structure (same key order):",
        "{",
        '  "description_summary": "",',
        '  "salary_range": { ... },',
        '  "tools_and_technologies_list": [],',
        '  "job_type": "",',
        '  "is_intermediary": false',
        "}",
        "Job Description:",
        "{job_description}",
    ])


def generate_prompt(*, doc_id: str, description: str, system: str) -> dict:
    prompt_template = get_prompt_template()
    # Use str.replace rather than str.format: the template contains literal
    # JSON braces that str.format would try to interpret as placeholders.
    formatted_prompt = prompt_template.replace("{job_description}", description)
    return {
        "id": doc_id,
        "system": system,  # keep this short and stable
        "prompt": formatted_prompt,
    }
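To tie this to the batching step below, here’s roughly how I assemble the list of request dicts. get_unprocessed_documents() is a stand-in for whatever query returns rows that don’t have a summary yet; only the shape of the returned jobs list matters.

SYSTEM_PROMPT = "You extract structured data from job postings. Respond with JSON only."

def build_jobs(limit: int = 200) -> list[dict]:
    # get_unprocessed_documents() is hypothetical: any query that yields
    # (id, description) pairs for rows not yet summarized will do.
    return [
        generate_prompt(doc_id=doc_id, description=description, system=SYSTEM_PROMPT)
        for doc_id, description in get_unprocessed_documents(limit=limit)
    ]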
The prompt is deliberately terse. Every token in your prompt is a token you’re paying for, and explicit instructions like “do not invent missing data” help keep the model focused on extraction rather than creative writing.
Creating Batch Input Files (JSONL)
OpenAI’s Batch API expects a JSONL file where each line is one HTTP request (custom_id, method, url, body). The current Batch guide uses the Responses API (/v1/responses), which supports setting the reasoning effort, though the Chat Completions API still works too (/v1/chat/completions):
import json
import os
from datetime import datetime, timezone

from openai import OpenAI


def build_batch_jsonl(*, jobs: list[dict], filepath: str, model: str) -> str:
    """
    jobs: [{"id": "...", "system": "...", "prompt": "..."}]
    Writes one JSON request per line in JSONL format.
    """
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, "w", encoding="utf-8") as f:
        for doc in jobs:
            line_object = {
                "custom_id": str(doc["id"]),  # crucial for joining responses back to source rows
                "method": "POST",
                "url": "/v1/responses",
                "body": {
                    "model": model,
                    "input": [
                        {"role": "system", "content": doc["system"]},
                        {"role": "user", "content": doc["prompt"]},
                    ],
                    # Keep output bounded; tune to your schema size.
                    "max_output_tokens": 600,
                    # If you’re using a reasoning-capable model, set minimal effort for extraction:
                    # "reasoning": {"effort": "low"},
                },
            }
            f.write(json.dumps(line_object, ensure_ascii=False) + "\n")
    return filepath
def create_batch(*, jobs: list[dict], description: str, model: str) -> object:
    client = OpenAI()
    filepath = build_batch_jsonl(
        jobs=jobs,
        filepath=f"tmp/batchinput-{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}.jsonl",
        model=model,
    )
    with open(filepath, "rb") as fh:
        batch_input_file = client.files.create(file=fh, purpose="batch")
    return client.batches.create(
        input_file_id=batch_input_file.id,
        endpoint="/v1/responses",
        completion_window="24h",
        metadata={"description": description},
    )
# Note: If you batch Chat Completions instead, your url/body must match /v1/chat/completions.
# The Batch API doesn’t “translate” between endpoints.
The custom_id field is crucial—it’s how you’ll match responses back to your original records. I use the job posting ID directly, which makes the downstream processing much simpler.
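Once a batch completes, you download its output file and join each line back to your source rows on custom_id. Below is a sketch of that step, assuming the Responses endpoint was used; the batch output is raw JSON, so I walk the nested output structure instead of relying on SDK conveniences.

import json

from openai import OpenAI


def fetch_batch_results(batch_id: str) -> dict[str, dict]:
    """Return {custom_id: parsed_json} for successful requests in a completed batch."""
    client = OpenAI()
    batch = client.batches.retrieve(batch_id)
    results: dict[str, dict] = {}
    if batch.status != "completed" or not batch.output_file_id:
        return results
    raw = client.files.content(batch.output_file_id).text
    for line in raw.splitlines():
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            continue  # collect these separately if you want to resubmit
        # For /v1/responses, the text lives in body.output[*].content[*].text
        text_parts = [
            content["text"]
            for item in response["body"].get("output", [])
            if item.get("type") == "message"
            for content in item.get("content", [])
            if content.get("type") == "output_text"
        ]
        try:
            results[record["custom_id"]] = json.loads("".join(text_parts))
        except json.JSONDecodeError:
            continue  # malformed model output; skip and retry later
    return results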
The Orchestration Dance (Polling, Token Limits, Graceful Exits)
Batch status typically moves through validating → in_progress → completed (or failed/expired/cancelled).
import logging
import os
import sys
from time import sleep

batch = add_new_job_summaries_batch(
    limit=200,
    model=os.getenv("MODEL_NAME", "gpt-5-nano"),
)

while batch and batch.status in ["validating"]:
    sleep(5)
    batch = retrieve_batch(batch.id)

    if batch.status == "failed":
        for error in batch.errors.data:
            if error.code == "token_limit_exceeded":
                logging.warning(f"Batch {batch.id} reached token limit, skipping")
                sys.exit(0)
            else:
                logging.error(f"Batch {batch.id} failed: {batch.errors.data}")
                sys.exit(1)
    elif batch.status == "validating":
        logging.info(f"Batch {batch.id} is still validating...")
The token_limit_exceeded error is your friend, not your enemy. It means you’ve hit your queue limit—just exit gracefully and let the next scheduled run pick up where you left off. If you run this hourly, you get decent throughput that automatically scales up as you spend more (OpenAI increases your limits based on usage).
Tracking State in a Database
When you’re processing tens of thousands of records, you need to know what’s been processed and what hasn’t. I track batches in a simple table:
def save_batch_to_db(batch, batch_size: int):
    new_batch = JobSummaryBatch(
        batch_id=batch.id,
        input_file_id=batch.input_file_id,
        status=batch.status,
        description=batch.metadata.get("description", ""),
        num_jobs=batch_size,
    )
    with next(get_session()) as session:
        session.add(new_batch)
        session.commit()
This pattern (query for unprocessed records, submit a batch, track the batch, track your own processing, repeat) is the core loop that lets you process arbitrary amounts of data over time without manual intervention.
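Put together, the scheduled entrypoint looks something like the sketch below. build_jobs, create_batch, fetch_batch_results, and save_batch_to_db are the pieces from earlier; get_open_batches_from_db, mark_documents_processed, and update_batch_status are hypothetical helpers on the tracking side.

def run_hourly():
    # 1. Finish any batches we've already submitted.
    for tracked in get_open_batches_from_db():  # hypothetical query on the tracking table
        results = fetch_batch_results(tracked.batch_id)
        if results:
            mark_documents_processed(results)  # hypothetical: write summaries back to source rows
            update_batch_status(tracked.batch_id, "completed")

    # 2. Submit a new batch for whatever is still unprocessed.
    jobs = build_jobs(limit=200)
    if jobs:
        batch = create_batch(jobs=jobs, description="job summary backfill", model="gpt-5-nano")
        save_batch_to_db(batch, batch_size=len(jobs))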
Tips for Maximum Cost Efficiency
After processing 60,000 job postings, here’s what I’ve learned:
Use the cheapest model that works. For extraction tasks, gpt-4.1-nano is unfortunately no longer available, so gpt-5-nano is usually your best option for summaries and text extraction. Save the expensive models for tasks that actually need reasoning. The pricing page is a good starting point. If you’re flexible, you can also spread work across multiple models to raise your effective queued-token ceiling, since the limit is mostly per model category (e.g. 4, 4o, 5).
Keep prompts tight. And your tokens tighter. Every word in your system prompt is repeated for every request; a 500-token prompt across 60,000 requests adds up fast. My input-to-output token ratio was roughly 10:2.5, and input and output can be priced differently depending on your model.
Try before you buy. Play around with a few different models on 10-20 documents to see the token counts, types of responses, hallucinations, etc.
Use flex mode for one-offs. When you need to process something outside your normal batch cycle, flex mode gives you the same 50% discount on individual requests—they’re just slightly slower and less reliable. Right now it’s nice for testing small things.
Handle failures gracefully. Some requests will fail. Some responses will be malformed. Build your pipeline to retry and move on rather than stopping at the first error.
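One lightweight way to do that, following the loop sketch above, is to only mark a row as processed when its response parsed cleanly; anything that failed or came back malformed simply shows up again in the next unprocessed query. A sketch of what the hypothetical mark_documents_processed helper might do:

def mark_documents_processed(results: dict[str, dict]) -> None:
    # Only rows whose payload parsed and passes a basic shape check get
    # written back; everything else stays "unprocessed" and is retried later.
    for doc_id, payload in results.items():
        if not isinstance(payload.get("description_summary"), str):
            logging.warning(f"Skipping {doc_id}: unexpected payload shape")
            continue
        save_summary(doc_id, payload)  # hypothetical write-back to the source table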
The real unlock here isn’t any single technique, it’s recognizing that LLMs are now cheap enough to use for bulk data processing. The same extraction task that would have required weeks of manual work or expensive specialized services can now run overnight for pocket change. That changes what’s possible and what you can achieve on your own.
