Guide · 19 June 2026 · ~7 min read
How to reduce your Claude & GPT API costs (without degrading output)
If your LLM bill is climbing faster than your usage, the problem usually isn't your users — it's your prompts. Specifically, the system promptyou resend on every single API call. This guide walks through the three levers that actually move the number: auditing system-prompt waste, using prompt caching, and compressing safely without the "compression illusion."
1. Understand where the money actually goes
Every chat completion bills you for input tokens and output tokens. It's tempting to blame the user's message, but in most production setups the user turn is small and the system prompt is large and fixed. You pay for that fixed block on every call. If your system prompt is 1,500 tokens and you make 200 calls a day, you're paying for 9 million system-prompt tokens a month before a single user has typed anything.
So the first move is to measure your system prompt in tokens, not characters, and multiply by your real call volume. A 30% reduction on a prompt you send 6,000 times a month is a very different number than 30% on a one-off.
2. Audit the waste in your system prompt
The most common sources of waste, in rough order of how often we see them:
- Duplicated instructions.The same rule written twice. Models do not follow a rule "harder" because you repeated it — you just pay for it twice.
- Persona padding."You are a helpful, professional, friendly assistant who always strives to…" A short role line is enough; the rest is tokens the model effectively ignores.
- Politeness and filler."Please make sure that you always…", "It is very important that…", "in order to", "kindly". These rarely change behavior.
- Restated output formats. Telling the model to return JSON three times. Once, clearly, is enough.
- Oversized few-shot examples.Examples are powerful, but they're often the single largest movable cost. This is the one to trim carefully — see section 4.
Cutting the first four categories is essentially lossless: you remove tokens without changing what the model does. This alone often reclaims 15–40% of a bloated system prompt.
3. Use prompt caching — the ~10x discount almost nobody uses
This is the highest-leverage change and the most overlooked. Both Anthropic and OpenAI let you cache a stable prefixof your prompt. When a later request reuses that prefix, you're billed a fraction of the normal input price for those tokens. On Anthropic, cached reads are roughly 10% of input cost— effectively a 10x discount on the part of your prompt that never changes. OpenAI's cached input is typically 25–50% of normal.
How to make your prompt cacheable
Caching works on a contiguous prefix, so the structure has to cooperate: put everything static first (role, rules, tools, reference context), and everything dynamic last(the user's turn, per-request variables). A surprising number of prompts are accidentally cache-hostile because they interleave a dynamic value in the middle of otherwise static instructions — which breaks the cacheable prefix at that point.
Once your static block is ≥1,024 tokens (the typical minimum for caching to engage), set a cache breakpoint right before the dynamic tail. For a 1,500-token system prompt sent thousands of times a month, this is usually a bigger saving than any amount of word-level trimming.
4. Compress safely — beware the "compression illusion"
Here's where most prompt-optimization advice goes wrong. It's easy to shrink a prompt 50% and feel great — until output quality quietly drops on the inputs you didn't test. We call this the compression illusion: the leaner prompt looks fine on the happy path and fails on edge cases.
The safe rule is to separate two kinds of cuts:
- Mechanical / lossless cuts — filler, duplicates, whitespace, redundant politeness. Apply these freely.
- Content cuts— examples, constraints, edge-case handling, format specs. These carry real risk. Don't remove them blind; verify first.
Verify with a real output delta
The honest way to compress aggressively is to run your original and compressed prompts on the same realistic inputs and compare the outputs. If they match, ship the cut. If they drift, you've just caught a regression before it hit production. Never trust a token reduction number on its own — trust it alongside a measured quality delta.
5. Pick the right model per task
Not every call needs your most expensive model. Routine extraction, classification, and formatting often run fine on a cheaper, faster model (GPT-4o mini, Claude Haiku) at a fraction of the cost, while hard reasoning stays on a frontier model. Routing by task is a structural saving that compounds with the prompt-level wins above.
A quick checklist
- Measure the system prompt in tokens × monthly call volume.
- Remove duplicates, persona padding, and filler (lossless).
- Restructure static-first / dynamic-last and set a cache breakpoint.
- Trim examples only after verifying the output delta.
- Route cheap tasks to cheaper models.
Do it automatically
That's exactly what we built Token-Trim to do. Paste a prompt and it counts tokens (exact for OpenAI o200k models, approximate for Claude), flags each piece of waste with the tokens saved, finds your caching breakpoint, produces a safe rewrite, and scores the quality risk of every cut — with a Pro option that runs the real original-vs- compressed output comparison for you.
Run a free audit on your prompt → (3 free audits a month, no signup.)