
The True Cost of AI: Optimizing OpenAI API Calls with Redis Caching

Most teams look at their OpenAI bill too late. By the time it hurts, you've already baked expensive patterns into your product. Here's how we use Redis caching to cut costs, improve latency, and make AI features sustainable at scale.

AI Is Cheap… Until It Isn’t

If you’re just getting started with OpenAI, the pricing page looks almost boringly low.

But once your product has:

  • Thousands of daily active users
  • Generous context windows
  • Multiple model calls per interaction

…that “cheap” API quickly turns into one of your most painful line items.

The tricky part: by the time the bill hurts, it’s usually too late. You’ve already shipped UX flows and backend patterns that assume “just call the model again,” and now every optimization feels like surgery.

One of the highest-leverage fixes we deploy for clients is caching—specifically, wrapping OpenAI calls with Redis. Done right, you get:

  • Lower costs: fewer tokens burned on repetitive work.
  • Lower latency: instant responses for cache hits.
  • Higher reliability: less pressure on rate limits and provider errors.

This post walks through how to think about the true cost of AI in your product and how to use Redis caching to get it under control.

Where Your OpenAI Spend Really Goes

When you look at a raw OpenAI invoice, it’s just tokens in / tokens out.

Under the hood, most bills are dominated by a few patterns:

  • Repeated prompts over static data

    • “Summarize this product description”, “Write a title for this blog post” — called over and over with the same or very similar inputs.
  • Large, unbounded context windows

    • Every chat request drags your entire conversation history plus a massive system prompt.
  • Multi-model pipelines

    • A single user action fans out into: classification → RAG retrieval → synthesis → follow-up calls.
  • Background jobs

    • Batch summarization, tagging, and enrichment jobs running nightly over the same records.

The common thread: many of these calls are effectively deterministic, or their results are “good enough” to reuse.

That’s where caching shines.

Why Redis Is a Great Fit for AI Caching

You can cache AI responses in many places (in-memory, CDN, database), but Redis hits a sweet spot:

  • Sub-millisecond access from your API servers.
  • Built-in TTLs so you can expire cached responses naturally.
  • Simple data model (strings, hashes, JSON) that plays nicely with serialized responses.
  • Horizontal scalability once you grow.

You don’t need a fancy architecture to start:

  1. Compute a cache key from the OpenAI request.
  2. Look up that key in Redis.
  3. If there’s a hit, return it.
  4. If there’s a miss, call OpenAI, store the result in Redis, then return it.

The nuance is in what you put into the key and where you decide to cache.

A Thin Redis Layer Around OpenAI

Here’s a simplified pattern around chat.completions in Node/TypeScript:

import OpenAI from "openai";
import { createClient } from "redis";
import crypto from "crypto";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const redis = createClient({ url: process.env.REDIS_URL });

await redis.connect();

function buildCacheKey(input: {
  model: string;
  systemPrompt: string;
  userPrompt: string;
  temperature: number;
  version: string; // your own versioning to bust cache
}) {
  const body = JSON.stringify(input);
  const hash = crypto.createHash("sha256").update(body).digest("hex");
  return `ai:chat:${hash}`;
}

async function cachedChat(options: {
  model: string;
  systemPrompt: string;
  userPrompt: string;
  temperature?: number;
  version?: string;
  ttlSeconds?: number;
}) {
  const {
    model,
    systemPrompt,
    userPrompt,
    temperature = 0,
    version = "v1",
    ttlSeconds = 60 * 60, // 1 hour
  } = options;

  const key = buildCacheKey({
    model,
    systemPrompt,
    userPrompt,
    temperature,
    version,
  });

  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached);
  }

  const response = await openai.chat.completions.create({
    model,
    temperature,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
  });

  const answer = response.choices[0].message.content;
  await redis.set(key, JSON.stringify(answer), { EX: ttlSeconds });

  return answer;
}

This is intentionally opinionated:

  • Temperature in the key: caching makes the most sense when responses are stable (typically temperature close to 0). If you want more creative outputs, you can still cache, but you’ll be pinning one random sample rather than getting a fresh answer each time.
  • Version in the key: bump this when you tweak the system prompt, change tools, or ship a major behavior change.
  • Short TTLs: you rarely want to cache AI responses forever. Start with minutes or hours, not days.
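
For example, two identical calls only hit OpenAI once (the model name, prompts, and productDescription below are placeholders, not part of the library):

const productDescription = "…"; // placeholder: whatever text you're summarizing

const args = {
  model: "gpt-4.1-mini",
  systemPrompt: "You write one-sentence summaries of product descriptions.",
  userPrompt: productDescription,
  ttlSeconds: 60 * 60 * 24, // static content can tolerate a longer TTL
};

const first = await cachedChat(args);  // miss: calls OpenAI and stores the answer
const second = await cachedChat(args); // hit: served straight from Redis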

Designing Good Cache Keys for AI

Bad cache keys are worse than no caching at all. They give you:

  • Stale responses after prompt changes.
  • Data leaking across tenants.
  • Confusing behavior when you switch models.

When we design keys around OpenAI calls, we typically include:

  • Model name

    • gpt-4o, gpt-4.1-mini, etc. Different models → different behavior and pricing.
  • System prompt (or at least a hash of it)

    • If you change your “personality” or instructions, you want a fresh cache.
  • User-visible input

    • The actual query, document ID, or content being summarized.
  • Temperature & other parameters

    • Anything that significantly influences the output.
  • App-level versioning

    • A manual version string that you bump when in doubt.

And then we prefix it clearly:

  • ai:chat:…
  • ai:summary:…
  • ai:embedding:…

This makes it easy to inspect and purge subsets of keys in production.
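
For instance, with node-redis v4 you can walk a prefix with scanIterator (the ai:summary: prefix here is just an example) and delete what you find, without ever running a blocking KEYS command:

// Purge every cached summary while leaving chat and embedding keys untouched.
for await (const key of redis.scanIterator({ MATCH: "ai:summary:*", COUNT: 100 })) {
  await redis.del(key);
}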

What You Should Cache First

Not every AI call is a good cache candidate. Start where:

  • Inputs are highly repetitive.
  • Data is mostly static.
  • You don’t mind slightly stale answers.

Some high-ROI examples:

  • Embeddings for the same text

    • Before you call embeddings.create, hash the text and check Redis under ai:embedding:<hash>. If it exists, reuse the stored embedding instead of paying for it again (see the sketch below).
  • Summaries of static content

    • Blog posts, help articles, product descriptions. Summarize once, cache forever (or until the content changes).
  • Classification & tagging

    • “Is this spam?”, “Which category is this ticket?” — perfect for long TTL caches.
  • RAG retrieval results

    • If you run the same search query over the same index repeatedly, cache the retrieved document IDs, then only call the model on top.

In practice, these alone can shave 30–60% off your OpenAI usage on mature products.
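
To make the embeddings case concrete, here’s a minimal sketch reusing the redis and openai clients from earlier (text-embedding-3-small is just a stand-in model name):

async function cachedEmbedding(text: string, ttlSeconds = 60 * 60 * 24 * 30) {
  // Hash the text so long documents still produce short, stable keys.
  const hash = crypto.createHash("sha256").update(text).digest("hex");
  const key = `ai:embedding:${hash}`;

  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached) as number[];
  }

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });

  const embedding = response.data[0].embedding;
  await redis.set(key, JSON.stringify(embedding), { EX: ttlSeconds });
  return embedding;
}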

What You Shouldn’t Cache (Or Cache Carefully)

On the other side, there are calls where caching either doesn’t help much or adds product risk:

  • Highly personalized chat

    • If each response depends on a rich user profile, real-time data, or sensitive context, cross-user caching is dangerous. You can still cache per user, but benefits are smaller.
  • High-temperature creative writing

    • If your product promise is “always fresh, never the same answer twice”, heavy caching can make the experience feel stale.
  • Security-critical decisions

    • Anything involving auth, payments, or compliance should not silently serve cached responses without very careful design.

When in doubt, start with explicit, narrow caches around obviously repetitive work instead of trying to cache everything.
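
And if you do go the per-user route mentioned above, scope the key to the tenant and user so nothing can leak across accounts. A minimal sketch (tenantId and userId are whatever identifiers your app already carries):

function buildUserScopedKey(input: {
  tenantId: string;
  userId: string;
  model: string;
  prompt: string;
}) {
  const hash = crypto
    .createHash("sha256")
    .update(JSON.stringify(input))
    .digest("hex");
  // Keeping the tenant in the prefix makes per-tenant inspection and purges easy.
  return `ai:chat:${input.tenantId}:${hash}`;
}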

Measuring If Caching Actually Saves You Money

Caching feels good, but you should still prove it.

At minimum, log:

  • Cache hit rate (per endpoint, per model).
  • Average tokens saved per hit.
  • Latency for hits vs misses.
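
One low-effort way to track hit rate, since Redis is already in the stack, is a hash of counters per endpoint (a proper metrics library works just as well; ai:stats: is an arbitrary prefix):

// Call this on every lookup, so hit rate is just hits / (hits + misses).
async function recordCacheResult(endpoint: string, hit: boolean) {
  await redis.hIncrBy(`ai:stats:${endpoint}`, hit ? "hits" : "misses", 1);
}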

You can then estimate savings with a rough formula:

Monthly savings ≈ cache hits per month × average tokens saved per hit × price per token
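
For instance, with purely illustrative numbers (not real pricing): 1,000,000 hits per month × 1,000 tokens saved per hit × $1 per million tokens works out to roughly $1,000 a month, before you count the latency win.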

Even with conservative numbers, it’s common to see:

  • Double-digit percentage reductions in spend.
  • 100–300 ms faster perceived responses on cache hits.

If you correlate this with user behavior (more messages per session, higher retention), caching quickly pays for itself.

Operational Concerns: Don’t Create a New Single Point of Failure

Introducing Redis means introducing another dependency. A few practical tips:

  • Fail open, not closed

    • If Redis is down or slow, skip the cache and call OpenAI directly; see the sketch after this list. Your product should keep working (just more expensively).
  • Timeouts everywhere

    • Don’t let a slow Redis roundtrip block your user. Use short timeouts and default to an API call if needed.
  • Observability

    • Track cache hit/miss metrics, latency, and error rates. It’s the only way to know if your caching strategy is healthy.
  • Scoped keys

    • Use prefixes and maybe per-environment suffixes (:dev, :staging, :prod) to avoid accidents when you run multiple environments.
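
To make the fail-open and timeout advice concrete, one way to guard the lookup is a race against a short timer (the 50 ms budget and the safeCacheGet name are assumptions, not part of any library):

// Resolve with null (a "miss") if Redis errors out or takes longer than timeoutMs.
async function safeCacheGet(key: string, timeoutMs = 50): Promise<string | null> {
  const timer = new Promise<null>((resolve) => setTimeout(() => resolve(null), timeoutMs));
  try {
    return await Promise.race([redis.get(key), timer]);
  } catch {
    return null; // fail open: treat any Redis error as a cache miss
  }
}

Inside cachedChat, swapping redis.get(key) for safeCacheGet(key) and wrapping the redis.set in a similar try/catch keeps a cache outage from ever blocking a response.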

Bringing It All Together

The “true cost” of AI in your product isn’t just the per-token price. It’s:

  • How often you recompute the same answer.
  • How much latency users feel on every interaction.
  • How frequently you hit rate limits at peak traffic.

By wrapping your OpenAI calls with a thin Redis caching layer, you:

  • Turn obviously repetitive work into instant responses.
  • Free up budget for the places where freshness actually matters.
  • Buy yourself headroom to ship more AI features without panicking over the next invoice.

Start small: pick one noisy endpoint, put Redis in front of it, and measure. Once you see the impact, it becomes natural to treat caching as a first-class part of your AI architecture—not an afterthought when the bill arrives.
