We use FastAPI for our backend APIs. For the last couple of years, we’ve struggled with throughput in production. During peak traffic, we’d often run into gateway timeouts—even though the API service nodes running on ECS weren’t showing high CPU usage.
We’ve long suspected the issue had something to do with how we were using FastAPI. This week, we finally figured it out.

The Context
FastAPI allows you to define endpoints in two ways:
1. Synchronous function:
@r.get("/sample-1")
def sample_1():
    return {"success": True}
2. Asynchronous function:
@r.get("/sample-2")
async def sample_2():
    return {"success": True}
According to the FastAPI docs:
- Use def if your code calls synchronous, blocking I/O libraries
- Use async def if your code uses non-blocking, async-aware libraries
For the last year, we “solved” our throughput issues by scaling up hardware during peak hours. But this week, we had the time to actually get to the root of the problem.
What We Missed
The FastAPI docs briefly mention that def functions are executed in a threadpool. That part never stood out to us in earlier reads. We assumed the defaults were good enough.
But after digging further into Starlette's docs, we discovered that the default thread pool size is just 40: Starlette hands sync handlers to AnyIO's default thread limiter, which allows only 40 concurrent threads. That is a really low value for an I/O-heavy service!
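If you want to see that default for yourself, here is a minimal sketch (assuming a recent AnyIO version; not from our codebase) that reads the limiter Starlette uses:

import anyio

async def main() -> None:
    # Starlette runs def endpoints through this limiter; it starts with 40 tokens.
    limiter = anyio.to_thread.current_default_thread_limiter()
    print(limiter.total_tokens)  # 40 by default

anyio.run(main)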
Our backend primarily uses blocking I/O libraries: SQLAlchemy for Postgres, blocking Redis clients, and RabbitMQ libraries. Meanwhile, our endpoint handlers were a mix of async def and def, written without a clear understanding of the tradeoffs. This led to two types of problems (illustrated in the sketch after this list):
- Blocking the event loop when using async def with blocking libraries
- Getting throttled by the threadpool when using def, due to the low default pool size
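Here is a minimal sketch of both failure modes (hypothetical handlers, with time.sleep standing in for a blocking database or Redis call):

import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/bad-async")
async def bad_async():
    # Runs directly on the event loop: while this sleeps, every other request
    # handled by this worker is stalled.
    time.sleep(1)
    return {"success": True}

@app.get("/throttled-sync")
def throttled_sync():
    # FastAPI runs this in the AnyIO threadpool, so the event loop stays free,
    # but with the default limit only 40 of these can run at once.
    time.sleep(1)
    return {"success": True}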
The Fix
We considered two options:
- Refactor the codebase to use async I/O libraries end-to-end
- Convert all endpoints to synchronous (def) and increase the threadpool size
Option 1 is impractical for us, primarily because our SQLAlchemy-based CRUD code is shared between FastAPI and Celery, which is sync-first. To fully adopt async, we'd either need to maintain duplicate versions of the CRUD logic (sync and async) or wrap Celery calls with asyncio.run, which introduces complexity.
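To make the constraint concrete, here is a simplified sketch, with illustrative names and queries rather than our real code, of one blocking CRUD helper shared by a FastAPI endpoint and a Celery task; making the helper async would force either a duplicate sync copy or asyncio.run wrappers inside the tasks:

from celery import Celery
from fastapi import FastAPI
from sqlalchemy import create_engine, text

app = FastAPI()
celery_app = Celery("worker", broker="amqp://localhost")       # placeholder broker
engine = create_engine("postgresql://user:pass@localhost/db")  # placeholder URL

def get_order_total(order_id: int):
    # Shared, blocking CRUD helper used by both callers below.
    with engine.connect() as conn:
        return conn.execute(
            text("SELECT total FROM orders WHERE id = :id"), {"id": order_id}
        ).scalar()

@app.get("/orders/{order_id}/total")
def order_total(order_id: int):
    # FastAPI: a def endpoint, so the blocking call runs in the threadpool.
    return {"total": get_order_total(order_id)}

@celery_app.task
def reconcile_order(order_id: int):
    # Celery: sync-first, calls the exact same helper with no wrappers.
    get_order_total(order_id)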
A full async migration would require halting all other engineering work, dedicating effort to refactoring and testing across the board. Given how much of our business depends on the current system, this level of disruption is risky and likely unacceptable to product and business teams.
So we took the practical route.
We converted all our FastAPI route handlers to def and bumped the threadpool size using the following idea:
from contextlib import asynccontextmanager
from typing import AsyncIterator

import anyio
from fastapi import FastAPI

ANYIO_THREAD_COUNT = 2000  # we ramped this up gradually (see the rollout below)

@asynccontextmanager
async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
    # Resize the AnyIO limiter that Starlette uses to run sync (def) endpoints.
    limiter = anyio.to_thread.current_default_thread_limiter()
    limiter.total_tokens = ANYIO_THREAD_COUNT
    yield

app = FastAPI(
    lifespan=lifespan,
)
We rolled out the changes gradually over 2–3 days:
- Migrated a few endpoints at a time to def
- Increased ANYIO_THREAD_COUNT incrementally; we eventually went up to 2000 with no issues (one way to observe the limiter at runtime is sketched below)
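While ramping the limit, it helps to watch the live limiter values. Building on the lifespan setup above, a hypothetical debug endpoint like this (not something we actually shipped, just one way to observe the pool) works:

@app.get("/debug/threadpool")
async def threadpool_stats():
    # Reads the same AnyIO limiter the lifespan hook resizes at startup.
    limiter = anyio.to_thread.current_default_thread_limiter()
    return {
        "total_tokens": limiter.total_tokens,        # should equal ANYIO_THREAD_COUNT
        "borrowed_tokens": limiter.borrowed_tokens,  # worker threads currently in use
    }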
The Results
Improved resource utilization
Previously, we struggled to get throughput out of our API servers even though plenty of CPU was left unused. This meant we could not autoscale properly and had to rely on a cron script to set the number of nodes in our API service just to meet traffic needs.
See the graph below: average CPU utilization never went above 20%, even though we were seeing 504 errors at the load balancer.

Now we can scale up based on the CPU utilization metric because the CPUs are being used much more effectively. We run about 50% of the nodes we did before, and resource utilization is much better, with CPU usage going above 40%.

As a result, we handle far more requests per target than before: it used to be around 800 requests per minute per node.

Now it's about 2-3k requests per minute per node.

There is still work to be done. We will keep experimenting with the thread count, find the minimum number of nodes we need for our kind of traffic, and push the requests per minute per node as high as we can.
Improved average latency
Previously, despite throwing more nodes at the problem, we still saw poor average API latencies.

Now the latencies are much more stable and consistently under 150ms.

These changes gave us an immediate performance win, with very little risk.
If you’re using FastAPI and running into similar issues, it’s worth taking a hard look at how your view functions are defined and whether your threadpool is holding you back.
Updates after 1 month of production observations
- The threadpool of anyio seems to be a lazy threadpool: no threads are created unless required (see the sketch below).
- The high thread count of 2000 does not matter as much if your API latency is low, but setting a generously high value should be fine, since the threadpool itself is lazy.
- Moving to sync was really the right decision for us and can work well for anyone else facing the same problem.
- Even without async, it's possible to get great performance from FastAPI. One day we may start migrating to async libraries, but we do not need to worry about it for the next 12-18 months.
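On the first point, a rough local check (a sketch under our assumptions, not a benchmark) suggests that raising the limit does not pre-create threads; worker threads only appear once sync work is actually dispatched:

import threading

import anyio

async def main() -> None:
    limiter = anyio.to_thread.current_default_thread_limiter()
    limiter.total_tokens = 2000
    print("threads before any sync work:", threading.active_count())
    await anyio.to_thread.run_sync(lambda: None)  # forces a single worker thread
    print("threads after one sync call:", threading.active_count())

anyio.run(main)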