We use FastAPI for our backend APIs. For the last couple of years, we’ve struggled with throughput in production. During peak traffic, we’d often run into gateway timeouts—even though the API service nodes running on ECS weren’t showing high CPU usage.
We’ve long suspected the issue had something to do with how we were using FastAPI. This week, we finally figured it out.

The Context
FastAPI allows you to define endpoints in two ways:
1. Synchronous function:
@r.get("/sample-1")
def sample_1():
    return {"success": True}
2. Asynchronous function:
@r.get("/sample-2")
async def sample_2():
    return {"success": True}
According to the FastAPI docs:
- Use def if your code calls synchronous, blocking I/O libraries
- Use async def if your code uses non-blocking, async-aware libraries
For the last year, we “solved” our throughput issues by scaling up hardware during peak hours. But this week, we had the time to actually get to the root of the problem.
What We Missed
The FastAPI docs briefly mention that def functions are executed in a threadpool. That part never stood out to us in earlier reads. We assumed the defaults were good enough.
But after digging further into Starlette's docs, we discovered that the default thread pool size is just 40: Starlette hands sync handlers to AnyIO's default thread limiter, which allows only 40 concurrent threads. That is a really low value for an I/O-heavy service!
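If you want to see that default for yourself, here is a minimal sketch (assuming a recent AnyIO version; not from our codebase) that reads the limiter Starlette uses:

import anyio

async def main() -> None:
    # Starlette runs def endpoints through this limiter; it starts with 40 tokens.
    limiter = anyio.to_thread.current_default_thread_limiter()
    print(limiter.total_tokens)  # 40 by default

anyio.run(main)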
Our backend primarily uses blocking I/O libraries: SQLAlchemy for Postgres, blocking Redis clients, and RabbitMQ libraries. Meanwhile, our endpoint handlers were a mix of async def and def, written without a clear understanding of the tradeoffs. This led to two types of problems (illustrated in the sketch after this list):
- Blocking the event loop when using async def with blocking libraries
- Getting throttled by the threadpool when using def, due to the low default pool size
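Here is a minimal sketch of both failure modes (hypothetical handlers, with time.sleep standing in for a blocking database or Redis call):

import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/bad-async")
async def bad_async():
    # Runs directly on the event loop: while this sleeps, every other request
    # handled by this worker is stalled.
    time.sleep(1)
    return {"success": True}

@app.get("/throttled-sync")
def throttled_sync():
    # FastAPI runs this in the AnyIO threadpool, so the event loop stays free,
    # but with the default limit only 40 of these can run at once.
    time.sleep(1)
    return {"success": True}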
The Fix
We considered two options:
- Refactor the codebase to use async I/O libraries end-to-end
- Convert all endpoints to synchronous (def) and increase the threadpool size
Option 1 is impractical for us, primarily because our SQLAlchemy-based CRUD code is shared between FastAPI and Celery, which is sync-first. To fully adopt async, we'd either need to maintain duplicate versions of the CRUD logic (sync and async) or wrap Celery calls with asyncio.run, which introduces complexity.
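To make the constraint concrete, here is a simplified sketch, with illustrative names and queries rather than our real code, of one blocking CRUD helper shared by a FastAPI endpoint and a Celery task; making the helper async would force either a duplicate sync copy or asyncio.run wrappers inside the tasks:

from celery import Celery
from fastapi import FastAPI
from sqlalchemy import create_engine, text

app = FastAPI()
celery_app = Celery("worker", broker="amqp://localhost")       # placeholder broker
engine = create_engine("postgresql://user:pass@localhost/db")  # placeholder URL

def get_order_total(order_id: int):
    # Shared, blocking CRUD helper used by both callers below.
    with engine.connect() as conn:
        return conn.execute(
            text("SELECT total FROM orders WHERE id = :id"), {"id": order_id}
        ).scalar()

@app.get("/orders/{order_id}/total")
def order_total(order_id: int):
    # FastAPI: a def endpoint, so the blocking call runs in the threadpool.
    return {"total": get_order_total(order_id)}

@celery_app.task
def reconcile_order(order_id: int):
    # Celery: sync-first, calls the exact same helper with no wrappers.
    get_order_total(order_id)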
A full async migration would require halting all other engineering work, dedicating effort to refactoring and testing across the board. Given how much of our business depends on the current system, this level of disruption is risky and likely unacceptable to product and business teams.
So we took the practical route.
We converted all our FastAPI route handlers to def and bumped the threadpool size using the following idea:
from contextlib import asynccontextmanager
from typing import AsyncIterator

import anyio
from fastapi import FastAPI

ANYIO_THREAD_COUNT = 2000  # we ramped this up gradually (see the rollout below)

@asynccontextmanager
async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
    # Resize the AnyIO limiter that Starlette uses to run sync (def) endpoints.
    limiter = anyio.to_thread.current_default_thread_limiter()
    limiter.total_tokens = ANYIO_THREAD_COUNT
    yield

app = FastAPI(
    lifespan=lifespan,
)
We rolled out the changes gradually over 2–3 days:
- Migrated a few endpoints at a time to def
- Increased ANYIO_THREAD_COUNT incrementally; we eventually went up to 2000 with no issues (one way to observe the limiter at runtime is sketched below)
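While ramping the limit, it helps to watch the live limiter values. Building on the lifespan setup above, a hypothetical debug endpoint like this (not something we actually shipped, just one way to observe the pool) works:

@app.get("/debug/threadpool")
async def threadpool_stats():
    # Reads the same AnyIO limiter the lifespan hook resizes at startup.
    limiter = anyio.to_thread.current_default_thread_limiter()
    return {
        "total_tokens": limiter.total_tokens,        # should equal ANYIO_THREAD_COUNT
        "borrowed_tokens": limiter.borrowed_tokens,  # worker threads currently in use
    }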
The Results
Improved resource utilization
Previously, we struggled to get throughput out of our API servers even though plenty of CPU was left unused. This meant we could not autoscale properly and had to rely on a cron script to set the number of nodes in our API service just to meet traffic needs.
See the graph below: average CPU utilization never went above 20%, even though we were seeing 504 errors at the load balancer.

Now we can scale up based on the CPU utilization metric because the CPUs are being used much more effectively. We run about 50% of the nodes we did before, and resource utilization is much better, with CPU usage going above 40%.

As a result, we handle far more requests per target than before: it used to be around 800 requests per minute per node.

Now it's about 2-3k requests per minute per node.

There is still work to be done. We will keep experimenting with the thread count, find the minimum number of nodes we need for our kind of traffic, and push the requests per minute per node as high as we can.
Improved average latency
Previously, despite throwing more nodes at the problem, we still saw poor average API latencies.

Now the latencies are much more stable and consistently under 150ms.

These changes gave us an immediate performance win, with very little risk.
If you’re using FastAPI and running into similar issues, it’s worth taking a hard look at how your view functions are defined and whether your threadpool is holding you back.
Updates after 1 month of production observations
- The threadpool of anyio seems to be a lazy threadpool: no threads are created unless required (see the sketch below).
- The high thread count of 2000 does not matter as much if your API latency is low, but setting a generously high value should be fine, since the threadpool itself is lazy.
- Moving to sync was really the right decision for us and can work well for anyone else facing the same problem.
- Even without async, it's possible to get great performance from FastAPI. One day we may start migrating to async libraries, but we do not need to worry about it for the next 12-18 months.
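On the first point, a rough local check (a sketch under our assumptions, not a benchmark) suggests that raising the limit does not pre-create threads; worker threads only appear once sync work is actually dispatched:

import threading

import anyio

async def main() -> None:
    limiter = anyio.to_thread.current_default_thread_limiter()
    limiter.total_tokens = 2000
    print("threads before any sync work:", threading.active_count())
    await anyio.to_thread.run_sync(lambda: None)  # forces a single worker thread
    print("threads after one sync call:", threading.active_count())

anyio.run(main)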