After deploying your FastAPI application to Google Cloud Run using a CI/CD pipeline, the next step is to optimize its performance. While multi-stage Docker builds create a good foundation, true performance tuning involves managing concurrency, reducing latency, and understanding how Cloud Run scales your instances.
This guide covers advanced techniques for making your FastAPI service faster and more efficient on Cloud Run, focusing on startup times, concurrency, and server configuration.
Cloud Run scales instances based on incoming traffic. When a service scales from zero or handles a spike in requests, a new instance must start. This "cold start" adds latency. Minimizing this startup time is critical for a responsive service.
The startup routine involves downloading your container image, starting the container, and running your application's initialization code. Here are three effective strategies to speed this up.
Your application might need to perform several tasks at startup, like pre-loading data, warming up a cache, or establishing database connection pools. If these tasks run one after another (sequentially), they can add significant time to your cold start.
If you configure your Cloud Run service with more than one vCPU, you can potentially run these initialization tasks in parallel to speed up the startup process.
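As a sketch, assuming two independent I/O-bound startup tasks (`warm_cache` and `init_db_pool` are hypothetical placeholders for your own initialization work), you can run them concurrently with `asyncio.gather` inside FastAPI's lifespan handler:

```python
# main.py — a minimal sketch; warm_cache and init_db_pool are
# hypothetical placeholders for your own startup tasks.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI


async def warm_cache() -> None:
    # Simulate pre-loading data into an in-memory cache.
    await asyncio.sleep(1)


async def init_db_pool() -> None:
    # Simulate establishing a database connection pool.
    await asyncio.sleep(1)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run both tasks concurrently instead of one after the other,
    # so startup takes roughly 1 second here rather than 2.
    await asyncio.gather(warm_cache(), init_db_pool())
    yield


app = FastAPI(lifespan=lifespan)
```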
Cloud Run allows you to temporarily increase CPU allocation during instance startup. This can significantly reduce latency caused by loading libraries, initializing connections, or other startup tasks.
You can enable this feature during deployment. It's particularly useful for applications with heavy dependencies.
```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --cpu-boost \
  --region=us-central1
```
To eliminate cold starts entirely for traffic-facing services, you can configure a minimum number of instances to keep warm and ready to serve requests. This is the most effective way to ensure low latency, but it has cost implications since these instances are always running.
Set a minimum number of instances with the `--min-instances` flag.
```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --min-instances=1 \
  --region=us-central1
```
Cloud Run instances can handle multiple requests at the same time. The default concurrency is 80, but you can tune this value. For FastAPI, which is I/O-bound, a higher concurrency is often fine. However, the optimal number depends on your application's workload.
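For example, you can set the value at deploy time with the `--concurrency` flag (160 here is purely illustrative; derive your own number from load testing):

```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --concurrency=160 \
  --region=us-central1
```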
The most important tuning you can do is configuring your ASGI server correctly. For production, you should use a process manager like Gunicorn to manage Uvicorn workers.
A common formula for the number of workers is `(2 x Number of Cores) + 1`. Since a Cloud Run instance has a specific number of vCPUs, you can set your worker count based on that.
Here is an example `CMD` for your Dockerfile that configures Gunicorn with Uvicorn workers. This example assumes the instance has 1 vCPU.
```dockerfile
# Dockerfile
# ...
# Shell form (not the JSON exec form) so that $PORT, which Cloud Run
# sets at runtime, is expanded; the exec form does not substitute
# environment variables.
CMD gunicorn -w 3 -k uvicorn.workers.UvicornWorker app.main:app --bind ":$PORT" --max-requests 1000 --max-requests-jitter 100
```
- `-w 3`: Sets the number of worker processes to 3.
- `-k uvicorn.workers.UvicornWorker`: Tells Gunicorn to use Uvicorn to handle requests.
- `--max-requests 1000`: Restarts a worker after it handles 1,000 requests. This is a useful safeguard against memory leaks, where a worker might gradually consume more memory over time. Restarting the worker releases that memory, preventing performance degradation.
- `--max-requests-jitter 100`: Adds a random delay to the restart threshold, preventing all workers from restarting simultaneously.
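If you'd rather not hardcode the worker count, a small `gunicorn.conf.py` can compute it from the formula above. This is a sketch that relies on Gunicorn's default behavior of loading `./gunicorn.conf.py` from the working directory:

```python
# gunicorn.conf.py — a minimal sketch that derives the worker count
# from the (2 x cores) + 1 formula instead of hardcoding it.
import multiprocessing
import os

# Cloud Run injects PORT; default to 8080 for local runs.
bind = f":{os.environ.get('PORT', '8080')}"

# Note: cpu_count() may not always match your configured vCPU limit
# in containerized environments, so verify the result under load.
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
max_requests = 1000
max_requests_jitter = 100
```

With this file in place, the Dockerfile command shrinks to `CMD gunicorn app.main:app`.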
To get the most out of FastAPI, you must use asynchronous libraries for I/O-bound operations like database queries and external API calls. If you use a traditional, synchronous library (like the default `psycopg2` for PostgreSQL), it will block the entire worker process, preventing it from handling other requests.

Switch to async-native libraries like `asyncpg` for your database and `httpx` or `aiohttp` for making API requests. This allows FastAPI to efficiently manage concurrent requests while waiting for I/O operations to complete.
In a containerized environment, it's crucial that logs are written directly to `stdout` and `stderr` so Cloud Logging can capture them in real time. Python's output is buffered by default, which can delay log entries or lose them entirely if an instance crashes.
To disable this buffering, set the `PYTHONUNBUFFERED` environment variable in your Dockerfile.
```dockerfile
# Dockerfile
# ...
ENV PYTHONUNBUFFERED=1
# Shell form again, so $PORT is expanded at runtime.
CMD gunicorn -w 3 -k uvicorn.workers.UvicornWorker app.main:app --bind ":$PORT"
```
The smaller your container image, the faster Cloud Run can download and start it. A key part of this is choosing a minimal base image for your Dockerfile. Instead of using the default `python:3.11` image, opt for a slimmer variant like `python:3.11-slim-bullseye`. Alpine-based images (`python:3.11-alpine`) are even smaller but can sometimes lead to compatibility issues with Python packages that rely on `glibc`. For most FastAPI applications, the `slim` variant offers the best balance of size and compatibility.
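As a quick illustration, this is a one-line change at the top of your Dockerfile:

```dockerfile
# Dockerfile — choosing a slim base is a one-line change.
# FROM python:3.11                # full Debian image, much larger
FROM python:3.11-slim-bullseye    # same Python, far fewer OS packages
```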
Another micro-optimization is to pre-compile your Python code to bytecode (`.pyc` files) during the build. This saves the interpreter a small amount of work at startup.
```dockerfile
# In your final stage
# ...
COPY . .
RUN python -m compileall .
# ...
```
For a detailed guide on creating smaller images, see my article on slimmer FastAPI Docker images with multi-stage builds.
Your application's startup time is also affected by the code and dependencies it loads.
Heavy Dependencies: If you use large libraries like TensorFlow or PyTorch, import only the submodules you need at startup. Consider lazy-loading other parts of the library within specific functions where they are used.
For example, instead of importing a heavy library at the top of your file, which slows down startup:
```python
# This slows down startup because pandas is loaded immediately
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
```
Import it inside the route that needs it. This way, the library is only loaded into memory when that specific endpoint is called, improving initial startup time.
```python
# main.py
from fastapi import FastAPI

app = FastAPI()


@app.get("/generate-report")
async def generate_report():
    # pandas is imported only when this endpoint is called
    import pandas as pd

    data = {"col1": [1, 2], "col2": [3, 4]}
    df = pd.DataFrame(data=data)
    return {"report_size": len(df)}
```
Large Assets: For static web assets like images, videos, CSS, and JavaScript files, it is best practice to not serve them from your FastAPI application at all. Instead, upload them to an object store like Google Cloud Storage. You can then serve them directly from the storage bucket or, for better performance, put the bucket behind a Content Delivery Network (CDN) like Google Cloud CDN. This offloads traffic from your Cloud Run instances and delivers content to users faster by caching it at edge locations around the world.
If your application needs large files like AI models, include them in your container image for fast access. But load them into memory after the application has started and is ready to serve health checks. For non-critical assets like media files, consider using Cloud Storage volume mounts to access them without bloating your container image.
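As a sketch, a Cloud Storage volume mount can be added with flags along these lines (the bucket name, volume name, and mount path are placeholders; check the gcloud documentation for the exact syntax in your version):

```bash
gcloud run services update YOUR_SERVICE_NAME \
  --add-volume=name=assets,type=cloud-storage,bucket=YOUR_BUCKET \
  --add-volume-mount=volume=assets,mount-path=/mnt/assets \
  --region=us-central1
```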
Your performance tuning choices directly impact cost. Cloud Run offers two billing models: request-based billing, where CPU is allocated and charged only while requests are being processed, and instance-based billing, where CPU is always allocated and you pay for the entire lifetime of each instance.
Setting `--min-instances` means you are paying to keep those instances warm even when they are idle. While this improves performance by eliminating cold starts, it also increases costs. You must balance the need for low latency with your budget. Using startup CPU boost, on the other hand, only adds a small cost during the brief startup period.
Q: Why use Gunicorn with Uvicorn instead of just running Uvicorn alone?

A: Gunicorn acts as a process manager. It can manage multiple Uvicorn worker processes, allowing your application to fully utilize multi-core CPUs and handle more requests in parallel. It also provides robustness by automatically restarting workers that crash, which is essential for a production environment. Running Uvicorn directly is simpler but typically limits you to a single process.
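For comparison, the two invocations look roughly like this (the module path and port are placeholders):

```bash
# Single process — fine for development
uvicorn app.main:app --host 0.0.0.0 --port 8080

# Process manager with three Uvicorn workers — better for production
gunicorn -w 3 -k uvicorn.workers.UvicornWorker app.main:app --bind :8080
```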
Q: What's the most cost-effective way to reduce cold starts without setting `--min-instances=1`?

A: The most cost-effective method is to enable the startup CPU boost. This accelerates instance startup time without the cost of a continuously running instance. Combining this with a smaller container image and lazy-loading heavy dependencies will give you the best performance improvement for the lowest cost.
Q: How do I decide how much CPU and memory to allocate to my Cloud Run service?

A: Start with a baseline, such as 1 vCPU and 512MiB of memory. After deploying, use Google Cloud Monitoring to observe the actual CPU and memory utilization under load. If your service is consistently near its limits, increase the allocation. If it's using very little, you can reduce it to save costs. The right size depends entirely on your application's specific workload.
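For example, that starting baseline can be set explicitly at deploy time:

```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --cpu=1 \
  --memory=512Mi \
  --region=us-central1
```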
Q: When should I use `--min-instances`?

A: Use minimum instances for user-facing applications where low latency is critical. For background services or APIs with non-critical response times, the default scaling from zero is more cost-effective.
David Muraya is a Solutions Architect specializing in Python, FastAPI, and Cloud Infrastructure. He is passionate about building scalable, production-ready applications and sharing his knowledge with the developer community. You can connect with him on LinkedIn.
Have a project in mind? Send me an email at hello@davidmuraya.com and let's bring your ideas to life. I am always available for exciting discussions.