After deploying your FastAPI application to Google Cloud Run using a CI/CD pipeline, the next step is to optimize its performance. While multi-stage Docker builds create a good foundation, true performance tuning involves managing concurrency, reducing latency, and understanding how Cloud Run scales your instances.
This guide covers advanced techniques for making your FastAPI service faster and more efficient on Cloud Run, focusing on startup times, concurrency, and server configuration.
Cloud Run scales instances based on incoming traffic. When a service scales from zero or handles a spike in requests, a new instance must start. This "cold start" adds latency. Minimizing this startup time is critical for a responsive service.
The startup routine involves downloading your container image, starting the container, and running your application's initialization code. Here are three effective strategies to speed this up.
Your application might need to perform several tasks at startup, like pre-loading data, warming up a cache, or establishing database connection pools. If these tasks run one after another (sequentially), they can add significant time to your cold start.
If you configure your Cloud Run service with more than one vCPU, you can potentially run these initialization tasks in parallel to speed up the startup process.
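As a sketch, assuming two independent I/O-bound startup tasks (`warm_cache` and `init_db_pool` are hypothetical placeholders for your own initialization work), you can run them concurrently with `asyncio.gather` inside FastAPI's lifespan handler:

```python
# main.py — a minimal sketch; warm_cache and init_db_pool are
# hypothetical placeholders for your own startup tasks.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI


async def warm_cache() -> None:
    # Simulate pre-loading data into an in-memory cache.
    await asyncio.sleep(1)


async def init_db_pool() -> None:
    # Simulate establishing a database connection pool.
    await asyncio.sleep(1)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run both tasks concurrently instead of one after the other,
    # so startup takes roughly 1 second here rather than 2.
    await asyncio.gather(warm_cache(), init_db_pool())
    yield


app = FastAPI(lifespan=lifespan)
```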
Cloud Run allows you to temporarily increase CPU allocation during instance startup. This can significantly reduce latency caused by loading libraries, initializing connections, or other startup tasks.
You can enable this feature during deployment. It's particularly useful for applications with heavy dependencies.
```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --cpu-boost \
  --region=us-central1
```
To eliminate cold starts entirely for traffic-facing services, you can configure a minimum number of instances to keep warm and ready to serve requests. This is the most effective way to ensure low latency, but it has cost implications since these instances are always running.
Set a minimum number of instances with the `--min-instances` flag.
```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --min-instances=1 \
  --region=us-central1
```
Cloud Run instances can handle multiple requests at the same time. The default concurrency is 80, but you can tune this value. For FastAPI, which is I/O-bound, a higher concurrency is often fine. However, the optimal number depends on your application's workload.
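For example, you can set the value at deploy time with the `--concurrency` flag (160 here is purely illustrative; derive your own number from load testing):

```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --concurrency=160 \
  --region=us-central1
```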
The most important tuning you can do is configuring your ASGI server correctly. For production, you should use a process manager like Gunicorn to manage Uvicorn workers.
A common formula for the number of workers is `(2 x Number of Cores) + 1`. Since a Cloud Run instance has a specific number of vCPUs, you can set your worker count based on that.
Here is an example `CMD` for your Dockerfile that configures Gunicorn with Uvicorn workers. This example assumes the instance has 1 vCPU.
```dockerfile
# Dockerfile
# ...
# Shell form (not the JSON exec form) so that $PORT, which Cloud Run
# sets at runtime, is expanded; the exec form does not substitute
# environment variables.
CMD gunicorn -w 3 -k uvicorn.workers.UvicornWorker app.main:app --bind ":$PORT" --max-requests 1000 --max-requests-jitter 100
```
- `-w 3`: Sets the number of worker processes to 3.
- `-k uvicorn.workers.UvicornWorker`: Tells Gunicorn to use Uvicorn to handle requests.
- `--max-requests 1000`: Restarts a worker after it handles 1,000 requests. This is a useful safeguard against memory leaks, where a worker might gradually consume more memory over time. Restarting the worker releases that memory, preventing performance degradation.
- `--max-requests-jitter 100`: Adds a random delay to the restart threshold, preventing all workers from restarting simultaneously.
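If you'd rather not hardcode the worker count, a small `gunicorn.conf.py` can compute it from the formula above. This is a sketch that relies on Gunicorn's default behavior of loading `./gunicorn.conf.py` from the working directory:

```python
# gunicorn.conf.py — a minimal sketch that derives the worker count
# from the (2 x cores) + 1 formula instead of hardcoding it.
import multiprocessing
import os

# Cloud Run injects PORT; default to 8080 for local runs.
bind = f":{os.environ.get('PORT', '8080')}"

# Note: cpu_count() may not always match your configured vCPU limit
# in containerized environments, so verify the result under load.
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
max_requests = 1000
max_requests_jitter = 100
```

With this file in place, the Dockerfile command shrinks to `CMD gunicorn app.main:app`.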
To get the most out of FastAPI, you must use asynchronous libraries for I/O-bound operations like database queries and external API calls. If you use a traditional, synchronous library (like the default `psycopg2` for PostgreSQL), it will block the entire worker process, preventing it from handling other requests.

Switch to async-native libraries like `asyncpg` for your database and `httpx` or `aiohttp` for making API requests. This allows FastAPI to efficiently manage concurrent requests while waiting for I/O operations to complete.
In a containerized environment, it's crucial that logs are written directly to `stdout` and `stderr` so Cloud Logging can capture them in real time. Python's output is buffered by default, which can delay log entries or lose them entirely if an instance crashes.
To disable this buffering, set the `PYTHONUNBUFFERED` environment variable in your Dockerfile.
```dockerfile
# Dockerfile
# ...
ENV PYTHONUNBUFFERED=1
# Shell form again, so $PORT is expanded at runtime.
CMD gunicorn -w 3 -k uvicorn.workers.UvicornWorker app.main:app --bind ":$PORT"
```
The smaller your container image, the faster Cloud Run can download and start it. A key part of this is choosing a minimal base image for your Dockerfile. Instead of using the default `python:3.11` image, opt for a slimmer variant like `python:3.11-slim-bullseye`. Alpine-based images (`python:3.11-alpine`) are even smaller but can sometimes lead to compatibility issues with Python packages that rely on `glibc`. For most FastAPI applications, the `slim` variant offers the best balance of size and compatibility.
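As a quick illustration, this is a one-line change at the top of your Dockerfile:

```dockerfile
# Dockerfile — choosing a slim base is a one-line change.
# FROM python:3.11                # full Debian image, much larger
FROM python:3.11-slim-bullseye    # same Python, far fewer OS packages
```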
Another micro-optimization is to pre-compile your Python code to bytecode (`.pyc` files) during the build. This saves the interpreter a small amount of work at startup.
```dockerfile
# In your final stage
# ...
COPY . .
RUN python -m compileall .
# ...
```
For a detailed guide on creating smaller images, see my article on slimmer FastAPI Docker images with multi-stage builds.
Your application's startup time is also affected by the code and dependencies it loads.
Heavy Dependencies: If you use large libraries like TensorFlow or PyTorch, import only the submodules you need at startup. Consider lazy-loading other parts of the library within specific functions where they are used.
For example, instead of importing a heavy library at the top of your file, which slows down startup:
```python
# This slows down startup because pandas is loaded immediately
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
```
Import it inside the route that needs it. This way, the library is only loaded into memory when that specific endpoint is called, improving initial startup time.
```python
# main.py
from fastapi import FastAPI

app = FastAPI()


@app.get("/generate-report")
async def generate_report():
    # pandas is imported only when this endpoint is called
    import pandas as pd

    data = {"col1": [1, 2], "col2": [3, 4]}
    df = pd.DataFrame(data=data)
    return {"report_size": len(df)}
```
Large Assets: For static web assets like images, videos, CSS, and JavaScript files, it is best practice to not serve them from your FastAPI application at all. Instead, upload them to an object store like Google Cloud Storage. You can then serve them directly from the storage bucket or, for better performance, put the bucket behind a Content Delivery Network (CDN) like Google Cloud CDN. This offloads traffic from your Cloud Run instances and delivers content to users faster by caching it at edge locations around the world.
If your application needs large files like AI models, include them in your container image for fast access. But load them into memory after the application has started and is ready to serve health checks. For non-critical assets like media files, consider using Cloud Storage volume mounts to access them without bloating your container image.
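As a sketch, a Cloud Storage volume mount can be added with flags along these lines (the bucket name, volume name, and mount path are placeholders; check the gcloud documentation for the exact syntax in your version):

```bash
gcloud run services update YOUR_SERVICE_NAME \
  --add-volume=name=assets,type=cloud-storage,bucket=YOUR_BUCKET \
  --add-volume-mount=volume=assets,mount-path=/mnt/assets \
  --region=us-central1
```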
Your performance tuning choices directly impact cost. Cloud Run offers two billing models: request-based billing, where CPU is allocated and charged only while requests are being processed, and instance-based billing, where CPU is always allocated and you pay for the entire lifetime of each instance.
Setting `--min-instances` means you are paying to keep those instances warm even when they are idle. While this improves performance by eliminating cold starts, it also increases costs. You must balance the need for low latency with your budget. Using startup CPU boost, on the other hand, only adds a small cost during the brief startup period.
Q: Why use Gunicorn with Uvicorn instead of just running Uvicorn alone?

A: Gunicorn acts as a process manager. It can manage multiple Uvicorn worker processes, allowing your application to fully utilize multi-core CPUs and handle more requests in parallel. It also provides robustness by automatically restarting workers that crash, which is essential for a production environment. Running Uvicorn directly is simpler but typically limits you to a single process.
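For comparison, the two invocations look roughly like this (the module path and port are placeholders):

```bash
# Single process — fine for development
uvicorn app.main:app --host 0.0.0.0 --port 8080

# Process manager with three Uvicorn workers — better for production
gunicorn -w 3 -k uvicorn.workers.UvicornWorker app.main:app --bind :8080
```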
Q: What's the most cost-effective way to reduce cold starts without setting `--min-instances=1`?

A: The most cost-effective method is to enable the startup CPU boost. This accelerates instance startup time without the cost of a continuously running instance. Combining this with a smaller container image and lazy-loading heavy dependencies will give you the best performance improvement for the lowest cost.
Q: How do I decide how much CPU and memory to allocate to my Cloud Run service?

A: Start with a baseline, such as 1 vCPU and 512MiB of memory. After deploying, use Google Cloud Monitoring to observe the actual CPU and memory utilization under load. If your service is consistently near its limits, increase the allocation. If it's using very little, you can reduce it to save costs. The right size depends entirely on your application's specific workload.
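For example, that starting baseline can be set explicitly at deploy time:

```bash
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/PROJECT_ID/YOUR_IMAGE \
  --cpu=1 \
  --memory=512Mi \
  --region=us-central1
```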
Q: When should I use `--min-instances`?

A: Use minimum instances for user-facing applications where low latency is critical. For background services or APIs with non-critical response times, the default scaling from zero is more cost-effective.
David Muraya is a Solutions Architect specializing in Python, FastAPI, and Cloud Infrastructure. He is passionate about building scalable, production-ready applications and sharing his knowledge with the developer community. You can connect with him on LinkedIn.
Have a project in mind? Send me an email at hello@davidmuraya.com and let's bring your ideas to life. I am always available for exciting discussions.