A crude gRPC benchmark
I wanted to run a very quick and dirty benchmark to see roughly how fast gRPC server-to-server calls are within the same datacenter and over TLS, just to get a feel for what the order of magnitude is.
Bottom line: it takes about 0.6-0.7 ms for a full RPC over TLS from one server to another in Python.
Setup
The code is available on GitHub.
I lifted some Python code and a gRPC method definition from Couchers.org and wrapped it in an otherwise empty servicer that filled the message with dummy data. Then I launched two t3a.micro (if I recall correctly) boxes in EC2 in us-east-1 on AWS with Ubuntu 20.04 and got a Let’s Encrypt cert for one of them, making that the server. I made a direct TLS connection from the client to the server, spammed it with dummy requests in an infinite loop, and timed them with Python’s perf_counter_ns.
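For flavor, the client side of something like this is only a few lines. Here’s a minimal sketch (not the actual code from the repo; the dummy_pb2 modules and service names are hypothetical stand-ins for the Couchers.org definitions I used):

```python
# Minimal sketch of the client-side timing loop; dummy_pb2/dummy_pb2_grpc
# and the DummyService names are hypothetical stand-ins.
import grpc
from time import perf_counter_ns

import dummy_pb2
import dummy_pb2_grpc

# Direct TLS connection to the server holding the Let's Encrypt cert.
channel = grpc.secure_channel("server.example.com:443", grpc.ssl_channel_credentials())
stub = dummy_pb2_grpc.DummyServiceStub(channel)

while True:
    start = perf_counter_ns()
    stub.GetDummy(dummy_pb2.DummyRequest())
    print(f"{(perf_counter_ns() - start) / 1e6:.3f} ms")
```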
This is not rigorous! If you wanted to do this properly, you’d want to think it through in more detail and control for a bunch of things. For example, in this benchmark the code is hot in the CPU cache because it’s being run repeatedly with nothing else going on, which makes it faster than in typical deployments, where plenty of other code is fighting for cache. You should probably also plot histograms of latency over multiple runs to get an idea of what the distribution looks like, and so on. A lot of factors affect this kind of benchmark, but here I just wanted a ballpark figure for how long an RPC takes.
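If I were to redo this more carefully, I’d at least collect the raw samples and look at percentiles rather than a single figure; something along these lines (again a sketch, not what I actually ran):

```python
# Sketch: collect a batch of latency samples and summarize the distribution.
from statistics import quantiles
from time import perf_counter_ns

def run_batch(call, n=10_000):
    samples_ms = []
    for _ in range(n):
        start = perf_counter_ns()
        call()
        samples_ms.append((perf_counter_ns() - start) / 1e6)
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(samples_ms, n=100)
    print(f"p50={cuts[49]:.3f} ms  p90={cuts[89]:.3f} ms  p99={cuts[98]:.3f} ms")

# e.g. run_batch(lambda: stub.GetDummy(dummy_pb2.DummyRequest()))
```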
Why this matters
That’s the end of the actual substance; the rest is just some of my recent thoughts.
The gist of it is that if an RPC call takes less than a millisecond, you can do a lot of them, even in the most latency-sensitive apps.
If you’re going to build a fast service, you obviously need to be very careful about how long your servers take to generate a response. There’s a fairly well-known document called “Latency Numbers Every Programmer Should Know”, often credited to Jeff Dean of Google. I’d seen it before, but I stumbled across it again in the past few days and started thinking about some dilemmas of engineering very fast backend APIs.
Say you have a budget of 20 ms, or 50 ms, or even 100 ms per response; that’s what you define as acceptable latency per API call: from the time the user sends a request to the time they get a response back. How do you make that happen? On a project I’m currently working on (Couchers.org), for example, this is simply impossible given our architecture. We have a single monolithic app with a server on the East Coast of the US, so if you’re in Australia, you’ll see round-trip times of around 200 ms. This is inevitable: a round trip over fiber takes at least 1 ms per 100 km of distance (light travels in fiber at around 200,000 km/s, and it has to go there and back, which works out to 100 km per ms). Add a bit for switching, other inefficiencies, and the fact that you don’t travel in a straight line, and you still generally end up pretty close to the theoretical minimum latency. We’d need to get closer to the user.
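To put rough numbers on that (the distance here is an assumption, roughly the great-circle distance from the US East Coast to Australia):

```python
# Lower bound on round-trip time over fiber: light propagates in fiber at
# roughly 200,000 km/s and has to cover the distance twice.
def min_rtt_ms(distance_km, fiber_speed_km_per_s=200_000):
    return 2 * distance_km / fiber_speed_km_per_s * 1000

# ~16,000 km is a rough figure for US East Coast to Australia, giving a
# floor on the order of 160 ms before switching, detours, and processing.
print(min_rtt_ms(16_000))  # -> 160.0
```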
But getting closer to the user has issues. Essentially any way of organizing and storing your data still requires that writes are somehow reconciled at some single point on the planet (you can win a bit with smart algorithms and better data structures, or by hosting a user’s source-of-truth data near that user, but this is the case for most apps). In our case, data is kept in a source-of-truth Postgres database.
However, not all is lost. You can still speed up reads by caching results near the user. But it’s rarely the case that a user performs the exact same API call repeatedly in short succession, so caching whole responses probably isn’t very fruitful. What you want to do instead is cache data not one API call at a time, but as units of data that change (and by extension are invalidated) together. For example, in Couchers.org we have user profiles and events, and when a user opens the page of an event, they want general information about the event, information about each organizer, and information about how they themselves relate to the event (whether they’ve indicated they’re attending, whether they’re an organizer, etc.). Each of these bits of data (we often like to call them resources) can be cached by itself. The general event data changes rarely and is the same for every user. Each user’s profile information changes rarely. But if you were to cache the full event information as returned by some GetEvent RPC, it would need to be invalidated every time the event or any user involved in it changes. So instead you want to cache the general event data and each user’s profile separately. This also shows why cache invalidation is famously one of the two hard problems in computer science: you’re basically trying to replicate all the ACID goodness that a relational database guarantees by manually tracing where data changes and what should be invalidated, on top of your database. One can see why that’s error-prone and difficult.
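As a sketch of what per-resource caching could look like (the Redis key scheme and the load_event_from_db helper are made up for illustration):

```python
# Cache each resource under its own key so that invalidation follows the
# data: editing a profile touches only that user's key, not every cached
# event response the user appears in. load_event_from_db is hypothetical.
import json
import redis

r = redis.Redis()

def get_event_resource(event_id):
    key = f"event:{event_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    event = load_event_from_db(event_id)  # hypothetical source-of-truth query
    r.set(key, json.dumps(event), ex=300)  # short TTL as a safety net
    return event

def invalidate_event(event_id):
    # Called from the write path when the event itself changes; profile edits
    # would instead delete user:<id>, and RSVP changes only the user-event
    # relationship key.
    r.delete(f"event:{event_id}")
```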
Where does this lead us? It suggests that maybe we should divide the app into components, each responsible for some small part of it (e.g. profiles, general event data, user-event relationships), in such a way that we can cache their data individually and then compose these pieces into full API results (all the information a user needs, from their point of view, about an event).
Additionally, this architectural style allows the edge server to take the request and initiate requests to most of its dependent services in parallel, because those services presumably don’t interact much. This can further speed up the app: instead of doing three complicated SQL queries in a monolithic app, we send three requests to sub-services that maybe do SQL queries, or maybe return data from a cache, but at least it all happens in parallel. Of course, one might argue that you can do all of this parallelism in a monolithic app too, which is fair criticism.
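A sketch of that fan-out with grpc.aio (the sub-service stubs, messages, and method names are all made up for illustration):

```python
import asyncio

# Hypothetical generated protobuf modules for three imagined sub-services.
import events_pb2
import profiles_pb2
import rsvp_pb2

async def get_event_page(event_id, user_id, events, profiles, rsvp):
    # `events`, `profiles` and `rsvp` are grpc.aio stubs. Each unary call on
    # an aio stub returns an awaitable, so asyncio.gather issues all three
    # RPCs concurrently instead of waiting for them one after another.
    event, organizers, status = await asyncio.gather(
        events.GetEvent(events_pb2.GetEventReq(event_id=event_id)),
        profiles.GetOrganizers(profiles_pb2.OrganizersReq(event_id=event_id)),
        rsvp.GetStatus(rsvp_pb2.StatusReq(event_id=event_id, user_id=user_id)),
    )
    return event, organizers, status
```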
And this is how we get back to the speed of a gRPC invocation: if the overhead per API call is sub-millisecond, then we can do a whole lot of RPC calls within every original request. With the addition of parallelism, possibly at multiple levels of depth, this means we can do hundreds of RPC calls just to service one request.
Of course, this significantly complicates the app, and tearing apart a simple monolithic app is often overkill. But as the app grows and more people get involved, these improvements in latency often go hand in hand with improvements in maintainability, as each microservice is tended to by a dedicated team that need not worry too much about the internals of other services. There are a few interesting projects that help make this tech accessible even to smaller teams. One very promising option seems to be fly.io; I’ve been investigating their service, which seems relatively affordable and also comes with a built-in Redis cluster in each region. Maybe you don’t have to be “planet scale”, with data centers on every continent and POPs in every city, to do this kind of stuff?