We use Memcached as a distributed cache. We also shard MySQL, and each customer's data lives on a different shard. The mapping of customer to shard is static and only changes when we do a manual customer move. Because we plan to move customers automatically in the future, I stored this mapping in Memcached instead of in the JVM, so I wouldn't need to coordinate an in-memory cache flush across multiple boxes.
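For context, the per-request lookup looked roughly like the sketch below. This is a minimal illustration rather than the actual code: it assumes the spymemcached client and a hypothetical customer_shard_<id> key format, since the real key naming isn't shown here.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class ShardLookup {
    private final MemcachedClient memcached;

    public ShardLookup(MemcachedClient memcached) {
        this.memcached = memcached;
    }

    /** Every request pays a network round trip to Memcached to resolve the shard. */
    public String shardFor(long customerId) {
        // Hypothetical key format; the actual key naming is not shown in the post.
        Object shard = memcached.get("customer_shard_" + customerId);
        return shard != null ? (String) shard : null;
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
        ShardLookup lookup = new ShardLookup(client);
        System.out.println(lookup.shardFor(42L));
        client.shutdown();
    }
}
```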
We recently moved a large-scale system from Python to Java and were getting close to 2K requests per minute on every machine. Checking the APM tool, I found that a lot of time was being spent in memcached.getObject. Looking at one of the many Memcached boxes, I saw that one of its cores was pegged at 100% CPU. I installed mctop by Etsy and found that the hottest key was the one used to look up which shard a customer is on:
sudo /opt/mctop/bin/mctop --interface=eth0 --port=23456
Inspecting all the other data centres showed similar symptoms. So I added code to cache the mapping in a ConcurrentHashMap inside the JVM instead of in Memcached, and handled customer moves by implementing a distributed flush. The fix went live last week, and response times immediately improved by 1-2 ms across all services. I saw Get operations drop from 15K to 5K per second on the Memcached instances, along with a drop in CPU usage on the Memcached boxes.
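A minimal sketch of the fix, under the same assumptions as above (spymemcached client, hypothetical key format). The post doesn't describe how "customer moved" events reach each JVM for the distributed flush, so the flush method below only shows the local side of that invalidation.

```java
import java.util.concurrent.ConcurrentHashMap;
import net.spy.memcached.MemcachedClient;

public class CustomerShardCache {
    private final ConcurrentHashMap<Long, String> localCache = new ConcurrentHashMap<>();
    private final MemcachedClient memcached;

    public CustomerShardCache(MemcachedClient memcached) {
        this.memcached = memcached;
    }

    /** Serve from the JVM-local map; only fall through to Memcached on first access. */
    public String shardFor(long customerId) {
        return localCache.computeIfAbsent(customerId,
                id -> (String) memcached.get("customer_shard_" + id));
    }

    /**
     * Called when a "customer moved" event arrives (e.g. via a pub/sub channel --
     * the delivery mechanism is an assumption), so each JVM drops its stale entry.
     */
    public void flush(long customerId) {
        localCache.remove(customerId);
    }
}
```

Since the mapping only changes on a customer move, the local map stays valid almost indefinitely, and the per-request Memcached round trip disappears entirely after warm-up.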
Lesson learnt: in high-scale systems, even a Memcached call is not that cheap.
[Graph: Get operations drop on one instance]
[Graph: CPU usage drop on one instance]
[Graph: Total operations drop on one instance]
We observed a similar issue in another part of the system with a different out-of-process call. Each sync request was calling a remote Authentication service to do token-based authentication, at close to 7K requests per minute on each pod. Under heavy bursts that service started taking 35-200 ms instead of 30 ms. We made a fix over the weekend to remove that call, and the system has been calm since the fix went live (the vertical line in the graph marks when it went live).
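The post doesn't say exactly how the authentication call was removed. One common way to take a remote token check off the hot path is a short-lived in-process cache of validation results; the sketch below assumes that approach, with a hypothetical RemoteAuthClient interface standing in for the real Authentication service client.

```java
import java.util.concurrent.ConcurrentHashMap;

public class TokenAuthCache {
    /** Validation result plus the time it was cached (epoch millis). Requires Java 16+. */
    private record Entry(boolean valid, long cachedAt) {}

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final RemoteAuthClient remote;   // hypothetical client for the auth service

    public TokenAuthCache(RemoteAuthClient remote, long ttlMillis) {
        this.remote = remote;
        this.ttlMillis = ttlMillis;
    }

    public boolean isAuthenticated(String token) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(token);
        if (e != null && now - e.cachedAt() < ttlMillis) {
            return e.valid();                   // served locally, no remote call
        }
        boolean valid = remote.validate(token); // remote call only on miss or expiry
        cache.put(token, new Entry(valid, now));
        return valid;
    }

    /** Hypothetical interface for the remote Authentication service. */
    public interface RemoteAuthClient {
        boolean validate(String token);
    }
}
```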