{"id":21930,"date":"2024-08-13T11:45:06","date_gmt":"2024-08-13T09:45:06","guid":{"rendered":"https:\/\/chartmogul.com\/blog\/?p=21930"},"modified":"2024-08-15T03:28:30","modified_gmt":"2024-08-15T01:28:30","slug":"autoscaling-sidekiq","status":"publish","type":"post","link":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/","title":{"rendered":"Why we decided to rebuild our Sidekiq infrastructure to support greater scale"},"content":{"rendered":"\n<p>We often exhausted our processing capacity which led to hours-long incidents and long delays for our users. Our on-call personnel had to deal with these incidents manually, but had insufficient tooling. And adding new threads didn\u2019t help. In fact, we often only realized something was wrong several hours later due to missing monitoring. We also often overwhelmed the database leading to poor frontend response times.<\/p>\n\n\n\n<p>These problems were not a consequence of Sidekiq being unperformant or Ruby on Rails not being compiled, but a design flaw of the queues that didn\u2019t consider the most important characteristics of the system. <\/p>\n\n\n\n<p>A year ago we switched to a fundamentally different concept and in this post I&#8217;ll guide you through the approach in detail from the implications on our infrastructure setup, to how we prioritize and measure jobs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Application context<\/strong><\/h2>\n\n\n\n<p>Sidekiq is a great framework for asynchronous processing for Ruby on Rails applications. It uses Redis to store jobs. We pay for the Enterprise version as it provides additional features. It allows us to spread load, even with bursts over time, so that our infrastructure could be used efficiently. 
Our major bottleneck is the PostgreSQL databases we use to compute &amp; store our analytics data.<\/p>\n\n\n\n<p>Background processing is handy because you don\u2019t always have to be scaled up to handle the peak load, when most of the time you need much less capacity. But at the same time, you also have to be careful not to break important non-functional requirements. One of them is that accounts should not be starved while another account is running a large processing batch.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How does our application scale?<\/strong><\/h2>\n\n\n\n<p>Before we dive into Sidekiq, it\u2019s important to explain the relevant context of our application. To scale, we\u2019ve been using database sharding since the application\u2019s inception:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.<\/p>\n<\/blockquote>\n\n\n\n<p>\u2014 <a href=\"https:\/\/en.wikipedia.org\/wiki\/Shard_(database_architecture)\" target=\"_blank\" rel=\"noreferrer noopener\">Shard (database architecture)<\/a>, Wikipedia<\/p>\n\n\n\n<p>In our case we split data into separate shards by account. 
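<\/p>

<p>As a toy illustration of this idea (not our actual routing code; the names and the balancing policy here are assumptions), account-to-shard assignment can be as simple as a sticky lookup table that sends new accounts to the least-loaded shard:<\/p>

```ruby
# Illustrative sketch only: ChartMogul's real assignment logic is not shown
# in this post, so the policy below is an assumption.
class ShardRouter
  def initialize(shards)
    @shards = shards        # e.g. ["shard01", "shard02"]
    @assignments = {}       # account_id => shard name (sticky)
  end

  # Adding a shard only affects accounts assigned afterwards.
  def add_shard(name)
    @shards << name
  end

  # Existing accounts stay put; new accounts go to the least-loaded shard.
  def shard_for(account_id)
    @assignments[account_id] ||= least_loaded_shard
  end

  private

  def least_loaded_shard
    counts = @assignments.values.tally
    @shards.min_by { |shard| counts.fetch(shard, 0) }
  end
end
```

<p>The exact policy does not matter for the rest of the post; what matters is that every account maps to exactly one shard.<\/p>

<p>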
And as our user base grows, we just add a new database shard from time to time.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Shard-selection.png\"><img loading=\"lazy\" decoding=\"async\" width=\"474\" height=\"338\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Shard-selection.png\" alt=\"what's my account ID?\" class=\"wp-image-21933\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Shard-selection.png 474w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Shard-selection-300x214.png 300w\" sizes=\"auto, (max-width: 474px) 100vw, 474px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>The load should spread evenly as long as the accounts behave randomly without huge excesses. This is not always the case, though, and sometimes the main limitation of this approach becomes apparent \u2014 one shard has a fixed capacity and you can\u2019t give an account the processing capacity of more than one shard. There are remedies like vertical scaling of the database (i.e. upgrading the instance type on AWS), but they\u2019re expensive, and ideally we\u2019d like the system to self-regulate.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Sidekiq cluster<\/strong><\/h2>\n\n\n\n<p>Before we start talking about the queues, let me explain how we had the infrastructure set up. That way you will have a better idea about the components we\u2019re going to talk about. 
First let\u2019s have a look at how you can set up a Sidekiq cluster in theory:<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Sidekiq-Cluster.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1002\" height=\"552\" data-id=\"21934\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Sidekiq-Cluster.png\" alt=\"Sidekiq cluster\" class=\"wp-image-21934\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Sidekiq-Cluster.png 1002w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Sidekiq-Cluster-300x165.png 300w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Sidekiq-Cluster-720x397.png 720w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Sidekiq-Cluster-300x165@2x.png 600w\" sizes=\"auto, (max-width: 1002px) 100vw, 1002px\" \/><\/a><\/figure>\n<\/figure>\n\n\n\n<p>The smallest grain of processing capacity in this overview is the <strong>thread<\/strong>. It gives you processing capability for one <strong>job<\/strong> at a time. A job has a <strong>job class<\/strong> and arguments. These are the very basics of Sidekiq. You can think about the thread as the consumer part in the <em>distributed<\/em> <em>producer-consumer<\/em> pattern.<\/p>\n\n\n\n<p>How do you determine which jobs to assign to threads? You configure this for each <strong>process<\/strong>. It has a list of <strong>queues<\/strong> to consume from (eg. Q1, Q2\u2026). 
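<\/p>

<p>With strict ordering, picking work is simple: the process scans its queue list in the configured order and takes the first job it finds. A pure-Ruby sketch of that selection (queue names are illustrative):<\/p>

```ruby
# Sketch of how a Sidekiq process with strictly ordered queues picks work:
# scan the queues in the configured order and take the first available job.
def next_job(ordered_queues)
  ordered_queues.each do |name, jobs|
    return [name, jobs.shift] unless jobs.empty?
  end
  nil # all queues empty: the process idles and polls again later
end
```

<p>Weighted queues, by contrast, sample randomly in proportion to the configured weights instead of always draining the first queue.<\/p>

<p>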
Now, you can use <a href=\"https:\/\/github.com\/sidekiq\/sidekiq\/wiki\/Advanced-Options#queues\" target=\"_blank\" rel=\"noreferrer noopener\">weighted queues<\/a>, but we opted for strict priority as it gives better performance and it\u2019s easier to reason about.<\/p>\n\n\n\n<p>You can have multiple processes per <strong>instance<\/strong>, which will typically live somewhere in the cloud, for example as an AWS EC2 instance. Perhaps you want to set up a certain arrangement of configurations for such an instance &#8211; let\u2019s call it a <strong>role<\/strong>. You can then deploy multiple copies of such instances with the same role to scale. All of this together we call a Sidekiq <strong>cluster<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Original setup<\/strong><\/h2>\n\n\n\n<p>We had multiple queues named after what the jobs in them were approximately doing. But a lot of them were also named after teams or the subsystems managed by those teams. This unsurprisingly follows <a href=\"https:\/\/en.wikipedia.org\/wiki\/Conway%27s_law\" target=\"_blank\" rel=\"noreferrer noopener\">Conway&#8217;s law<\/a>:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization&#8217;s communication structure.<\/p>\n<\/blockquote>\n\n\n\n<p>We had some queues that implied low or high priority, but also a very highly prioritized <code>default<\/code> queue \u2014 so if you just forgot to name a specific queue, your job class got really lucky. It now had the highest priority in the app, congratulations! We also had a mixture of roles in the cluster that shuffled these queues into different, somewhat random-looking setups. 
That often led to unexpected side effects, like dedicated capacity for low-priority queues that ran jobs even during high load.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues.png\"><img loading=\"lazy\" decoding=\"async\" width=\"511\" height=\"334\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues.png\" alt=\"queues with priority\" class=\"wp-image-21935\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues.png 511w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues-300x196.png 300w\" sizes=\"auto, (max-width: 511px) 100vw, 511px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>How did we arrive at such a situation? Over the years different developers added new jobs to implement features and always tweaked the queues just a little bit to make their one feature work. However, after many such iterations without any guiding design principle, the system became a very complicated tangle of queues where nobody understood all the consequences.<\/p>\n\n\n\n<p>Another problem was scalability. Each instance that we had was running several processes with a mixture of queues, several <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/instance-optimize-cpu.html\" target=\"_blank\" rel=\"noreferrer noopener\">virtual central processing units<\/a> (vCPUs) and 20 GB of RAM. To consume a queue faster, you had to add expensive instances, and we had little idea how efficiently we were using them. 
The catch is that Ruby processes running on CRuby can each leverage just one vCPU, regardless of the number of internal threads, due to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Global_interpreter_lock\" target=\"_blank\" rel=\"noreferrer noopener\">Global Interpreter Lock<\/a>.<\/p>\n\n\n\n<p>Imagine you have 6 vCPUs on an instance, but only 2 processes are consuming queues with jobs:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/CPU-utilization.png\"><img loading=\"lazy\" decoding=\"async\" width=\"471\" height=\"398\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/CPU-utilization.png\" alt=\"utilization of CPU per instance\" class=\"wp-image-21936\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/CPU-utilization.png 471w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/CPU-utilization-300x254.png 300w\" sizes=\"auto, (max-width: 471px) 100vw, 471px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>You\u2019ll see around 33 % utilization, because the other 4 cores have nothing to do. If you don\u2019t know the above-mentioned technical details, you might be left wondering why the queues are not being consumed while the instance is underutilized. But in reality, the cores that can do something are overwhelmed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Typical incident scenario<\/strong><\/h2>\n\n\n\n<p>To better illustrate our growing motivation for the new design, let\u2019s imagine a typical incident. You come to work and you see a queue that has been growing linearly for a couple of hours. No signs of the situation improving. 
You also see warnings that one shard is very busy, but other shards are not:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Typical-incident.png\"><img loading=\"lazy\" decoding=\"async\" width=\"797\" height=\"583\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Typical-incident.png\" alt=\"typical incident scenario\" class=\"wp-image-21938\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Typical-incident.png 797w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Typical-incident-300x219.png 300w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Typical-incident-720x527.png 720w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Typical-incident-300x219@2x.png 600w\" sizes=\"auto, (max-width: 797px) 100vw, 797px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>Additionally, when you look at pgbouncer connection pool statistics, you see a flat line near the maximum for the overwhelmed shard. You can see client connections waiting to get anything done on the database and when they get there, the queries run excessively long. This means the jobs also run excessively long. Meanwhile all Sidekiq threads that can process the growing queue are busy, while others are sitting idle. All of this points to the fact that our design is not distributing load evenly across shards, so that jobs that could easily get processed on underutilized shards wait behind jobs that struggle on the saturated DB shard.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Introducing: Kubernetes<\/strong><\/h2>\n\n\n\n<p>Previously <a href=\"https:\/\/chartmogul.com\/blog\/moving-chartmogul-to-aws-and-kubernetes\/\" target=\"_blank\" rel=\"noreferrer noopener\">we migrated to AWS<\/a>, and with that we also introduced Kubernetes into our infrastructure stack. 
With this step we gained new possibilities: infrastructure as code (IaC) and autoscaling. However, to leverage the cloud resources fully and perhaps save some budget, we first had to do some research and adjust the system.<\/p>\n\n\n\n<p>The original setup was not ready for autoscaling &#8211; the instances were too big and it was not clear what metric to pick for scaling. Even the jobs themselves were not ready, as they expected the instances to run continuously for 24 hours between deployments. So, we knew we had to get smaller instances and optimize some of the jobs, but what queues should we use?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Learning<\/strong><\/h2>\n\n\n\n<p>Before designing the changes, I had already realized that throwing jobs into random queues doesn\u2019t scale. I had also noticed we were lacking a meaningful way to distribute the load generated by our asynchronous processing across the database shards. This became obvious while addressing repeated incidents in which a particular database shard got saturated either on CPU or allocated <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/ebs-io-characteristics.html\">I\/O operations per second<\/a> (IOPS). 
Meanwhile other shards were not loaded, even though we had jobs for those shards sitting in the queues for a very long time.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"640\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1024x640.png\" alt=\" Sidekiq in Practice\" class=\"wp-image-21940\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1024x640.png 1024w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-300x188.png 300w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1536x960.png 1536w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-720x450.png 720w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled.png 1680w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-300x188@2x.png 600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>A colleague recommended the book &amp; workshop <a href=\"https:\/\/nateberk.gumroad.com\/l\/sidekiqinpractice\">Sidekiq in Practice<\/a> by Nate Berkopec to me. It was a very insightful read that inspired me to rethink the system from the ground up.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Thinking about users<\/strong><\/h2>\n\n\n\n<p>Until now we\u2019ve talked about different system parameters, but stayed completely silent about the final consumers of the analytics \u2014 the users. The ultimate goal of all of our processing is to be ready to show something to the user \u2014 be it a notification, a chart or some numbers in a table. Our users need those to gain insights about their business. And different parts need to be delivered within different timelines. 
For this we leverage the <a href=\"https:\/\/www.atlassian.com\/incident-management\/kpis\/sla-vs-slo-vs-sli\" target=\"_blank\" rel=\"noreferrer noopener\">SLA-SLO-SLI framework<\/a>:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/SLA-SLO-SLI.png\"><img loading=\"lazy\" decoding=\"async\" width=\"554\" height=\"343\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/SLA-SLO-SLI.png\" alt=\"SLA, SLO, SLI\" class=\"wp-image-21941\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/SLA-SLO-SLI.png 554w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/SLA-SLO-SLI-300x186.png 300w\" sizes=\"auto, (max-width: 554px) 100vw, 554px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>It is important to realize two facts:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Your SLOs must be measurable, which means you must be able to come up with matching SLIs. Otherwise you won\u2019t be able to evaluate the impact of changes and you won\u2019t know whether they are for better or worse.<\/li>\n\n\n\n<li>Fulfilling individual SLOs might not correspond to fulfilling the overall SLA. Think about a user waiting for an action to finish \u2014 but the action is composed of multiple jobs. Individual jobs might run fast, but maybe they interact in such a way that the overall action takes a long time to finish. That\u2019s why it\u2019s also a good idea to have end-to-end measurements.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Finally, the new design<\/strong><\/h2>\n\n\n\n<p>If you managed to read this far, I have good news! 
We now have enough background to introduce the new design that solves our issues and adds some perks as a bonus.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues-in-diagram-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"634\" height=\"416\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues-in-diagram-1.png\" alt=\"Finally, the new design\" class=\"wp-image-21942\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues-in-diagram-1.png 634w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues-in-diagram-1-300x197.png 300w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Queues-in-diagram-1-300x197@2x.png 600w\" sizes=\"auto, (max-width: 634px) 100vw, 634px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>The diagram might seem complex \u2014 but in fact it\u2019s nothing more than a couple of standard Sidekiq <em>first-in-first-out<\/em> (FIFO) queues (represented by rectangles). Each row represents queues for jobs that connect to one database shard. Each column represents a priority level, or in other words acceptable delay, or the SLO.<\/p>\n\n\n\n<p>As a consequence of this setup each job must know 2 things when being pushed (created) \u2014 its importance for the user and the shard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>shard<\/strong> is easy, because it is determined by the account. Jobs operating on multiple shards can (and should) always be split into N independent jobs running on a single shard. By removing interdependence between shards, you achieve isolation of queues. 
Then if an incident happens, the impact on your system will be limited and you achieve partial availability, which is better than full unavailability.<\/li>\n\n\n\n<li>The <strong>priority<\/strong> is trickier, because now the author\/maintainer of a job class must think about how it will impact the SLA, but also about other questions: how many such jobs might be pushed in a burst, what latency characteristics the jobs have, and so forth. To make this long-term maintainable, we created a documentation-heavy class with a sort of questionnaire that enables developers to evaluate for themselves whether a job class fulfills the criteria for higher-priority queues.<\/li>\n<\/ul>\n\n\n\n<p>Recently, we have also finished <a href=\"https:\/\/chartmogul.com\/blog\/how-migrating-our-database-eliminated-data-processing-incidents\/\" target=\"_blank\" rel=\"noreferrer noopener\">sharding our ingestion database<\/a> to remove a single point of failure (SPOF), and as you can see, it ties in nicely with this design, distributing the load over the shards.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Measurement (SLI)<\/strong><\/h2>\n\n\n\n<p>To compare the job delay against each queue\u2019s SLO, we first need to measure it. 
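<\/p>

<p>Concretely, the SLI is the age of the oldest job waiting in a queue; Sidekiq exposes it as <code>Sidekiq::Queue#latency<\/code>. The underlying computation is just this (a pure-Ruby sketch over plain timestamps):<\/p>

```ruby
# The queue-delay SLI: seconds since the oldest waiting job was enqueued.
# Sidekiq derives the same number from the enqueued_at timestamp of the
# oldest element of the queue; this sketch works on plain Time values.
def queue_latency(enqueued_at_times, now: Time.now)
  oldest = enqueued_at_times.min
  oldest ? now - oldest : 0.0
end
```

<p>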
We collect the delay of the oldest job in a queue and send it to DataDog to generate insightful dashboards:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"509\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43-1024x509.png\" alt=\"dashboards \" class=\"wp-image-21943\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43-1024x509.png 1024w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43-300x149.png 300w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43-720x358.png 720w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43.png 1413w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Schermafdruk-van-2023-07-14-15-16-43-300x149@2x.png 600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>On this screenshot you can see two of our shards and their 5 queues each. Highest priority on the left, longest acceptable latency on the right. I picked this representation, because for each SLI it shows current value, but also the history and range of the value over a given period (an hour for example). That gives you a good idea of how the situation is developing.<\/p>\n\n\n\n<p>When the latency breaches the SLO, the color changes to <em>orange<\/em>. And as you can see in this example, the shards are indeed isolated, allowing us to maintain SLOs elsewhere, even though shard 05 had issues with a large burst of jobs that we had to manually throttle. 
On the infrastructure side they also scale separately (see the next section).<\/p>\n\n\n\n<p>I really like how this conveys information about the whole system on one page, with the possibility to drill down into details. If you\u2019re looking for inspiration on how to make great readable charts, then <a href=\"https:\/\/www.edwardtufte.com\/tufte\/books_vdqi\" target=\"_blank\" rel=\"noreferrer noopener\">The Visual Display of Quantitative Information<\/a> by Edward Tufte is a must-read.<\/p>\n\n\n\n<p>But eye candy is not the only purpose, and even we at ChartMogul don\u2019t have all day to stare at charts. We still want to be aware ASAP when something goes wrong, so when a delay reaches <em>red<\/em> (twice the expected value), we trigger an on-call alarm. We could only do this once we became confident enough about fulfilling the SLO. That required a lot of optimization and refactoring of the jobs, which now have to live up to higher standards.<\/p>\n\n\n\n<p>But in reality this is not the only dashboard. We still need more dashboards that measure concrete, complex operations end to end across the different services &amp; jobs. Those give better answers to questions such as <em>how long does it take to process a webhook from receiving it to showing a notification to the user<\/em>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Autoscaling<\/strong><\/h2>\n\n\n\n<p>As I\u2019ve already mentioned, we automatically scale the queue processing capacity, but how do we do it precisely? 
We leverage <a href=\"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/\" target=\"_blank\" rel=\"noreferrer noopener\">Horizontal Pod Autoscaling<\/a> (HPA), which controls the number of pods inside an autoscaling group, and <a href=\"https:\/\/prometheus.io\/\">Prometheus<\/a>, which provides the latency metrics scraped via the Sidekiq API.<\/p>\n\n\n\n<p>We cannot scale the database easily, so there are limits on how many connections can be active through <code>pgbouncer<\/code> pools. But we can scale the processing capacity of Sidekiq, so that we pay for fewer Sidekiq threads and fewer node instances on AWS. Our current topology looks like this:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Autoscaling.png\"><img loading=\"lazy\" decoding=\"async\" width=\"751\" height=\"516\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Autoscaling.png\" alt=\"autoscaling\" class=\"wp-image-21944\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Autoscaling.png 751w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Autoscaling-300x206.png 300w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Autoscaling-720x495.png 720w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Autoscaling-300x206@2x.png 600w\" sizes=\"auto, (max-width: 751px) 100vw, 751px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>The number of containers in each shard group goes up and down as demand from the accounts there changes. Meanwhile, the number of nodes changes as well, according to how many containers are required overall. There are some delays on both levels for practical reasons. For example, to reduce disruption you need to tweak the autoscaling policy and introduce a graceful shutdown period.<\/p>\n\n\n\n<p>Now, why did we choose such minimal containers? 
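<\/p>

<p>For concreteness, a minimal HPA manifest of the kind described above might look like the following sketch; the deployment name, metric name and threshold are illustrative assumptions, not our production configuration:<\/p>

```yaml
# Illustrative sketch only; names, the metric and the threshold are
# assumptions, not ChartMogul's production configuration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-shard01
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-shard01
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq_queue_latency_seconds  # provided via Prometheus
        target:
          type: Value
          value: "60"  # add pods when the oldest job waits over a minute
```

<p>The node group then grows or shrinks according to how many pods are scheduled overall.<\/p>

<p>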
Each of our containers has just 1 vCPU and a limited amount of memory. This gives us greater flexibility, because we can shift processing capacity granularly, i.e. in small increments. In other words, we don\u2019t have to add or remove big virtual machines all the time. It also gives us transparent infrastructure metrics, as we can directly observe the utilization of the vCPUs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Threads and memory<\/strong><\/h2>\n\n\n\n<p>Originally we had 20 threads per process. However, using DataDog APM Traces &amp; Container monitoring we found out that the vCPUs were often close to 100 % during peak throughput and jobs took longer than necessary, with suspicious gaps in traces. Those gaps were in fact Ruby threads waiting for CPU time. So we have now settled on just 4 threads, as that seems to be the best fit for our ratio of I\/O wait to computation. Sidekiq\u2019s default used to be higher, <a href=\"https:\/\/github.com\/sidekiq\/sidekiq\/wiki\/Advanced-Options#concurrency\" target=\"_blank\" rel=\"noreferrer noopener\">but it is now just 5<\/a>.<\/p>\n\n\n\n<p>A bigger container with multiple vCPUs, processes and large memory does have the advantage that occasional memory-hungry jobs still manage to finish without an out-of-memory kill\u2026 as long as they don\u2019t throw a party on one container at the same time, that is. However, we opted instead to limit the impact of such jobs and either fix them or send them to a special <em>quarantine<\/em> queue with extra memory, but less capacity. 
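<\/p>

<p>As a hypothetical sketch (class and queue names here are made up), this kind of routing boils down to a small queue-selection function combining the shard, the SLO level and a quarantine override:<\/p>

```ruby
# Hypothetical sketch of queue selection in this design: jobs are routed by
# shard and SLO priority, with known memory-hungry job classes overridden
# into the quarantine queue. All names are illustrative.
QUARANTINE = ["HugeReportJob"]

def queue_for(job_class, shard:, slo: "within_60_minutes")
  return "#{shard}_quarantine" if QUARANTINE.include?(job_class)
  "#{shard}_#{slo}"
end
```

<p>Pushing a job then only requires knowing the account (which determines the shard) and the job class (which determines the SLO and any override).<\/p>

<p>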
This means the normal pods can stay lean and optimized for 99.9% of the jobs.<\/p>\n\n\n\n<p>While the following long-term distribution doesn\u2019t tell the whole story, it shows the balance we struck between perfect CPU utilization and minimal job latency caused by CPU saturation:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"560\" height=\"200\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1.png\" alt=\"CPU saturation\" class=\"wp-image-21945\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1.png 560w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Untitled-1-300x107.png 300w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><\/a><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><strong>Long-running jobs<\/strong><\/h2>\n\n\n\n<p>Sidekiq has a self-protective mechanism against crashing jobs called <em>poison pills<\/em>. If a job is detected to be present during multiple crashes, it\u2019s not retried and is marked directly as <em>dead<\/em>. We had already encountered this with out-of-memory (OOM) crashes. Such jobs can usually be optimized to prevent these errors. But the mechanism can also flag some jobs as false positives. And after we migrated to an autoscaled architecture, we realized this was happening to us. Here\u2019s why:<\/p>\n\n\n\n<p>If you have autoscaling instances in the cloud, your jobs can\u2019t rely on them existing \u201cforever\u201d. So it\u2019s best if you can split the workload such that the jobs finish quickly. Let\u2019s say our goal is 5 minutes. Then you can also set your graceful shutdown to respect this.<\/p>\n\n\n\n<p>However, what if your job runs for <em>hours<\/em> and it\u2019s hard to refactor? 
Well, we put those in a special queue on a container role that wasn\u2019t autoscaled, and we thought the job was done. But we still saw the jobs being randomly killed, even without the graceful shutdown!<\/p>\n\n\n\n<p>What we didn\u2019t realize at first was that we had placed them together with other autoscaled containers on the same node group in Kubernetes, so Kubernetes sometimes decided it was a good idea to reshuffle the long-running pods, and unfortunately, right now Kubernetes doesn\u2019t respect the graceful shutdown period for this. In the end, we moved this job into a separate node group and also started refactoring the job.<\/p>\n\n\n\n<p>One aspect that we could still improve in our design is perhaps not putting the 1-minute through 24-hour SLO queues on the same roles, as they mix widely different latency expectations. So far we have circumvented this problem by requiring 5-minute latency in the 99th percentile for all jobs. But a simple thought experiment shows that a large burst of 5-minute jobs would quickly cause issues in the 1-minute queue; we\u2019ll discuss how we tackle this with rate limiting later on.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What about fair scheduling?<\/strong><\/h2>\n\n\n\n<p>One of the struggles you encounter in a multi-tenant application with shared resources is the competition for these resources. 
Ideally, you want your application to feel fast for an account that needs to compute a lot of data; however, you have to be careful not to starve other accounts:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>resource starvation<\/strong> is a problem encountered in concurrent computing where a process is perpetually denied necessary resources to process its work.<\/p>\n<\/blockquote>\n\n\n\n<p>\u2014<a href=\"https:\/\/en.wikipedia.org\/wiki\/Starvation_(computer_science)\"> Starvation (computer science)<\/a>, Wikipedia<\/p>\n\n\n\n<p>It\u2019s an interesting problem, so let me go a bit deeper into it. We attempted to solve this puzzle once and for all in 2018 with <a href=\"https:\/\/github.com\/chartmogul\/sidekiq-priority_queue\" target=\"_blank\" rel=\"noreferrer noopener\">a custom module<\/a> for Sidekiq that introduced prioritized queues. It worked like this (green = jobs added, red = jobs removed):<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Prioritized-queues.png\"><img loading=\"lazy\" decoding=\"async\" width=\"448\" height=\"247\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Prioritized-queues.png\" alt=\"prioritized jobs\" class=\"wp-image-21946\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Prioritized-queues.png 448w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Prioritized-queues-300x165.png 300w\" sizes=\"auto, (max-width: 448px) 100vw, 448px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>New jobs were inserted into a subqueue per account ID. We kept track of how many jobs there were for each account, and new jobs then got a higher score. We used the Redis command <a href=\"https:\/\/redis.io\/commands\/zpopmin\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZPOPMIN<\/a> to process the jobs with the lowest score first. 
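<\/p>

<p>In pure Ruby, with a plain array standing in for the Redis sorted set, the scoring idea looks roughly like this:<\/p>

```ruby
# Pure-Ruby sketch of the per-account scoring scheme; the real module used
# Redis sorted sets and ZPOPMIN. A job's score is the number of jobs its
# account already has queued, so lightly loaded accounts are served first.
class FairQueue
  def initialize
    @entries = []           # [score, seq, account_id, job]
    @queued = Hash.new(0)   # pending jobs per account
    @seq = 0
  end

  def push(account_id, job)
    @entries << [@queued[account_id], @seq += 1, account_id, job]
    @queued[account_id] += 1
  end

  # Lowest score first (ties broken by insertion order), like ZPOPMIN.
  def pop
    entry = @entries.min_by { |score, seq, _, _| [score, seq] }
    return nil unless entry
    @entries.delete(entry)
    @queued[entry[2]] -= 1
    entry.last
  end
end
```

<p>Accounts with a small backlog are served quickly even while a heavy account has thousands of jobs queued.<\/p>

<p>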
This works! Job done?<\/p>\n\n\n\n<p>Unfortunately, no. On the face of it, this algorithm is very fair to all accounts (at least with regard to the number of jobs). However, it has pitfalls. First, let\u2019s look at an example where the queue stops working as FIFO:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Not-FIFO.png\"><img loading=\"lazy\" decoding=\"async\" width=\"350\" height=\"377\" src=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Not-FIFO.png\" alt=\"breaking the FIFO expectation\" class=\"wp-image-21947\" srcset=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Not-FIFO.png 350w, https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/Not-FIFO-279x300.png 279w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>As you can see, the newest job gets processed first, breaking the FIFO expectation. This is not terrible in itself: with retries, for example, you must already prepare jobs to process out of order. But it makes it harder to reason about when a particular job will be processed. The worst part is that this applied across all accounts, so if an account reached a high number of jobs, e.g. 50,000+, those jobs could get stuck there until there was literally nothing else to do.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Throttling &amp; rate limiting<\/strong><\/h3>\n\n\n\n<p>Nowadays we solve this issue with two out-of-the-box mechanisms instead:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>throttling<\/strong>: on large workloads we limit how many jobs we <em>push<\/em>. This can be done, for example, with Sidekiq Batch: you push 1,000 jobs, then push the next batch of 1,000 once those have had a chance to process. That way you always leave some capacity for other accounts. Be sure to use <code>perform_bulk<\/code> (and save 999 roundtrips to Redis). 
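The batched pushing can be sketched as follows. This is a simplified stand-in under stated assumptions: the method name, batch size, and `ImportJob` are hypothetical, and in the real system each slice would go to Sidekiq via <code>perform_bulk</code>, waiting for the previous batch to drain before pushing more.

```ruby
# Simplified sketch of throttled batch pushing (names hypothetical).
BATCH_SIZE = 1_000

def push_in_batches(job_args)
  pushed_batches = 0
  job_args.each_slice(BATCH_SIZE) do |batch|
    # Stand-in for: ImportJob.perform_bulk(batch), one Redis round-trip
    # per 1,000 jobs, followed by waiting for the batch to drain so
    # other accounts keep some processing capacity.
    pushed_batches += 1
  end
  pushed_batches
end

# 2,500 jobs become 3 bulk pushes instead of 2,500 individual ones.
push_in_batches((1..2_500).map { |id| [id] }) # => 3
```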
Alternatively, we also throttle how many jobs we push per minute per account, but that can still lead to saturation if the jobs take longer than expected to process.<\/li>\n\n\n\n<li><strong>rate limiting<\/strong>: when the jobs are already running (or waiting in the queues), you can check them against a quota to verify they\u2019re not using excessive resources, and if they are, make them retry later (with exponential backoff). Sidekiq supports multiple algorithms: you can count the number of jobs, the number of requests to 3rd-party APIs, or any other metric you can measure. Sidekiq 7.1 introduced<a href=\"https:\/\/github.com\/sidekiq\/sidekiq\/wiki\/Ent-Rate-Limiting#points\" target=\"_blank\" rel=\"noreferrer noopener\"> a leaky bucket points-based limiter<\/a>, so it\u2019s possible, for example, to measure the duration of jobs and assign a maximum to each account, or even give on-call developers the ability to dynamically lower the limit in case of unexpected issues. This can be really handy when the jobs are not numerous but take an unexpectedly long time to process.<\/li>\n<\/ul>\n\n\n\n<p>Combining these two approaches smooths bursts along multiple dimensions and makes the system more stable without manual intervention. The downside is more load on Redis, but as long as these features are not used excessively, that load is actually very low.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>We started with a system that had emerged organically and that we didn\u2019t understand very well. It caused issues and hours-long unnecessary latencies. After analyzing them, we realized it was necessary to better understand both the framework itself (read a book) and our system\u2019s scalability.<\/p>\n\n\n\n<p>It turned out to be the right choice, because we managed to squeeze more performance out of similar hardware, benefiting from boosted capacity during peak load. 
Our application\u2019s job throughput also keeps increasing, while we have dramatically reduced incidents. All this without migrating to another framework or language, which saved the significant cost of learning new technologies.<\/p>\n\n\n\n<p>We learned to better understand the jobs we run in the cloud, measuring latency instead of queue length. We started bridging the gap between users and the system, optimizing for efficient usage of our main bottleneck \u2014 the database \u2014 but also for better user experience (lower latency on important jobs).<\/p>\n\n\n\n<p>As a result, we have gained greater partial availability, limiting the impact of issues, and we intend to apply a similar approach to other parts of our web service as well. There are now guidelines and an easy-to-understand system for adding new jobs, eliminating the need to create special queues for special features. We now have much better insight into the health of the system, and with the new operational tools we are better equipped to handle unforeseen circumstances. 
All of this makes the user experience better, but it also makes developers\u2019 lives easier and gives us more time to focus on developing features.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Two years ago we were struggling with background processing which was a problem for our application.<\/p>\n","protected":false},"author":19,"featured_media":21931,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[25,282],"class_list":["post-21930","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-how-we-build","tag-chartmogul","tag-engineering"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.8 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Why we decided to rebuild our Sidekiq infrastructure to support greater scale | ChartMogul<\/title>\n<meta name=\"description\" content=\"We rebuilt our Sidekiq infrastructure and now have insight into the health of the system and are equipped to handle unforeseen circumstances.\" \/>\n<meta name=\"robots\" content=\"index, follow\" \/>\n<link rel=\"canonical\" href=\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Why we decided to rebuild our Sidekiq infrastructure to support greater scale | ChartMogul\" \/>\n<meta property=\"og:description\" content=\"We rebuilt our Sidekiq infrastructure and now have insight into the health of the system and are equipped to handle unforeseen circumstances.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\" \/>\n<meta property=\"og:site_name\" content=\"ChartMogul\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/chartmogul\" \/>\n<meta property=\"article:published_time\" 
content=\"2024-08-13T09:45:06+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-08-15T01:28:30+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement-1024x427.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"427\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Petr Kopac\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@kopacpetr\" \/>\n<meta name=\"twitter:site\" content=\"@chartmogul\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Petr Kopac\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"20 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\"},\"author\":{\"name\":\"Petr Kopac\",\"@id\":\"https:\/\/chartmogul.com\/blog\/#\/schema\/person\/a3a7b867d81dd345fb48427dd350e720\"},\"headline\":\"Why we decided to rebuild our Sidekiq infrastructure to support greater 
scale\",\"datePublished\":\"2024-08-13T09:45:06+00:00\",\"dateModified\":\"2024-08-15T01:28:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\"},\"wordCount\":3867,\"publisher\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png\",\"keywords\":[\"chartmogul\",\"engineering\"],\"articleSection\":[\"How we Build\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\",\"url\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\",\"name\":\"Why we decided to rebuild our Sidekiq infrastructure to support greater scale | ChartMogul\",\"isPartOf\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png\",\"datePublished\":\"2024-08-13T09:45:06+00:00\",\"dateModified\":\"2024-08-15T01:28:30+00:00\",\"description\":\"We rebuilt our Sidekiq infrastructure and now have insight into the health of the system and are equipped to handle unforeseen 
circumstances.\",\"breadcrumb\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage\",\"url\":\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png\",\"contentUrl\":\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png\",\"width\":3000,\"height\":1250,\"caption\":\"(blog)_performance_improvement\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/chartmogul.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Why we decided to rebuild our Sidekiq infrastructure to support greater scale\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/chartmogul.com\/blog\/#website\",\"url\":\"https:\/\/chartmogul.com\/blog\/\",\"name\":\"ChartMogul\",\"description\":\"Get all your SaaS &amp; Subscription Metrics with a Single Click! 
MRR, churn, LTV and much more.\",\"publisher\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/chartmogul.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/chartmogul.com\/blog\/#organization\",\"name\":\"ChartMogul\",\"url\":\"https:\/\/chartmogul.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/chartmogul.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2019\/05\/ChartMogul-Logo.png\",\"contentUrl\":\"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2019\/05\/ChartMogul-Logo.png\",\"width\":278,\"height\":52,\"caption\":\"ChartMogul\"},\"image\":{\"@id\":\"https:\/\/chartmogul.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/chartmogul\",\"https:\/\/x.com\/chartmogul\",\"https:\/\/www.linkedin.com\/company\/chartmogul\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/chartmogul.com\/blog\/#\/schema\/person\/a3a7b867d81dd345fb48427dd350e720\",\"name\":\"Petr Kopac\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/chartmogul.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/e9d605dbd4244f51e2a7ea2630cf0132199e0c7ea16437d9792b81c15ea35816?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/e9d605dbd4244f51e2a7ea2630cf0132199e0c7ea16437d9792b81c15ea35816?s=96&d=mm&r=g\",\"caption\":\"Petr Kopac\"},\"sameAs\":[\"https:\/\/x.com\/kopacpetr\"],\"url\":\"https:\/\/chartmogul.com\/blog\/author\/petr\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Why we decided to rebuild our Sidekiq infrastructure to support greater scale | ChartMogul","description":"We rebuilt our Sidekiq infrastructure and now have insight into the health of the system and are equipped to handle unforeseen circumstances.","robots":{"index":"index","follow":"follow"},"canonical":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/","og_locale":"en_US","og_type":"article","og_title":"Why we decided to rebuild our Sidekiq infrastructure to support greater scale | ChartMogul","og_description":"We rebuilt our Sidekiq infrastructure and now have insight into the health of the system and are equipped to handle unforeseen circumstances.","og_url":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/","og_site_name":"ChartMogul","article_publisher":"https:\/\/www.facebook.com\/chartmogul","article_published_time":"2024-08-13T09:45:06+00:00","article_modified_time":"2024-08-15T01:28:30+00:00","og_image":[{"width":1024,"height":427,"url":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement-1024x427.png","type":"image\/png"}],"author":"Petr Kopac","twitter_card":"summary_large_image","twitter_creator":"@kopacpetr","twitter_site":"@chartmogul","twitter_misc":{"Written by":"Petr Kopac","Est. 
reading time":"20 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#article","isPartOf":{"@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/"},"author":{"name":"Petr Kopac","@id":"https:\/\/chartmogul.com\/blog\/#\/schema\/person\/a3a7b867d81dd345fb48427dd350e720"},"headline":"Why we decided to rebuild our Sidekiq infrastructure to support greater scale","datePublished":"2024-08-13T09:45:06+00:00","dateModified":"2024-08-15T01:28:30+00:00","mainEntityOfPage":{"@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/"},"wordCount":3867,"publisher":{"@id":"https:\/\/chartmogul.com\/blog\/#organization"},"image":{"@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage"},"thumbnailUrl":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png","keywords":["chartmogul","engineering"],"articleSection":["How we Build"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/","url":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/","name":"Why we decided to rebuild our Sidekiq infrastructure to support greater scale | ChartMogul","isPartOf":{"@id":"https:\/\/chartmogul.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage"},"image":{"@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage"},"thumbnailUrl":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png","datePublished":"2024-08-13T09:45:06+00:00","dateModified":"2024-08-15T01:28:30+00:00","description":"We rebuilt our Sidekiq infrastructure and now have insight into the health of the system and are equipped to handle unforeseen 
circumstances.","breadcrumb":{"@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#primaryimage","url":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png","contentUrl":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2024\/08\/blog_performance_improvement.png","width":3000,"height":1250,"caption":"(blog)_performance_improvement"},{"@type":"BreadcrumbList","@id":"https:\/\/chartmogul.com\/blog\/autoscaling-sidekiq\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/chartmogul.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Why we decided to rebuild our Sidekiq infrastructure to support greater scale"}]},{"@type":"WebSite","@id":"https:\/\/chartmogul.com\/blog\/#website","url":"https:\/\/chartmogul.com\/blog\/","name":"ChartMogul","description":"Get all your SaaS &amp; Subscription Metrics with a Single Click! 
MRR, churn, LTV and much more.","publisher":{"@id":"https:\/\/chartmogul.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/chartmogul.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/chartmogul.com\/blog\/#organization","name":"ChartMogul","url":"https:\/\/chartmogul.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/chartmogul.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2019\/05\/ChartMogul-Logo.png","contentUrl":"https:\/\/chartmogul.com\/blog\/wp-content\/uploads\/2019\/05\/ChartMogul-Logo.png","width":278,"height":52,"caption":"ChartMogul"},"image":{"@id":"https:\/\/chartmogul.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/chartmogul","https:\/\/x.com\/chartmogul","https:\/\/www.linkedin.com\/company\/chartmogul\/"]},{"@type":"Person","@id":"https:\/\/chartmogul.com\/blog\/#\/schema\/person\/a3a7b867d81dd345fb48427dd350e720","name":"Petr Kopac","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/chartmogul.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/e9d605dbd4244f51e2a7ea2630cf0132199e0c7ea16437d9792b81c15ea35816?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e9d605dbd4244f51e2a7ea2630cf0132199e0c7ea16437d9792b81c15ea35816?s=96&d=mm&r=g","caption":"Petr 
Kopac"},"sameAs":["https:\/\/x.com\/kopacpetr"],"url":"https:\/\/chartmogul.com\/blog\/author\/petr\/"}]}},"_links":{"self":[{"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/posts\/21930","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/comments?post=21930"}],"version-history":[{"count":10,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/posts\/21930\/revisions"}],"predecessor-version":[{"id":21961,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/posts\/21930\/revisions\/21961"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/media\/21931"}],"wp:attachment":[{"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/media?parent=21930"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/categories?post=21930"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/chartmogul.com\/blog\/wp-json\/wp\/v2\/tags?post=21930"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}