Moving ChartMogul to AWS and Kubernetes

A few months ago, we retired our last pieces of infrastructure on DigitalOcean, marking our migration to AWS as complete. Our journey was not your regular AWS migration as it involved moving our infrastructure from classic VMs to containers orchestrated by Kubernetes.

In a series of articles, we will share our experiences, hoping they can be helpful to the entire community:

Our journey to AWS EKS (Kubernetes managed service).
Some of the most critical roadblocks we encountered.
Our current stack and tooling.
Our infrastructure plans going forward.

Life with DigitalOcean

From our inception in 2014 until mid-2021, our entire infrastructure ran on DigitalOcean droplets (self-managed cloud virtual machines). We needed a cloud provider to get us off the ground quickly, reliably, and cost-effectively.

DigitalOcean made a lot of sense and was a great choice. We are where we are because of them. That choice gave us the freedom to focus on product building without worrying about scalability and infrastructure complexity – aspects that typically kick in at a later stage.

Every aspect of our infrastructure was provisioned, configured, and managed in-house. We used configuration management and Infrastructure as Code tools (SaltStack and Terraform) to do so.

We kept growing over the years, and by 2019 we found ourselves looking at a fleet of somewhere around 50 machines in constant need of management, software updates, security patches, and so on. And with new projects in our pipeline, we expected our compute power needs to double by the end of 2020.

Why move and why now?

As great a choice as DigitalOcean was, our organic growth kept pushing the boundaries of our setup over the years. We faced challenges in multiple areas, some fixable and preventable, some not.

Various failures

Ad-hoc unannounced maintenance windows that suddenly broke production.
Hardware failures on several occasions, affecting our primary-replica database setup (e.g., droplets entering a “live migration state to other hardware machines” without notice, meaning 1-2 hours of downtime for that droplet).
Unexplained networking issues with latency between our machines that DigitalOcean’s support team never resolved (this was critical for our Postgres read-replica lag, Redis instances, and HA in general).

AMS2 region deprecation

Our DigitalOcean region (AMS2) was announced as “soon to be retired”, meaning limited support. We could not secure additional resources on demand, and executing simple tasks usually meant long planning and wasted resources. Simple things such as upgrading a Postgres version or provisioning a new machine to perform a task were becoming impossible to do.

Limited hardware choices

Being in the subscription analytics space means data-intensive operations, large data volumes, and a frequent need to scale accordingly. Modern machines with more extensive hardware resources were only available in other regions. Network performance degradation was a frequent occurrence, and we soon realized that migrating to a different region was our best bet.

Lack of modern cloud features and managed services

The volume of operational work needed to maintain our infrastructure and keep up with our growth rate (while dealing with tech debt at the same time) kept increasing. We had to take a hard look at our setup and understand whether moving to a different DigitalOcean region or to a new cloud provider was the best choice.

Should we stay or should we go?

We started looking into the benefits of staying with DigitalOcean and simply moving to a new region – a more leisurely, quicker, cheaper, less painful option. But at the same time, we treated this move as an opportunity to modernize parts of our stack in service of expected user growth and an increased rate of progress.

By the end of our assessment, we realized that specific must-have requirements would be hard to achieve by staying and simply switching regions. The most important ones were:

Flexibility in auto-scaling compute resources.
Managed databases.
Provisioning of resources based on temporary usage.
Low(er) latency.
Service interoperability.
Container-based infrastructure with Kubernetes orchestration.

This list of requirements along with the challenges listed in the previous section tipped the scale in favor of switching providers.

Why AWS?

Choosing a new cloud provider to power ChartMogul infrastructure was a long journey. We researched the market and discovered many tradeoffs and advantages a new provider could bring to the table. Our options were Amazon Web Services (AWS), Google Cloud (GCP), and Azure. Ultimately, we decided to go with AWS. We list some of the main reasons below.

Team expertise

We were already using some AWS services in production (e.g., S3 for storing incremental Postgres backups). More importantly, a few of our engineers had prior professional experience using various AWS services extensively in production systems.

Scalability

We can ramp AWS instances up or down at the push of a button.
We can instantly provision resources like RDS databases and compute resources temporarily.
We can iterate through experiments and proof of concepts quickly.
The flexibility and scalability of Kubernetes node pools backed by EC2 auto-scaling are hard to beat.

Data security and compliance

Data security has always been top of mind. Over the years, AWS security capabilities have grown substantially. The newer services AWS has developed around data security cover most of our needs in the container/Kubernetes space, and they play nicely with well-established features such as private VPC isolation, fine-grained control of policies, and IAM roles.

Compliance-wise, we plan to become SOC 2 certified as soon as possible, and we found AWS compliance programs to be an advantage that can help fast-track that journey.

Managed services

Postgres is at the heart of what we do at ChartMogul, and we’ve typically spent a lot of time actively managing our database fleet of machines to support our growth. High availability and reliability of databases were becoming growing concerns, so we decided to evaluate multiple offers from major cloud providers with managed PostgreSQL. AWS RDS was the clear winner.

Managed Kubernetes was another major factor to consider, and here AWS went head to head with Google Cloud (GCP). Google’s managed Kubernetes (GKE) felt better than what AWS had at the time, but comparing RDS to Cloud SQL wasn’t close feature-wise. Nowadays, however, it seems that AWS is catching up with EKS. We benefit from great RDS features such as snapshot flexibility, backup durability (with SLA), read replicas for Postgres, painless upgrades, dedicated IOPS, CloudWatch metrics, Performance Insights, and the list goes on.

The insane number of AWS services

At the time of writing, AWS offers over 200 services. Most of them give you instant access to managed services in areas such as compute, databases, data analytics, data warehousing, serverless, and storage. Our engineering teams can now leverage top-notch integrations to solve core problems quickly and prioritize buy vs. build where it makes sense.

Disaster Recovery

AWS cloud is an essential part of our Disaster Recovery plan. That’s because instances are easy to spin up, we can promote RDS read-replicas to primary at the click of a button, snapshots are a breeze, we can host in multiple regions, and we have a top-notch integration with our IaC tool of choice.

AWS Credits

We secured $100k worth of credits through the AWS Startup program. We were able to plan, test, and complete our migration without considerable expenses.

Migration to AWS

Our migration from DigitalOcean to AWS was a ten-month-long journey. The entire effort was backed by volunteers from all of our engineering teams and driven by a DevOps engineer, a backend engineer, and our head of engineering.

Some things involved trial & error. We tried multiple ways of:

Moving data from Postgres to RDS with near-zero downtime.
Moving our app and services from VM-based architecture to containerized ones in Kubernetes.
Fundamentally changing the way we deploy.

A perfect plan was in place, and everything looked good to go on paper, but we learned the hard way that things will not always go to plan. At times, our near-zero downtime migration goal was at serious risk, and back to the drawing board we went.

Perseverance, drive, and fantastic team effort helped us overcome the challenges we faced. Careful planning did wonders too. Given our capacity, we established early on that breaking down the actual migration into three stages would work best.

The week prior to D-day

Start Postgres replication from DigitalOcean to RDS instances.
Review our AWS future production infrastructure.
Configure secrets (AWS Parameter Store).
Ensure CI/CD pipelines are ready to deploy to our new Kubernetes clusters.
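
The first item in this list can be illustrated with a minimal sketch. It assumes PostgreSQL's native logical replication (publication on the source, subscription on the target), which the later "drop the subscription" step suggests but the article does not spell out. The publication name, subscription name, and connection string are hypothetical, and the helper only builds the SQL statements; it does not connect to any database.

```python
def replication_statements(publication, subscription, source_conninfo):
    """Build the SQL for a publication/subscription pair.

    Assumes simple, trusted identifiers (no quoting is applied to them).
    All names passed in are illustrative, not ChartMogul's real ones.
    """
    # Escape single quotes so the conninfo is a valid SQL string literal.
    conninfo = source_conninfo.replace("'", "''")
    return {
        # Run on the source (DigitalOcean) database; needs wal_level = logical.
        "source": f"CREATE PUBLICATION {publication} FOR ALL TABLES;",
        # Run on the target (RDS) database; the schema must already exist
        # there, because logical replication does not copy DDL.
        "target": (
            f"CREATE SUBSCRIPTION {subscription} "
            f"CONNECTION '{conninfo}' "
            f"PUBLICATION {publication};"
        ),
        # Run on the target once the cutover is complete.
        "cleanup": f"DROP SUBSCRIPTION {subscription};",
    }


if __name__ == "__main__":
    stmts = replication_statements(
        "migration_pub",  # hypothetical publication name
        "migration_sub",  # hypothetical subscription name
        "host=source.example.internal dbname=app user=replicator",
    )
    for where, sql in stmts.items():
        print(f"[{where}] {sql}")
```

Starting replication a week early gives the initial table copy time to finish, so that only the ongoing change stream has to catch up on D-day.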

The day before D-day

Prepare our AWS temp webhook recorder infrastructure (losing events during our migration was not an option).
Move some data in advance (e.g., DigitalOcean Spaces to S3).
Update all Parameter Store secrets to production values.
Prepare DNS changes.
Set all Kubernetes deployments to zero pods to prevent services from accessing production data during migration.
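
To make the idea of a temporary webhook recorder concrete, here is a minimal sketch of a record-and-replay store. It is not the recorder we built (that is covered in a future article); the class name, file format (JSON lines), and handler signature are all assumptions chosen for illustration.

```python
import json
from pathlib import Path


class WebhookRecorder:
    """Append incoming webhooks to a file and replay them later, in order.

    Hypothetical sketch: a real recorder would sit behind an HTTP endpoint
    and use durable storage rather than a local file.
    """

    def __init__(self, path):
        self.path = Path(path)

    def record(self, headers, body):
        # One JSON object per line keeps writes append-only and ordered.
        entry = {"headers": dict(headers), "body": body}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def replay(self, handler):
        # Feed every recorded event to `handler` in arrival order and
        # return the number of events replayed.
        if not self.path.exists():
            return 0
        count = 0
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    entry = json.loads(line)
                    handler(entry["headers"], entry["body"])
                    count += 1
        return count
```

During the migration window, events would only be recorded; once the new stack is serving traffic, `replay` hands them to the regular webhook handler so that nothing is lost.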

D-day: Flicking the switch

Redirect all webhooks to AWS temporary recorder.
Stop all the services on DigitalOcean.
Wait for Postgres replication to catch up on the latest updates.
Compare DigitalOcean and RDS Postgres data (to ensure integrity and that replication has caught up).
Drop the subscription from RDS to Postgres running in DigitalOcean.
Create RDS read replicas.
Update our Parameter Store secrets with new RDS endpoints and secrets.
Deploy to Kubernetes and restart PgBouncer to load new configurations.
Switch DNS records for app.chartmogul.com to AWS.
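
The "wait for replication to catch up" and "compare data" steps above can be sketched as two small checks. This is an illustrative outline, not our actual tooling: it assumes the write-ahead log positions (LSNs) and per-table row counts have already been fetched from both sides (for example, `pg_current_wal_lsn()` on the source), and every function name here is hypothetical.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer.

    An LSN is printed as two hexadecimal numbers: the high and low
    32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(source_lsn, target_lsn):
    """Bytes of WAL the target still has to receive (0 means caught up)."""
    return lsn_to_int(source_lsn) - lsn_to_int(target_lsn)


def mismatched_tables(source_counts, target_counts):
    """Return the sorted names of tables whose row counts differ.

    A table that is missing on one side is also reported.
    """
    tables = sorted(set(source_counts) | set(target_counts))
    return [t for t in tables if source_counts.get(t) != target_counts.get(t)]


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(replication_lag_bytes("0/3000060", "0/3000000"))
    print(mismatched_tables({"customers": 10, "invoices": 42},
                            {"customers": 10, "invoices": 41}))
```

Row counts are a coarse check; comparing per-table checksums would be stricter, at the cost of a longer pause before the cutover can proceed.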

At this point, we were running our production workload on the shiny new infrastructure! We finished the whole thing in 10 hours (we initially estimated 8 hours – not too bad).

Challenges with AWS

The biggest struggle was with the DMS service (AWS managed service to move databases into RDS). It was not as easy to use as advertised. In our case with Postgres, it was not helpful. Eventually, we developed a custom way of moving data into AWS.

We also came to the hard realization that moving databases with zero downtime to AWS with webhook support is complicated. We developed a custom approach to support this setup.

More on these custom approaches in future articles.

Future articles in the series

Look out for future articles documenting our migration journey from DigitalOcean to AWS. We will touch on topics such as:

Why we chose Kubernetes to power ChartMogul.
How we migrated PostgreSQL to RDS.
How we migrated our Rails app to Kubernetes.
How we set up an IPsec tunnel to AWS VPC.