Improving Maintenance and Observability of an HR Platform Running on Istio
A provider of human resources software that processes more than $10 billion in payroll annually turned to Swoom to upgrade its platform while keeping critical components operational during the migration.
About the project
Brief results of the collaboration:
- A service mesh based on Istio was updated without any downtime, keeping critical HR services secure and operational for 1,400+ organizations.
- With an upgraded platform, the company was able to improve maintainability and transparency, resuming the development of new services.
- By employing best practices and recommendations provided by Swoom, the organization is able to troubleshoot services much more easily, as well as benefit from the flexibility, scalability, and high availability provided by Amazon EKS.
The customer
Based in the USA, the customer is a full-service human resources (HR) platform provider. The company aims to solve some of the largest HR problems mid-sized organizations face, such as talent and time management, benefits administration, payroll, etc. The customer’s main product is an HR platform with an open API utilized by 60+ third-party developers. The platform serves 1,400+ companies nationwide and processes more than $10 billion in annual payroll.
The need
The company's services had been running on Amazon Elastic Kubernetes Service (EKS) since 2018. After two years in production, the in-house team experienced issues upgrading Istio, the platform's service mesh, due to a lack of transparency, official documentation, and focused expertise with the product. This meant that critical services providing observability, traffic management, security, etc., would remain outdated and eventually run into incompatibility issues.
The company turned to Swoom, a certified Kubernetes solutions provider and an Amazon partner, for assistance in updating the service mesh.
The challenges
Under the project, the team at Swoom had to address the following issues:
- The lack of documentation on upgrading Istio v1.4 to v1.5 and v1.5 to v1.6 made it difficult to define a clear path for the update process.
- The HR platform was constantly under load, meaning Istio had to be updated without any downtime.
The solution
Stage 1. Evaluation
Along with the customer, engineers at Swoom assessed the existing Istio v1.4 service mesh and outlined an update strategy.
Given the absence of official documentation for upgrading Istio v1.4 to v1.5 and v1.5 to v1.6, as well as multiple incompatibility issues between versions caused by the shift from a microservices-based control plane in v1.4 to a monolithic model in v1.5, our developers opted to bypass v1.5 and upgrade directly to Istio v1.6.
Stage 2. Migration
To ensure a smooth and seamless upgrade, our team performed the update using a canary deployment, a new feature added in Istio v1.6. In this manner, our DevOps experts deployed a new Istio v1.6 control plane that ran in parallel with the existing Istio v1.4 control plane.
While both control planes were up and running, engineers at Swoom were able to shift a portion of the customer’s workloads to the Istio v1.6 control plane and monitor the effects. This process enabled our team to run exhaustive tests and resolve any issues before redirecting all of the company’s traffic to the upgraded control plane. This way, our developers performed the entire upgrade process without experiencing any downtime.
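The canary approach described above can be sketched as a short list of CLI steps. The Python snippet below only builds the commands and does not run them. It assumes that workloads were opted into sidecar injection through the namespace-level istio-injection label; the revision name 1-6-0 and the namespace payroll-canary are hypothetical and do not come from the project.

```python
from typing import List


def canary_upgrade_commands(revision: str, namespaces: List[str]) -> List[List[str]]:
    """Build the CLI steps for a revision-based (canary) control plane upgrade.

    Nothing is executed here; the function only returns the commands.
    """
    # Step 1: install the new control plane under its own revision name,
    # so that it runs side by side with the existing control plane.
    commands = [["istioctl", "install", "--set", f"revision={revision}"]]

    for namespace in namespaces:
        # Step 2: drop the old injection label and point the namespace
        # at the new revision (assumes namespace-level injection labels).
        commands.append([
            "kubectl", "label", "namespace", namespace,
            "istio-injection-", f"istio.io/rev={revision}",
        ])
        # Step 3: restart the workloads so their sidecars are re-injected
        # and connect to the new control plane.
        commands.append(["kubectl", "rollout", "restart", "deployment", "-n", namespace])

    return commands


if __name__ == "__main__":
    # Hypothetical revision and namespace, used only for illustration.
    for command in canary_upgrade_commands("1-6-0", ["payroll-canary"]):
        print(" ".join(command))
```

Passing only a few namespaces at first mirrors the idea of shifting a portion of the workloads, observing the results, and extending the list once the new control plane has proven stable.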
Stage 3. Training
With the service mesh updated, DevOps experts at Swoom facilitated knowledge transfer with the in-house team. This introduced the customer to Kubernetes and Istio best practices, such as creating a separate namespace for each team to isolate network resources and adding a virtual service for each app to make it easier to troubleshoot errors.
Critically, our team also shared the knowledge and experience the company needed to keep its service mesh up to date, enabling it to take advantage of new features and services.
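As an illustration of the two practices mentioned above, the sketch below generates a Namespace manifest for a team and a VirtualService manifest for an app, printed as JSON that kubectl can apply. The team name, the app name, and the route name are hypothetical, and the manifests are a minimal assumption about what such resources might look like rather than the configuration used in the project.

```python
import json
from typing import Any, Dict


def team_namespace(team: str) -> Dict[str, Any]:
    """Return a Namespace manifest that gives one team its own isolated scope."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team, "labels": {"team": team}},
    }


def app_virtual_service(app: str, namespace: str) -> Dict[str, Any]:
    """Return a minimal VirtualService that routes all HTTP traffic to one app."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": app, "namespace": namespace},
        "spec": {
            # The host is the Kubernetes service name of the app.
            "hosts": [app],
            "http": [
                {
                    # A named route is easier to spot when reading access logs.
                    "name": f"{app}-default",
                    "route": [{"destination": {"host": app}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical team and app names, used only for illustration.
    manifests = [team_namespace("benefits"), app_virtual_service("enrollment", "benefits")]
    for manifest in manifests:
        print(json.dumps(manifest, indent=2))
```

Generating the manifests from one function per resource type keeps naming consistent across teams, which is part of what makes errors easier to trace back to a specific app.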
The outcome
Partnering with Swoom, the company successfully upgraded its Istio service mesh to v1.6 with zero downtime, while also equipping its in-house team with the expertise needed to keep the platform up to date. With an updated service mesh, the customer can ensure that its HR platform, which serves 1,400+ mid-sized companies and processes over $10 billion in payroll annually, remains operational and secure. The organization now has the expertise to perform upgrades, develop new services, and enact further improvements by implementing recommendations and best practices shared by Swoom.