All an SRE Needs to Know: Automation Era in Distributed Datastores
Core Contributors: Merwin Joseph Biby, Ritik Singhal & Mannoj Saravanan
24 December 2024
<tl;dr>
– An Elasticsearch upgrade for a single cluster used to take more than 10 hours for 2 SREs working manually, with considerable back and forth.
– SRE has automated the entire rolling-upgrade flow with no downtime, along with strict guardrails, a killswitch, and regular Slack bot updates.
– The automation has been executed in prod more than 10 times.
– Now each cluster takes 4 to 6 hours, any number of clusters can be upgraded in parallel, and an SRE's involvement is no more than ~30 minutes per cluster.
</tl;dr>
What is Elasticsearch?
- Elasticsearch is a search engine that stores data in Lucene indexes; a Lucene index is built on an inverted index. Read this for more info.
- With this inverted-index storage, retrieval capability, and data management across nodes that can scale up and down horizontally or vertically, plus data availability, user access management, data lifecycle management, and logging and graphing to make sense of it all, the entire bundle is called ELK (Elasticsearch, Logstash, and Kibana). For the Foxtrot use case, however, we use only Elasticsearch (an Elasticsearch cluster).
What is an index?
- The term Index is logically a table in a traditional database.
- Each index has two properties: number of shards and number of replicas.
e.g.: { index_name: Student; Shards: 3; Replicas: 2 }
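To make that concrete, here is a minimal sketch, not from the original post, of creating such an index with explicit shard and replica settings through the Elasticsearch REST API. The index name and the localhost endpoint are assumptions.

```python
# Hypothetical example: create a "student" index with 3 shards and 2 replicas.
import requests

resp = requests.put(
    "http://localhost:9200/student",   # assumed cluster endpoint and index name
    json={
        "settings": {
            "number_of_shards": 3,     # data is split into 3 primary shards
            "number_of_replicas": 2,   # each primary shard gets 2 replica copies
        }
    },
)
print(resp.json())
```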

Shards and Replicas: Suppose there are 3 servers/nodes; the Student index will then have 3 shards, and each shard will have its own replica. Shards are nothing but chunks/slices of the data.
What happens when Node1 goes down?
If one node fails, you will not lose data, because by default Elasticsearch will not co-host a replica on the same node as its primary.
What are Roles in Elasticsearch?

Data: Holds the index data (shards, primary and replica) as we saw above.
Query: Does not store data; it provides the query/ingestion processing layer that applications use, with connection pooling, to retrieve or write data.
Master: Maintains the metadata of which node holds which data/shard. Also handles node failures, node joins, and rebalance operations.
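As a quick way to see which role each node plays, here is a hedged sketch (the endpoint is an assumption) using the _cat/nodes API:

```python
# List each node's name and roles; the "master" column marks the elected master.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/nodes",                   # assumed cluster endpoint
    params={"v": "true", "h": "name,node.roles,master"},
)
print(resp.text)
```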
What is Foxtrot and its Infra Footprint on Elasticsearch?

Elasticsearch architecture for the Foxtrot application:
BM = Bare metal
Node/VM = Virtual machine
Each BM hosts 4 VMs/nodes.

What is BM Awareness?

As per the image above:
Data_Yellow (Primary on BM21.Node1; Replica on BM43.Node4)
Data_Red (Primary on BM87.Node2; Replica on BM81.Node3)
Replica data is placed on a different BM from its primary's BM.
It is a cluster-level setting, configured like below (rack_id is the BM_NUMBER for us):
"allocation": { "awareness": { "attributes": "rack_id" } }
Why? Because a whole BM is more likely to go down than a single VM, and if the primary and replica sit on VMs in different BMs, losing one BM does not cause data loss.
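A minimal sketch of what that looks like end to end, assuming each node declares its BM via node.attr.rack_id in elasticsearch.yml and that the cluster is reachable on the endpoint below:

```python
# Enable shard allocation awareness on rack_id so that a primary and its
# replica are never placed on nodes that share the same rack_id (i.e. same BM).
import requests

resp = requests.put(
    "http://localhost:9200/_cluster/settings",            # assumed endpoint
    json={
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": "rack_id"
        }
    },
)
print(resp.json())
```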
Storytelling begins:
Let's get into why the tool exists.
Previously, we were on Elasticsearch 7.10, and recent perf runs on newer versions gave us better results.

Verdict based on perf: We had to perform an in-place upgrade of all clusters from 7.10 to 7.17 and then to 8.9, as there is no direct upgrade path from 7.10 to 8.9.

A Look at our Working Architecture:
[Details follow in the next section.]

Outcome:

Details:
Automation Flow Explanation:
@ UpgradeTime - 2:
- Input all details for the cluster to be upgraded, verify that the cluster is intact, and note what the cluster-intact signal should look like.
@ UpgradeTime - 1:
- Ping from the Salt master to all its minions/nodes.
- Perform version checks on all nodes.
- Check that each node's config is intact against the git config for that cluster.
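As an illustration of these UpgradeTime - 1 checks, here is a rough sketch, assuming the Salt CLI is available on the master and minion IDs match node names; the node names, package name, and commands are illustrative:

```python
# Ping every minion and read the installed Elasticsearch version via Salt.
import subprocess

def salt(target: str, func: str, *args: str) -> str:
    """Run a salt command against a comma-separated list of minions."""
    cmd = ["salt", "-L", target, func, *args, "--out=json"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

nodes = "es-node-1,es-node-2,es-node-3"                  # hypothetical minion IDs
print(salt(nodes, "test.ping"))                          # are all minions reachable?
print(salt(nodes, "cmd.run", "rpm -q elasticsearch"))    # installed ES version per node
```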
@ UpgradeTime:
- Run the cluster-intact checks and the upgrade execution flow described below.

What is the Cluster Intact Flow?
- All nodes are in the cluster, as per the data recorded @ UpgradeTime - 2.
- Is the cluster green?
- initializing_shards < 10 and unassigned_shards = 0
- active_shards_percent_as_number is 100%
- Are the expected nodes upgraded to the latest version?
- If all 5 checks above pass, give a green signal.
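Put together, the cluster-intact signal can be sketched roughly like this; the endpoint and expected node count are assumptions, and the real module also compares versions and node membership recorded at UpgradeTime - 2:

```python
# Return True only if the cluster matches the health checks listed above.
import requests

def cluster_intact(es_url: str, expected_nodes: int) -> bool:
    health = requests.get(f"{es_url}/_cluster/health").json()
    return (
        health["number_of_nodes"] == expected_nodes             # all nodes are in the cluster
        and health["status"] == "green"                         # cluster is green
        and health["initializing_shards"] < 10                  # few shards still initializing
        and health["unassigned_shards"] == 0                    # nothing unassigned
        and health["active_shards_percent_as_number"] == 100.0  # all shards active
    )

print(cluster_intact("http://localhost:9200", expected_nodes=12))
```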
What is the Upgrade Execution Flow?
On each VM:
- Download the required package from the in-house mirror.
- Back up config files to a safe location.
- Stop ES.
- Upgrade ES.
- Start ES.
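A hedged sketch of those per-VM steps, assuming an rpm-based install, a systemd service named elasticsearch, and an internal mirror URL (all illustrative):

```python
# The per-VM steps in order: download, back up config, stop, upgrade, start.
import subprocess

MIRROR = "https://mirror.internal.example/elasticsearch"  # assumed in-house mirror

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def upgrade_vm(version: str) -> None:
    run(f"curl -sO {MIRROR}/elasticsearch-{version}-x86_64.rpm")       # download package
    run("cp -a /etc/elasticsearch /tmp/elasticsearch-config-backup")   # back up config files
    run("systemctl stop elasticsearch")                                # stop ES
    run(f"rpm -Uvh elasticsearch-{version}-x86_64.rpm")                # upgrade ES in place
    run("systemctl start elasticsearch")                               # start ES

upgrade_vm("7.17.13")   # illustrative target version
```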
Journey Insights:
- Written in Python.
- It took ~35 days cumulatively to code and consists of 3 phases.
- As of now, 11 upgrades have been completed successfully in prod with this module.
- Rollback automation was required in the stage environment for continuous testing.
- Guardrails were implemented so that at any time no more than 4 VMs are down, which is acceptable since the cluster is BM aware (see the sketch after this list).
- The success rate of an upgrade is improved in the mapper phase, i.e., `@ UpgradeTime - 1` itself, before the actual upgrade kicks off, via Salt ping checks, version checks, and discrepancy checks.
- Adopting the ES Upgrader for other teams' ES clusters in the org is possible (in progress).
- For each test case/feature/task, Jira has details on the why, what, and how, so we don't have to go back and rediscover the reasoning for the same item again.
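For the guardrail mentioned above, here is a minimal sketch of the kind of check run before taking the next VM down; the endpoint, node counts, and threshold handling are assumptions:

```python
# Trip the killswitch if one more VM down would exceed one BM's worth of nodes.
import requests

MAX_NODES_DOWN = 4  # one BM hosts 4 VMs, and the cluster is BM aware

def nodes_down(es_url: str, expected_nodes: int) -> int:
    health = requests.get(f"{es_url}/_cluster/health").json()
    return expected_nodes - health["number_of_nodes"]

def safe_to_take_next_vm(es_url: str, expected_nodes: int) -> bool:
    down = nodes_down(es_url, expected_nodes)
    if down >= MAX_NODES_DOWN:
        print(f"Killswitch: {down} nodes already down, halting the rolling upgrade")
        return False
    return True

print(safe_to_take_next_vm("http://localhost:9200", expected_nodes=12))
```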