
All an SRE needs to know : Automation ERA in Distributed Datastores

Core Contributors: Merwin Joseph Biby, Ritik Singhal & Mannoj Saravanan

24 December, 2024


<tl;dr>

– Upgrading one Elasticsearch cluster manually used to take two SREs more than 10 hours and entailed considerable back and forth.

– SRE has automated the entire rolling-upgrade flow, with no downtime, along with strict guardrails, a kill switch, and regular Slack bot updates.

– Automation has been executed in prod more than 10 times.

– Each cluster now takes 4 to 6 hours. Any number of clusters can be upgraded in parallel, with an SRE involved for no more than 30 minutes.

</tl;dr>

What is Elasticsearch?  

  • Elasticsearch is a search engine that stores data in Lucene indices. A Lucene index is built on an inverted index. Read this for more info.
  • On top of this inverted-index storage come retrieval, data management across nodes that can scale horizontally or vertically (up or down), data availability, user access management, data life cycle management, and logging and graph plotting to make sense of the data. The entire bundle is called ELK (Elasticsearch, Logstash, and Kibana). For the Foxtrot use case, however, we use only Elasticsearch, i.e. an Elasticsearch cluster.
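
To make the inverted-index idea concrete, here is a minimal Python sketch. It only illustrates the concept (term to list of documents) and is not how Lucene is implemented; the `build_inverted_index` helper and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, for illustration only.
docs = {
    1: "rolling upgrade without downtime",
    2: "upgrade elasticsearch cluster",
}
inverted = build_inverted_index(docs)
print(inverted["upgrade"])  # ids of every document containing the term "upgrade"
```

A search for a term becomes a dictionary lookup instead of a scan over every document, which is the property a search engine relies on.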

What is an index?

  • The term index is, logically, a table in a traditional database.
  • Each index has 2 properties: number of shards and number of replicas.

                e.g.: { index_name: Student; Shards: 3; Replica: 2 }

Shard and Replica: There are 3 servers/nodes, and the Student index has 3 shards, each with its own replica. Shards are nothing but chunks/slices of the data.
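
As a sketch, the example index above can be written as an index-creation settings body. `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings; the helper names are made up for this illustration, and the values are the ones from the example.

```python
import json


def index_settings(shards, replicas):
    """Build the settings body used when creating an index."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def total_shard_copies(shards, replicas):
    """Primaries plus replicas: every primary shard gets `replicas` extra copies."""
    return shards * (1 + replicas)


body = index_settings(shards=3, replicas=2)
print(json.dumps(body, indent=2))
print(total_shard_copies(3, 2))  # number of shard copies the cluster must place
```

Note that the replica count applies per primary shard, so the number of shard copies the cluster has to place grows as shards × (1 + replicas).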

What happens when Node1 goes down? 

If one node fails, there is no data loss, because by default Elasticsearch never places a replica on the same node as its primary.
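
The reasoning can be checked with a toy model. The round-robin placement below is an assumption made for illustration (it is not Elasticsearch's allocator) and needs at least two nodes; the point is only that a shard is lost solely when every copy sat on the failed node.

```python
def place_shards(nodes, num_shards):
    """Toy placement: primary i on node i, its replica on the next node (round robin)."""
    layout = {}
    for shard in range(num_shards):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        layout[shard] = {"primary": primary, "replica": replica}
    return layout


def shards_lost(layout, failed_node):
    """Return the shards for which every copy lived on the failed node."""
    return [
        shard
        for shard, copies in layout.items()
        if set(copies.values()) == {failed_node}
    ]


layout = place_shards(["Node1", "Node2", "Node3"], num_shards=3)
print(shards_lost(layout, "Node1"))  # shards with no surviving copy
```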

What are Roles in Elasticsearch?

Data: Holds the index data, i.e. the primary/replica shards we saw above.

Query: Doesn’t store data. It is the query/ingestion processing layer that applications connect to, using connection pooling, to retrieve or put data.

Master: Maintains the metadata of which node holds which data/shard. It also takes care of node failures, node joins, and rebalance operations.

What is Foxtrot and its Infra Footprint on Elasticsearch?

Elasticsearch Architecture For Foxtrot Application :

BM = Baremetal

Node/VM = Virtual Machine

In 1 BM we have 4 VMs/Nodes

What is BM Aware?

As per the above image:

Data_Yellow ( Primary in BM21.Node1 ; Replica in BM43.Node4)

Data_Red ( Primary in BM87.Node2 ; Replica in BM81.Node3)

A replica is always placed on a different BM from its primary, never on the primary’s BM.

It’s a cluster-level setting, specified as below (rack_id is the BM number for us):

"allocation": { "awareness": { "attributes": "rack_id" } }

Why? Because a whole BM is more likely to go down, and if the primary and replica merely sat on different VMs of the same BM, a single BM failure would take out both copies.
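
A minimal sketch of the BM-aware property, using the two shards named above. `cluster.routing.allocation.awareness.attributes` is the standard Elasticsearch awareness setting; wrapping it in a `persistent` cluster-settings body is an assumption about how it might be applied, and the two helper functions are hypothetical.

```python
# Shard placements from the example above: (BM, node) for primary and replica.
placements = {
    "Data_Yellow": {"primary": ("BM21", "Node1"), "replica": ("BM43", "Node4")},
    "Data_Red": {"primary": ("BM87", "Node2"), "replica": ("BM81", "Node3")},
}

# Awareness setting; rack_id carries the BM number. The "persistent" wrapper is
# an assumption (the setting could equally live in the node config file).
awareness_settings = {
    "persistent": {"cluster.routing.allocation.awareness.attributes": "rack_id"}
}


def bm_aware_violations(placements):
    """Return the shards whose primary and replica share the same BM."""
    return [
        name
        for name, copies in placements.items()
        if copies["primary"][0] == copies["replica"][0]
    ]


def survives_bm_failure(placements, failed_bm):
    """True if every shard keeps at least one copy when `failed_bm` goes down."""
    return all(
        any(bm != failed_bm for bm, _node in copies.values())
        for copies in placements.values()
    )


print(bm_aware_violations(placements))
print(survives_bm_failure(placements, "BM21"))
```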

Story telling begins:

Let’s get into why the tool exists

Previously, we ran Elasticsearch 7.10, and recent perf tests across different versions showed better results on newer versions.

Verdict based on perf: we had to perform an in-place upgrade of all clusters from 7.10 to 7.17 and then to 8.9, as there is no direct upgrade path from 7.10 to 8.9.
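
The two-hop path can be sketched as a small planning function. This is a simplification that encodes only the rule stated above (a 7.x cluster goes through 7.17 before 8.x); `plan_upgrade_path` is a hypothetical helper, not part of the tool described here.

```python
def plan_upgrade_path(current, target, bridge="7.17"):
    """Return the ordered list of versions to upgrade through.

    Assumes a single rule: crossing from 7.x to 8.x must pass through the
    bridge release first.
    """
    current_major = int(current.split(".")[0])
    target_major = int(target.split(".")[0])
    path = []
    if current_major == 7 and target_major == 8 and current != bridge:
        path.append(bridge)
    path.append(target)
    return path


print(plan_upgrade_path("7.10", "8.9"))
```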

A Look at our Working Architecture :

[The next section is followed by the details]

Outcome is: 

Details:

Automation Flow Explanation:

UpgradeTime - 2 :

  • Input all details for the cluster to be upgraded, verify that the cluster is intact, and make a note of what the cluster-intact signal is going to look like.

UpgradeTime - 1 :

  • Ping from the Salt Master to all its minions/nodes.
  • Perform version checks on all nodes.
  • Check that each node’s config matches the git config of that cluster.
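
The three pre-upgrade checks can be sketched as one pure function over data that has already been collected. How the real tool gathers ping results, versions, and configs through Salt is not shown in this post, so every name below (`preflight_discrepancies`, the dictionaries, the config hashes) is an assumption for illustration.

```python
def preflight_discrepancies(expected_nodes, ping_results, versions,
                            expected_version, node_config_hashes, git_config_hash):
    """Collect every problem found before the upgrade; an empty list means go."""
    problems = []
    for node in expected_nodes:
        if not ping_results.get(node, False):
            # An unreachable node cannot be checked any further.
            problems.append(f"{node}: no response to ping")
            continue
        if versions.get(node) != expected_version:
            problems.append(
                f"{node}: version {versions.get(node)} (expected {expected_version})"
            )
        if node_config_hashes.get(node) != git_config_hash:
            problems.append(f"{node}: config differs from git")
    return problems


# Hypothetical inputs: node3 is unreachable and node2 is on the wrong version.
problems = preflight_discrepancies(
    expected_nodes=["node1", "node2", "node3"],
    ping_results={"node1": True, "node2": True, "node3": False},
    versions={"node1": "7.10", "node2": "7.17"},
    expected_version="7.10",
    node_config_hashes={"node1": "abc", "node2": "abc"},
    git_config_hash="abc",
)
for problem in problems:
    print(problem)
```

Returning the full list of discrepancies, rather than stopping at the first one, lets an SRE fix everything in one pass before the upgrade starts.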

UpgradeTime :

What is the Cluster Intact Flow?

  1. Are all nodes in the cluster, as per the data recorded at UpgradeTime - 2?
  2. Is the cluster green?
  3. Is initializing_shards < 10 and unassigned_shards = 0?
  4. Is active_shards_percent_as_number 100%?
  5. Are the expected nodes upgraded to the latest version?
  6. If all 5 checks above pass, give a green signal.
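
The cluster-intact checks can be sketched as a function over an already-fetched cluster health response. The `status`, `initializing_shards`, `unassigned_shards`, and `active_shards_percent_as_number` fields are part of the Elasticsearch cluster health API; the function name, its arguments, and the way node lists and versions are passed in are assumptions for this example.

```python
def cluster_intact(health, nodes_in_cluster, expected_nodes,
                   node_versions, upgraded_nodes, target_version):
    """Evaluate the five checks; return (green_signal, per-check results)."""
    checks = {
        # 1. Every node recorded at UpgradeTime - 2 is still in the cluster.
        "all_nodes_present": set(expected_nodes) <= set(nodes_in_cluster),
        # 2. Cluster status is green.
        "status_green": health["status"] == "green",
        # 3. Few initializing shards and no unassigned shards.
        "shards_settled": health["initializing_shards"] < 10
        and health["unassigned_shards"] == 0,
        # 4. All shards are active.
        "all_shards_active": health["active_shards_percent_as_number"] == 100,
        # 5. Nodes that should already be upgraded report the target version.
        "versions_ok": all(
            node_versions.get(node) == target_version for node in upgraded_nodes
        ),
    }
    return all(checks.values()), checks


# Hypothetical health response and node data, for illustration only.
health = {
    "status": "green",
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "active_shards_percent_as_number": 100.0,
}
green, checks = cluster_intact(
    health,
    nodes_in_cluster=["n1", "n2"],
    expected_nodes=["n1", "n2"],
    node_versions={"n1": "8.9", "n2": "7.17"},
    upgraded_nodes=["n1"],
    target_version="8.9",
)
print(green, checks)
```

Returning the individual check results alongside the overall signal makes it easy to report which check blocked the upgrade.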

What is the Upgrade Execution Flow?

   On Each VM

  • Download required package from in-house mirror.
  • Back up config files to a safe location
  • Stop ES
  • Upgrade ES
  • Start ES
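
The per-VM steps above can be sketched as an ordered sequence that halts on the first failure or when the kill switch is set. The step names mirror the list; `run_step` and `kill_switch` are hypothetical callables standing in for whatever the real tool uses to execute commands on a VM.

```python
UPGRADE_STEPS = ["download_package", "backup_config", "stop_es", "upgrade_es", "start_es"]


def upgrade_vm(vm, run_step, kill_switch=lambda: False):
    """Run the steps in order on one VM; stop at the first failure or kill switch.

    `run_step(vm, step)` is a hypothetical callable that returns True on success.
    """
    completed = []
    for step in UPGRADE_STEPS:
        if kill_switch():
            return {"vm": vm, "ok": False, "completed": completed,
                    "stopped_at": step, "reason": "kill switch"}
        if not run_step(vm, step):
            return {"vm": vm, "ok": False, "completed": completed,
                    "stopped_at": step, "reason": "step failed"}
        completed.append(step)
    return {"vm": vm, "ok": True, "completed": completed,
            "stopped_at": None, "reason": None}


def fake_runner(vm, step):
    """Stand-in runner that pretends the package upgrade itself fails."""
    return step != "upgrade_es"


result = upgrade_vm("vm1", fake_runner)
print(result["stopped_at"], result["completed"])
```

Recording which steps completed tells a rollback or retry exactly where the VM was left, for example stopped but not yet upgraded.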

Journey Insights :

  • Written in Python.
  • It took ~35 days cumulatively to code and contains 3 phases.
  • As of now 11 upgrades have been done successfully in prod with this module.
  • Rollback automation was required in the stage environment for continuous testing.
  • Guardrails ensure that no more than 4 VMs are down at any time, which is OK since the cluster is BM aware.
  • The success rate of the upgrade is improved in the mapper phase, i.e. at `UpgradeTime - 1` itself, before the actual upgrade kicks off, via Salt ping checks, version checks, and discrepancy checks.
  • Enabling ES Upgrader for other teams’ ES clusters in the org is possible (in progress).
  • For each test case/feature/task, Jira has details on the why, what, and how, so we don’t have to look back and rediscover the reasons for the same item again.
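
One way the 4-VM guardrail could be realized is to batch nodes by BM, so that every VM that is down at a given moment belongs to the same BM. This is a sketch under that assumption and is not the tool's actual scheduling logic; `plan_batches` and the node naming are hypothetical.

```python
from collections import defaultdict

MAX_VMS_DOWN = 4  # guardrail from the list above


def plan_batches(node_to_bm, max_down=MAX_VMS_DOWN):
    """Group nodes into batches; each batch stays within one BM and the limit."""
    by_bm = defaultdict(list)
    for node, bm in sorted(node_to_bm.items()):
        by_bm[bm].append(node)
    batches = []
    for bm in sorted(by_bm):
        nodes = by_bm[bm]
        for start in range(0, len(nodes), max_down):
            batches.append(nodes[start:start + max_down])
    return batches


# Hypothetical cluster: 2 BMs with 4 VMs each.
node_to_bm = {f"BM{bm}.Node{n}": f"BM{bm}" for bm in (21, 43) for n in range(1, 5)}
for batch in plan_batches(node_to_bm):
    print(batch)
```

Because replicas never share a BM with their primary, taking down one BM's worth of VMs at a time still leaves a copy of every shard available.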