PhonePe’s Server State Management via Senzu and PIOUS: An Overview
Surya M and Krishna Prasanth, 19 April, 2023
Following up on the blog about PhonePe’s Internal Cloud Provisioning service, this blog addresses inventory addition flow management that makes it easier for the Procurement team, Site Operations (Siteops) team and the Tools team to track assets coming into our data centers, their health and much more.
In the previous blog, we explained the overarching structure of PIOUS (PhonePe Inventory Operational Unit System). In this blog, we will dive into some of its aspects listed below – more specifically on how we maintain the server asset’s different states:
- Asset Discovery
- Asset Addition
- Inventory Evaluation
- Asset Firmware Updates
A high-level visualization of the flows across our different domains is captured below:
We will walk through flows across those domains as well as the contexts below.
Procurement
Procurement is where all things begin. We have built a system called Chakshu (CXU) that takes care of all data center-related asset procurements, including capturing the context around the hardware order.
This tool is used to register vendors and their associated hardware profiles, which are versioned. Versioning captures how a configuration changes over time, which in essence captures the different generations of hardware.
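As a rough illustration of versioned hardware profiles, here is a minimal sketch. The class names, fields and sample values are hypothetical and are not taken from CXU; the only idea carried over from the text is that each configuration change becomes a new version while older versions remain available.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HardwareProfileVersion:
    # Hypothetical fields; a real profile would carry many more attributes.
    version: int
    cpu_model: str
    memory_gb: int


@dataclass
class HardwareProfile:
    vendor: str
    name: str
    versions: list = field(default_factory=list)

    def add_version(self, cpu_model, memory_gb):
        """Append a new immutable version; earlier versions stay queryable."""
        new_version = HardwareProfileVersion(len(self.versions) + 1, cpu_model, memory_gb)
        self.versions.append(new_version)
        return new_version

    def latest(self):
        """Return the most recent version of this profile."""
        return self.versions[-1]


profile = HardwareProfile(vendor="vendor-a", name="compute-node")
profile.add_version(cpu_model="gen-1-cpu", memory_gb=256)
profile.add_version(cpu_model="gen-2-cpu", memory_gb=512)
```

Keeping versions immutable means a purchase order can keep pointing at the exact configuration it was raised against, even after a newer generation is added.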
These purchase requests create purchase orders internally, depending on the combination of vendor and deployment region. The procurement team can break a purchase order down into multiple delivery stages and update the order state as it moves through its transitions, which notifies the different stakeholders and keeps them in the loop.
Senzu
Senzu is another in-house utility that captures, stores and acts on the server asset. In order to efficiently operate, there is a Directed Acyclic Graph (DAG) model that implements guarantees on the asset’s state management.
It also takes care of maintaining consistency with server asset state – such as hardware and network related checks, auto tagging errors and firmware update orchestration. Most of the edges in the DAG are related to PIOUS <-> Senzu framework communication.
Each vertex in the graph is defined as a Bean in the Senzu framework and amounts to a well-defined task. The DAG defines the edges linking the beans to form a composite task. Some of the beans are as follows:
- Get the asset’s detail – is it a new inventory, a SPARE or a BAREMETAL
- Understand if the asset is under evaluation for Senzu to activate the flow further
- Validate with internal systems if the asset’s state needs to be updated in our backend to reflect the current state
- Check on the firmware states
- Power off the asset if it is a SPARE
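The bean list above can be sketched as a small DAG. This is only an illustration under stated assumptions: the function names, the `ctx` dictionary and the action strings are hypothetical, and the linear ordering of the edges is a guess, not Senzu's actual graph. It uses Python's standard `graphlib.TopologicalSorter` to run each bean after its predecessors.

```python
from graphlib import TopologicalSorter


# Hypothetical beans modelled on the list above. Each bean reads the shared
# context and may append an instruction for PIOUS to execute.
def get_asset_detail(ctx):
    ctx["kind"] = ctx["asset"].get("kind", "NEW")


def check_under_evaluation(ctx):
    ctx["proceed"] = ctx["asset"].get("under_evaluation", True)


def sync_backend_state(ctx):
    if ctx["proceed"] and ctx["asset"].get("backend_state") != ctx["kind"]:
        ctx["actions"].append("UPDATE_BACKEND_STATE")


def check_firmware(ctx):
    if ctx["proceed"] and ctx["asset"].get("firmware_outdated"):
        ctx["actions"].append("UPDATE_FIRMWARE")


def power_off_if_spare(ctx):
    if ctx["proceed"] and ctx["kind"] == "SPARE":
        ctx["actions"].append("POWER_OFF")


# Each key maps a bean to the set of beans that must run before it.
DAG = {
    get_asset_detail: set(),
    check_under_evaluation: {get_asset_detail},
    sync_backend_state: {check_under_evaluation},
    check_firmware: {sync_backend_state},
    power_off_if_spare: {check_firmware},
}


def run_dag(asset):
    """Walk the beans in dependency order and collect instructions for PIOUS."""
    ctx = {"asset": asset, "actions": []}
    for bean in TopologicalSorter(DAG).static_order():
        bean(ctx)
    return ctx["actions"]


spare = {"kind": "SPARE", "backend_state": "SPARE", "firmware_outdated": True}
actions = run_dag(spare)
```

Because each bean derives its output from the asset's current state alone, running the same DAG again for the same state produces the same instructions.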
Senzu is an orchestration layer that understands the complete state and configuration a server asset should be in. Whenever PIOUS boots up, it asks Senzu what should be done as the next step. Senzu transitions the server asset across the different beans and instructs PIOUS to execute whatever is relevant. Since the beans are idempotent, PIOUS can ask Senzu on every boot, and Senzu walks through the beans again and applies only what is relevant for that session, if anything.
A high-level interaction between PIOUS and Senzu is captured in the infographic below. More explanation on some of these beans will follow in this article.
PIOUS
PIOUS (PhonePe Inventory Operational Unit System) is our own custom initrd image that starts Senzu’s flow and works with Senzu to get the asset to an ideal state. It helps in handling new inventories and in the asset’s lifecycle management; both are detailed below.
Asset Discovery
This process involves the Siteops engineer deploying the delivered servers onto the racks. Before getting into the details, let’s understand how we have laid out the DHCP service in the network.
The BMC network and the provisioning network have each been split into two distinct networks, one for recognized assets and one for yet-to-be-recognized assets. Recognized assets are those actively considered for production usage. Unrecognized assets are those that have been recently racked and have not yet passed the scrutiny and asset quality validation stages.
These unrecognized assets are allocated a short lease time, in minutes, so that they can acquire a permanent lease once they “graduate” through the process explained in the upcoming section. In addition, unrecognized assets are instructed to PXE boot by default during reboots. For unrecognized assets, PXE servers are configured to serve only PIOUS (our custom initrd image) via the provisioning network. PIOUS is our own custom Ubuntu with certain optimizations, overrides and business logic for managing the inventory lifecycle.
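The lease policy described above can be sketched as a simple decision function. Everything specific here is an assumption for illustration: the lease durations, the pool names, the `pious-initrd` label and the lookup by MAC address are hypothetical, and the boot behavior of recognized assets is deliberately left unmodelled.

```python
from datetime import timedelta

# Hypothetical values; the text only says unrecognized leases last minutes.
SHORT_LEASE = timedelta(minutes=5)
LONG_LEASE = timedelta(days=7)


def dhcp_offer(mac, recognized_macs):
    """Pick the pool, lease time and default boot behavior for a DHCP client."""
    recognized = mac.lower() in recognized_macs
    lease = LONG_LEASE if recognized else SHORT_LEASE
    return {
        "pool": "recognized" if recognized else "unrecognized",
        "lease_seconds": int(lease.total_seconds()),
        # Unrecognized assets PXE boot by default and are only served PIOUS.
        "pxe_boot_default": not recognized,
        "boot_image": None if recognized else "pious-initrd",
    }


known = {"aa:bb:cc:00:00:01"}
new_server = dhcp_offer("AA:BB:CC:00:00:99", known)
graduated = dhcp_offer("AA:BB:CC:00:00:01", known)
```

A short lease means that a server which graduates does not hold on to its temporary address for long before it can be handed a permanent one.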
The server asset, once racked, comes up on PIOUS OS through a short leased IP via DHCP, thus making PIOUS act as a vehicle for ASSET DISCOVERY and an entry point into our Asset Lifecycle Flow.
Upon booting, PIOUS loads only the necessary kernel drivers and services. After a successful boot, PIOUS emits a BOOT_EVENT to Senzu (detailed later) and starts a whoami bean/task to identify which stage of the asset lifecycle the server is in, and adapts its subsequent tasks accordingly. For example, once the server is identified as new, it starts with Server Health Verification, which represents the health of the system as a whole.
System attributes, network attributes (such as interface states, the VLANs for the provisioning and management networks, and LLDP information) and hardware characteristics are captured and verified against the criteria necessary for graduating as a production-worthy entity. All this information is pushed to the backend and accessed through the User Interface, where any error states are highlighted.
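A check of this kind can be sketched as a function that compares observed attributes with expected ones and returns a list of errors. The attribute names and sample values below are hypothetical; the actual criteria PIOUS applies are not described in this article.

```python
def validate_asset(observed, expected):
    """Return human-readable errors; an empty list means the asset passes."""
    errors = []
    # Every expected interface must be present and up.
    for iface in expected["interfaces"]:
        state = observed.get("interfaces", {}).get(iface, "missing")
        if state != "up":
            errors.append(f"interface {iface} is {state}")
    # VLAN assignments must match what the rack position expects.
    for key in ("provisioning_vlan", "management_vlan"):
        if observed.get(key) != expected[key]:
            errors.append(f"{key} is {observed.get(key)}, expected {expected[key]}")
    # LLDP tells us which switch port the server is cabled to.
    if not observed.get("lldp_neighbor"):
        errors.append("no LLDP neighbor reported")
    return errors


expected = {"interfaces": ["eth0", "eth1"], "provisioning_vlan": 100, "management_vlan": 200}
observed = {
    "interfaces": {"eth0": "up", "eth1": "down"},
    "provisioning_vlan": 100,
    "management_vlan": 201,
    "lldp_neighbor": "tor-switch-01",
}
errors = validate_asset(observed, expected)
```

Returning every error at once, rather than stopping at the first, lets a single pass show the engineer everything that needs fixing.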
PIOUS periodically re-checks its attributes and states and lets the engineer racking the server know when the system is not in the expected state, so that it can be rectified immediately. This instant feedback helps the engineer ascertain the nature of the issue and remediate it on the spot. Each refresh of these checks picks up any changed attributes and reflects them on the console for the engineer. The same data is collated centrally in the User Interface, which helps in coordinating remediation tasks that may need manual work.
This process helps us capture inconsistencies in our infrastructure earlier in the cycle. The feedback that PIOUS generates as part of discovery and evaluation gives visibility into the efficiency of our processes and systems in place, and helps us improve them. It also helps us establish the quality of the vendor’s configuration.
We also use this process to automatically validate the Purchase Order that the asset is related to. If a matching order is found, the loop with CXU is closed; otherwise, the asset is treated as unknown and an alert is raised for the concerned parties to act upon.
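The purchase order check can be pictured as a lookup of the discovered asset against open orders. This is a sketch under assumptions: matching by serial number, the dictionary layout and the status strings are all hypothetical, since the article does not say how PIOUS and CXU identify the order.

```python
def match_purchase_order(serial, purchase_orders):
    """Link a discovered asset to an open order, or flag it as unknown."""
    for po in purchase_orders:
        if serial in po["expected_serials"]:
            # Record receipt so the order reflects what has been racked.
            po["received_serials"].add(serial)
            return {"status": "MATCHED", "po_id": po["id"]}
    # No order expects this asset: raise an alert for follow-up.
    return {"status": "UNKNOWN_ASSET", "alert": True}


orders = [
    {"id": "PO-1", "expected_serials": {"SN-001", "SN-002"}, "received_serials": set()},
]
matched = match_purchase_order("SN-001", orders)
unknown = match_purchase_order("SN-999", orders)
```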
To summarize, PIOUS helps us in:
- Validating the server hardware and network details
- Giving instant feedback to the Siteops team to remediate issues faster
- Closing the loop with the Purchase Order, confirming that what is racked came through a procurement cycle
- Triggering different state checks through the Senzu framework, explained later
Asset Lifecycle Management
With PIOUS in place, the flow described here applies more specifically to recognized server assets that have passed through scrutiny. The provisioning process, where a SPARE is picked to create a production BAREMETAL, goes through the flow in the flowchart below:
In more detail, PIOUS periodically checks the health of the machine and other attributes crucial for it to be termed production-ready, and updates them in the backend. Through that, the alerting process is invoked in case of bad health or configuration.
PIOUS has a server component that accepts communication only from certain approved (whitelisted) sources. When the server gets a request, it validates the asset on which it is running against the backend and then switches the state accordingly from SPARE to production BAREMETAL.
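The request handling described above can be sketched as an allowlist check followed by a state check. The network range, the return values and the in-memory `backend_state` dictionary are hypothetical stand-ins; the real component talks to the backend rather than to a dictionary.

```python
import ipaddress

# Hypothetical allowlist of approved source networks.
ALLOWED_SOURCES = [ipaddress.ip_network("10.0.0.0/28")]


def handle_provision_request(source_ip, serial, backend_state):
    """Accept a request only from an approved source and only for a SPARE."""
    addr = ipaddress.ip_address(source_ip)
    if not any(addr in net for net in ALLOWED_SOURCES):
        return "REJECTED_SOURCE"
    # Validate the asset against the backend before changing anything.
    if backend_state.get(serial) != "SPARE":
        return "REJECTED_STATE"
    backend_state[serial] = "BAREMETAL"
    return "ACCEPTED"


backend = {"SN-001": "SPARE"}
first = handle_provision_request("10.0.0.5", "SN-001", backend)
second = handle_provision_request("10.0.0.5", "SN-001", backend)
outsider = handle_provision_request("10.0.1.5", "SN-001", backend)
```

The second call is rejected because the asset is no longer a SPARE, which shows why validating against the backend matters even for approved sources.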
When a BAREMETAL gets deleted, for instance because it is no longer actively used, its DHCP configuration is updated accordingly and the server is rebooted so that it PXE boots back into PIOUS. Once it boots into PIOUS, it runs its checks, sends its latest machine stats to the backend and triggers Senzu to go through the different steps.
Once Senzu is done with its work, PIOUS is instructed to shut itself down. The server is shut down to save power: even an idle non-production server consumes a considerable amount of power, which we would like to avoid.
Firmware Updates
We have a requirement that all our hardware, both the servers in production and those kept as SPARE, can have its firmware updated.
The two types of firmware that we want to detail are:
- BMC
- BIOS
Firmware upgrades for other components, such as NICs, are also considered part of this flow but are not covered here.
BMC
The Baseboard Management Controller (BMC) is a module on the motherboard that helps in administering the server remotely. Without the BMC, site operations personnel would have to physically attend to the server to accomplish these activities, which is highly impractical given the vast server count in data centers.
The BMC usually has API support and a corresponding User Interface that shows the state of the server and lets us take actions on it, e.g., power cycling a server or looking at the health states of different hardware and peripherals.
The BMC’s User Interface shows the server characteristics visually, whereas the API support (Redfish API) offers a programmatic way to query or control the BMC. This helps us maintain the state of the cloud in a controlled and consistent manner.
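As an example of the programmatic route, the sketch below builds a Redfish request that restarts a server. The `ComputerSystem.Reset` action and the `ResetType` value `ForceRestart` come from the Redfish standard; the host name and system ID are placeholders, system IDs and supported reset types vary by vendor, and authentication is omitted. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_reset_request(bmc_host, system_id, reset_type="ForceRestart"):
    """Build (but do not send) a Redfish ComputerSystem.Reset request."""
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and system ID, for illustration only.
request = build_reset_request("bmc.example.internal", "1")
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` together with the credentials and TLS settings the BMC requires.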
BIOS
The BIOS (Basic Input/Output System) is a small program stored in non-volatile memory (such as EEPROM or flash) on the motherboard that manages start-up tasks and configurations when the server is powered on. Sometimes the BMC UI can be used to configure BIOS options, which become effective after a reboot.
The BMC and the BIOS each have their own firmware and an update process that moves through versions. It is important to track these firmware versions in our backend and to have a process for upgrading en masse when the decision is made to move to a new version. There can be many reasons to do this:
- New firmwares could have security fixes that are crucial to apply
- New firmwares could be more performant than the older ones and have support for more configurations or options
We go ahead with version upgrades after validating them on test candidates. From a product standpoint, these are the possibilities we deal with:
- Upgrade only BMC firmware, but not BIOS
- Upgrade only BIOS firmware, but not BMC
- Upgrade both BMC and BIOS to specific versions
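The three possibilities above reduce to comparing current and target versions per component. A minimal sketch, assuming versions are plain strings and that a missing target means "leave this component alone"; the function name and version values are hypothetical.

```python
def components_to_upgrade(current, target):
    """Return the components whose target version is set and differs from current."""
    return [
        component
        for component in ("BMC", "BIOS")
        if target.get(component) is not None
        and current.get(component) != target[component]
    ]


current = {"BMC": "1.0", "BIOS": "2.0"}
bmc_only = components_to_upgrade(current, {"BMC": "1.1"})
bios_only = components_to_upgrade(current, {"BIOS": "2.1"})
both = components_to_upgrade(current, {"BMC": "1.1", "BIOS": "2.1"})
nothing = components_to_upgrade(current, {"BMC": "1.0", "BIOS": "2.0"})
```

A host that is already at the target versions yields an empty list, so re-running the check after a completed upgrade does nothing.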
A visual representation of the DAG in Senzu is shown below:
The activity CHECK_COMPONENTS_TO_UPGRADE leads into the validation phases that check the BMC and BIOS versions, and it has its own DAG to flow through in order to achieve the goal. Since this flow is repeatable, the DAG can simply be run again independently after a temporary failure.
Using this framework, we can run hundreds of firmware upgrades in parallel, either en masse or on a given list of candidate hosts. There are appropriate checks and balances to ensure that only qualified hosts go through the upgrade process, especially the ones that are in production.
Besides BIOS and BMC firmware, Senzu can update other system firmware as required, following the same approach. Senzu is configured to take management network capacity into consideration and to ensure that firmware update tasks are atomic. The integrity of a successful firmware update is guaranteed by the BMC module, and this is captured by Senzu.
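Running upgrades in parallel while respecting network capacity can be sketched with a bounded worker pool. This is an illustration only: `max_parallel` stands in for whatever limit the management network imposes, and `upgrade_one` and `is_qualified` are hypothetical callables supplied by the caller, not Senzu APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def upgrade_fleet(hosts, upgrade_one, is_qualified, max_parallel=4):
    """Upgrade qualified hosts in parallel, at most max_parallel at a time."""
    # Only hosts that pass the qualification check are touched.
    qualified = [host for host in hosts if is_qualified(host)]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() preserves input order, so results line up with hosts.
        results = list(pool.map(upgrade_one, qualified))
    return dict(zip(qualified, results))


hosts = ["host-1", "host-2", "host-3"]
outcome = upgrade_fleet(
    hosts,
    upgrade_one=lambda host: f"{host}: upgraded",
    is_qualified=lambda host: host != "host-2",
    max_parallel=2,
)
```

Capping the pool size keeps the number of firmware images in flight bounded, regardless of how many candidate hosts are passed in.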
Senzu acts as an abstraction layer above the firmware update feature (firmware inventory, firmware validation, firmware flash) supported by the BMC module and works in lock-step with PIOUS to establish this interaction, as specified in Figure 3.
Summary
PIOUS, as a solution, helps the team close the loop between the procurement process and the asset deployment process, thereby giving visibility into the servers racked and the purchase orders closed. Senzu helps in orchestrating the PIOUS part of the flow as well as what lies beyond it.
Besides this visibility, it gives the Siteops team upfront feedback about the health of the asset as well as its connectivity to the network.
In addition, PIOUS helps with lifecycle management by migrating a SPARE into a production-worthy BAREMETAL and vice versa. When a server is migrated back to being a SPARE, PIOUS shuts it down in order to save power.
With the rapid pace at which PhonePe is growing, the infrastructure and its quality have to keep up with expectations. These solutions help us keep ourselves honest and enable PhonePe to use the hardware and the data center cost-effectively while addressing the scale of PhonePe’s traffic growth.
As we mature and become more and more adept at this, we look forward to sharing our thoughts and approaches with the community on how we will go about accomplishing these tasks.
If you’re interested in being a part of further improving the PPEC ecosystem or if you’re aligned with what PhonePe as a company does, we welcome you to apply for open positions at https://www.phonepe.com/careers/