Architecture

Here’s everything you need to know about PhonePe’s Internal Cloud Provisioning service

Krishna Prasanth, 20 May, 2022

PhonePe Cloud, commonly referred to as PPEC, is PhonePe’s internal cloud provisioning service. Over many years of learning, it has been adapted to PhonePe’s internal network architecture and to the way teams are set up and carry out their daily work.

PhonePe started with a simple, small and functional provisioning flow that used Mesos for Virtual Machine placement. It was operated by a few people who relied on tribal knowledge of how the tool worked and where to make changes when it broke. However, as PhonePe grew, we added more and more servers to handle data and traffic, and more users started using the tool, including those without the prior knowledge needed to troubleshoot it. The provisioning product therefore had to evolve to give users the stability, control, and reliability they were looking for. That’s when PPEC stepped in as a solution from the Tools team at PhonePe. It has been developed based on an understanding of how users use the tool, what they expect from it, and the internal architecture of both the technology and network stacks.

PPEC started with three main considerations:

A client that users can use across the different regions we serve, as long as the client is authorized to perform the activity
Auditing of activity, as well as handling of changelog scenarios
An abstraction covering the whole asset lifecycle, from racking up the server and converting it into inventory, to provisioning it and subsequently retiring it

PPEC is built taking these considerations into account and is in the process of addressing other scenarios to make the provisioning process even more efficient and user-friendly.
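
The third consideration, an abstraction over the whole asset lifecycle, can be pictured as a small state machine. The following is a minimal sketch only: the state names and the allowed transitions are assumptions made for illustration and are not PPEC’s actual model.

```python
from enum import Enum


class AssetState(Enum):
    # Hypothetical state names, loosely following the lifecycle described above.
    RACKED = "racked"
    INVENTORY = "inventory"
    PROVISIONED = "provisioned"
    RETIRED = "retired"


# Assumed transition rules; the real rules are not described in this post.
ALLOWED = {
    AssetState.RACKED: {AssetState.INVENTORY},
    AssetState.INVENTORY: {AssetState.PROVISIONED, AssetState.RETIRED},
    AssetState.PROVISIONED: {AssetState.INVENTORY, AssetState.RETIRED},
    AssetState.RETIRED: set(),
}


def transition(current: AssetState, target: AssetState) -> AssetState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Keeping the transitions in a single table means every entry point (client, dashboard or SDK) can enforce the same rules.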

The tool has grown organically: it has adapted to various use cases, is optimized for our work requirements, and avoids bells and whistles that may remain unused. In practice, product feature usage within PPEC is at 100%, and we intend to keep it there so that we don’t develop features that burden the user or add little value.

At a very high level, users interact with the tool either through the client or through the UI, as seen in the flowchart below:

PPEC, as a tool, has multiple functions. Let’s dive into a few of them.

New Inventory Handling

PPEC has an OS called PIOUS (PhonePe Inventory Operational Unit System) that boots up on a new asset, audits it and “inventorizes” it. To inventorize an asset, PIOUS compares the spec of the hardware with the spec on the Purchase Order, while also checking the asset’s health and other characteristics, all while the asset sits on a network outside the production network. Once the comparisons and health status look fine, the asset is promoted, after which it qualifies for use in production.
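
The comparison step can be illustrated with a short sketch. This is not PIOUS code: the `Spec` fields, the `audit` function and the single boolean health flag are all hypothetical simplifications, chosen only to show the idea of checking detected hardware against what was ordered before promoting an asset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    # Hypothetical subset of the fields a purchase order might carry.
    cpu_cores: int
    memory_gb: int
    disk_count: int


def audit(detected: Spec, ordered: Spec, healthy: bool) -> list[str]:
    """Return a list of problems; an empty list means the asset can be promoted."""
    problems = []
    for field in ("cpu_cores", "memory_gb", "disk_count"):
        found, expected = getattr(detected, field), getattr(ordered, field)
        if found != expected:
            problems.append(f"{field}: purchase order says {expected}, found {found}")
    if not healthy:
        problems.append("health check failed")
    return problems
```

Returning the full list of mismatches, rather than stopping at the first one, gives whoever racks the server everything they need to fix in a single pass.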

We call these newly racked-up, promoted assets SPAREs. SPAREs are raw entities waiting to be picked up by teams that have a requirement. They are grouped into machine classes, which help users select the configuration of their choice. In the provisioning flow, PIOUS uses a kexec-related trigger, guarded by a “maker-checker” process, to load the OS of choice; this is covered later.

In the picture above, a module called IPA (IP Allocator) ensures that new inventory is assigned to a non-production network. The same network is used to promote the inventory to a production asset once all hardware and invoice integrity checks have been validated programmatically.

Baremetal Provisioning

PPEC’s provisioning flow uses PIOUS to initialize any new asset quickly and reliably, validating both hardware integrity and inventory characteristics along the way. PIOUS helps transition an asset from a SPARE to a production-worthy BAREMETAL that contains hardened images. A high-level flow is captured in the image below:

Virtual Machine Management

PPEC uses KVM along with libvirt and QEMU, with Mesos as a “placement” coordinator. Each team has its own Mesos cluster made up of its participating baremetals, and VMs are placed on those baremetals based on Mesos’ visibility into them. A high-level flow is captured in the image below:

We are in the process of migrating to a non-Mesos, entirely algorithmic way of placing VMs based on the user’s requirements.
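
To give a flavour of what algorithmic placement can look like, here is a minimal best-fit sketch. It is purely illustrative: this post does not say which algorithm PPEC uses, and the `Host` fields and `place_vm` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    # Hypothetical view of a baremetal's remaining capacity.
    name: str
    free_cpus: int
    free_memory_gb: int


def place_vm(hosts: list[Host], cpus: int, memory_gb: int) -> Optional[Host]:
    """Best-fit: pick the host that fits the VM with the least capacity left over."""
    candidates = [
        h for h in hosts
        if h.free_cpus >= cpus and h.free_memory_gb >= memory_gb
    ]
    if not candidates:
        return None  # no baremetal can hold this VM
    chosen = min(
        candidates,
        key=lambda h: (h.free_memory_gb - memory_gb, h.free_cpus - cpus),
    )
    # Reserve the capacity so the next placement sees the updated view.
    chosen.free_cpus -= cpus
    chosen.free_memory_gb -= memory_gb
    return chosen
```

Best-fit packs VMs tightly and keeps large baremetals free for large requests. A real placer would also need to weigh constraints this sketch ignores, such as rack spread or anti-affinity.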

The PPEC functions explained above are executed through three mediums – PPEC Client, PPEC Dashboard and PPEC SDK.

PPEC Client

The client lets users send requests to the PPEC ecosystem that express their intent. These intents can be broadly classified as:

Ability to know assets under a particular region – for a team or for a project
Ability to tag assets fluidly for easier discoverability
Ability to manage network characteristics of the team’s project
Cluster creation capabilities
Enhanced auditability of commands run
RBAC on the access users can have on the regions/assets or even finer details
Maker-Checker process to handle state changing commands
Manage IP and network configurations of the user’s teams/projects
Baremetal and VM provisioning
Seamless region switching in order to handle different environments
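
The maker-checker item in the list above can be sketched in a few lines. This is a minimal illustration of the general pattern, not PPEC’s implementation: `ChangeRequest`, `approve` and `execute` are hypothetical names, and the command is just a string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRequest:
    command: str
    maker: str
    approved_by: Optional[str] = None


def approve(request: ChangeRequest, checker: str) -> None:
    """Record approval; the checker must be a different user from the maker."""
    if checker == request.maker:
        raise PermissionError("maker cannot approve their own request")
    request.approved_by = checker


def execute(request: ChangeRequest) -> str:
    """Run the command only after a checker has approved it."""
    if request.approved_by is None:
        raise PermissionError("request has not been approved")
    return f"executed: {request.command}"
```

The key property is that a state-changing command needs two distinct people: one to raise it and another to approve it.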

PPEC Dashboard

PPEC Dashboard provides a single place for users to view details about assets across the regions they have access to. The UI complements the client, though not yet completely; work on closing the gap is in progress. The dashboard almost mirrors the client and can switch across regions to access functionality specific to each one.

The dashboard is currently used for the following:

Summary of the assets across regions as well as available space at the DC/rack level
Searching and filtering assets and viewing asset details
Setting up PPEC Client credentials within the limits of the user’s own permissions. For example, a user with RW access on a region can generate multiple credentials with either R or RW access, and then use the least-privileged credential that suffices for a given activity with the PPEC client. This is similar in spirit to knowing when to use “sudo” and when not to, and it helps the user avoid unintentionally executing a potentially state-changing operation.
Alert management on the assets
Documentation of the whole PPEC ecosystem, including the SDK and Swagger UI
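
The credential scoping described in the list above follows the familiar least-privilege idea, which can be sketched as follows. The `Access` levels mirror the R and RW access mentioned in the text, but the function names and checks are assumptions for illustration, not the dashboard’s real logic.

```python
from enum import IntEnum


class Access(IntEnum):
    R = 1   # read-only
    RW = 2  # read-write


def issue_credential(user_access: Access, requested: Access) -> Access:
    """Issue a credential no more powerful than the user's own access."""
    if requested > user_access:
        raise PermissionError("cannot issue a credential above your own access")
    return requested


def is_allowed(credential: Access, state_changing: bool) -> bool:
    """Read-only credentials may not run state-changing operations."""
    required = Access.RW if state_changing else Access.R
    return credential >= required
```

A user with RW access who works with an R credential by default cannot change state by accident, and has to switch deliberately to an RW credential when a change is intended.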

PPEC SDK

The SDK provides a programmatic way of interacting with PPEC APIs. It is usually used for automation use cases and by monitoring scripts, with access governed by the roles of the client_ids in their configuration.

In a nutshell

Provisioning is a unique ecosystem that helps developers improve their skills, not just in programming but also in systems, networking, observability and the whole CI/CD pipeline. It covers as much breadth as possible, with multiple vendors implementing multiple new technologies and abstracting features. With new frameworks to adapt to constantly, PPEC is an ever-evolving proposition, and an exciting one at that.

If you’re interested in being a part of further improving the PPEC ecosystem or if you’re aligned with what PhonePe as a company does, we welcome you to apply for open positions at https://www.phonepe.com/careers/
