PhonePe Cloud, commonly referred to as PPEC, is PhonePe’s internal cloud provisioning service. Over many years of learning, it has been adapted to PhonePe’s internal network architecture and to the way teams are set up and carry out their daily work.
PhonePe started with a simple, small and functional provisioning flow that used Mesos for Virtual Machine placement. It was operated by a few people who knew how the tool worked and, relying on tribal knowledge, where to make changes when it broke down. However, as PhonePe grew, we added more and more servers to handle data and traffic, and more users started using the tool, including those without the background to troubleshoot it. The provisioning product therefore had to evolve to give users the stability, control and reliability they look for. That’s when PPEC stepped in as a solution from the Tools team at PhonePe. It has been developed around an understanding of how users use the tool, what they expect from it, and the internal architecture of both the technology and network stacks.
PPEC started with three main considerations:
- A single client that users can use across all the regions we serve, as long as the client is authorized to perform the activity
- Auditing of every activity, including changelog scenarios
- An abstraction over the full asset lifecycle: racking the server, converting it into inventory, provisioning it, and eventually retiring it
PPEC is built taking these considerations into account and is in the process of addressing other scenarios to make the provisioning process even more efficient and user-friendly.
The tool has grown organically, adapting to various use cases and staying optimized for our work requirements, without bells and whistles that would go unused. In practice, feature usage within PPEC is at 100%, and we intend to keep it there so that we don’t build features that burden users or add little value.
At a very high level, the users interface with the tool – either through the client or the UI – as seen in the flowchart below:
PPEC, as a tool, has multiple functions. Let’s dive into a few of them.
New Inventory Handling
PPEC has an OS called PIOUS (PhonePe Inventory Operational Unit System) that boots on a new asset, audits it and “inventorizes” it. To inventorize, PIOUS compares the spec of the hardware with the spec on the Purchase Order, and also checks the asset’s health and other characteristics, all on a network outside the production network. Once the comparison and health status look fine, the asset is promoted, after which it qualifies for use in production.
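The comparison step can be pictured with a minimal Python sketch. This is not PIOUS code: the `HardwareSpec` fields, `spec_mismatches` and `can_promote` are hypothetical names chosen for illustration, and the real checks likely cover many more attributes.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HardwareSpec:
    # Hypothetical attributes; the real Purchase Order spec is not described in the post.
    cpu_cores: int
    memory_gb: int
    disk_tb: float
    nic_count: int


def spec_mismatches(ordered: HardwareSpec, detected: HardwareSpec) -> dict:
    """Return {field: (ordered_value, detected_value)} for every field that differs."""
    ordered_fields = asdict(ordered)
    detected_fields = asdict(detected)
    return {
        name: (ordered_fields[name], detected_fields[name])
        for name in ordered_fields
        if ordered_fields[name] != detected_fields[name]
    }


def can_promote(ordered: HardwareSpec, detected: HardwareSpec, healthy: bool) -> bool:
    """An asset is promotable only if it is healthy and matches the ordered spec."""
    return healthy and not spec_mismatches(ordered, detected)


purchase_order = HardwareSpec(cpu_cores=64, memory_gb=512, disk_tb=7.68, nic_count=2)
detected = HardwareSpec(cpu_cores=64, memory_gb=384, disk_tb=7.68, nic_count=2)
print(spec_mismatches(purchase_order, detected))  # reports the memory_gb difference
```

Returning the mismatching fields, rather than a bare boolean, keeps the audit trail useful: a rejected asset carries the reason it was not promoted.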
We call these newly racked, promoted assets SPAREs. SPAREs are raw entities waiting to be picked up by teams that have a requirement. They are identified by machine classes, which help users select the configuration of their choice. In the provisioning flow, PIOUS uses a kexec-related trigger, gated by a “maker-checker” step, to load the OS of choice; this will be covered later.
In the above picture, a module called IPA (IP Allocator) ensures that new inventory is placed on a non-production network. That same network is used to promote the inventory to a production asset once all hardware and invoice integrity checks have been validated programmatically.
PPEC’s provisioning flow uses PIOUS to initialize any new asset quickly and reliably, validating both hardware integrity and inventory characteristics. This OS helps transition an asset from a SPARE to a production-worthy BAREMETAL that contains hardened images. A high-level flow is captured in the image below:
Virtual Machine Management
PPEC uses KVM with libvirt and QEMU, and uses Mesos as a “placement” coordinator. A Mesos cluster that contains all the participating baremetals, set up as a team-specific cluster, is used to place VMs on those baremetals through Mesos’ visibility into them. A high-level flow is captured in the image below:
We are in the process of migrating to a non-Mesos, entirely algorithmic way of placing VMs based on the user’s requirements.
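To give a feel for what an algorithmic placement could look like, here is a minimal best-fit sketch in Python. The post does not say which algorithm PPEC is adopting, so the best-fit strategy, the `Baremetal` and `VMRequest` types, and the two resource dimensions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Baremetal:
    # Hypothetical view of a host's remaining capacity.
    name: str
    free_cpus: int
    free_memory_gb: int


@dataclass(frozen=True)
class VMRequest:
    cpus: int
    memory_gb: int


def place_vm(request: VMRequest, hosts: List[Baremetal]) -> Optional[Baremetal]:
    """Best-fit placement: pick the host that fits with the least capacity left over."""
    candidates = [
        host for host in hosts
        if host.free_cpus >= request.cpus and host.free_memory_gb >= request.memory_gb
    ]
    if not candidates:
        return None  # no baremetal in this cluster can hold the VM
    chosen = min(
        candidates,
        key=lambda h: (h.free_cpus - request.cpus, h.free_memory_gb - request.memory_gb),
    )
    # Reserve the capacity so later placements see the updated free resources.
    chosen.free_cpus -= request.cpus
    chosen.free_memory_gb -= request.memory_gb
    return chosen


cluster = [Baremetal("bm-a", 32, 128), Baremetal("bm-b", 8, 32)]
host = place_vm(VMRequest(cpus=8, memory_gb=32), cluster)
print(host.name if host else "no capacity")
```

Best-fit tends to keep large hosts free for large VMs; a real placer would probably also weigh rack spread, machine class and team ownership.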
The PPEC functions explained above are executed through three mediums – PPEC Client, PPEC Dashboard and PPEC SDK.
The PPEC Client sends requests to the PPEC ecosystem that express the user’s intent. Its capabilities can be broadly classified as:
- Ability to know assets under a particular region – for a team or for a project
- Ability to tag assets fluidly for easier discoverability
- Ability to manage network characteristics of the team’s project
- Cluster creation capabilities
- Enhanced auditability of commands run
- RBAC on the access users can have on the regions/assets or even finer details
- Maker-Checker process to handle state changing commands
- Manage IP and network configurations of the user’s teams/projects
- Baremetal and VM provisioning
- Seamless region switching in order to handle different environments
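The maker-checker item in this list can be sketched as follows. This is only an illustration of the general pattern, in which one person raises a state-changing request and a different person must approve it before it runs; `ChangeRequest`, `approve` and `execute` are hypothetical names, not PPEC's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChangeRequest:
    command: str                       # the state-changing command the maker wants to run
    maker: str                         # who raised the request
    approved_by: Optional[str] = None  # set once a checker approves
    executed: bool = False


def approve(request: ChangeRequest, checker: str) -> None:
    """Record an approval; the checker must be someone other than the maker."""
    if checker == request.maker:
        raise PermissionError("maker cannot approve their own request")
    request.approved_by = checker


def execute(request: ChangeRequest, run: Callable[[str], str]) -> str:
    """Run the command only after approval, and only once."""
    if request.approved_by is None:
        raise PermissionError("request has not been approved by a checker")
    if request.executed:
        raise RuntimeError("request has already been executed")
    result = run(request.command)
    request.executed = True
    return result


request = ChangeRequest(command="provision baremetal", maker="alice")
approve(request, checker="bob")
print(execute(request, run=lambda command: f"ran: {command}"))
```

Because the request object records both the maker and the checker, it can also serve as the audit record for the command.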
PPEC Dashboard provides a single place for users to view details about assets across the regions they have access to. The UI is meant to complement the client; it does not yet cover everything the client does, but that work is in progress. The dashboard closely mirrors the Client and can switch between regions to access region-specific functionality.
The dashboard is currently used for the following:
- Summary of the assets across regions as well as available space at the DC/rack level
- Searching and filtering assets and viewing asset details
- Setting up PPEC Client credentials within the limits of the user’s own permissions. E.g., a user with RW access on a region may want to generate multiple credentials, some with R access and some with RW access, so that they can use the least-privileged credential that suffices for an activity with the PPEC client. This is similar in spirit to deciding when to use “sudo” and when not to, and it helps the user avoid unintentionally executing a potentially state-changing operation.
- Alert management on the assets
- Documentation of the whole PPEC ecosystem, along with SDK as well as Swagger UI
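The credential scoping described in this list can be illustrated with a short Python sketch. It assumes just two access levels, R and RW, as in the example above; `Access`, `Credential`, `issue_credential` and `allows` are hypothetical names, not PPEC's actual model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Access(IntEnum):
    R = 1   # read-only
    RW = 2  # read and write


@dataclass(frozen=True)
class Credential:
    owner: str
    region: str
    access: Access


def issue_credential(owner: str, region: str, owner_access: Access,
                     requested: Access) -> Credential:
    """Issue a credential no more powerful than the owner's own access on the region."""
    if requested > owner_access:
        raise PermissionError("cannot issue a credential above the owner's own access")
    return Credential(owner=owner, region=region, access=requested)


def allows(credential: Credential, region: str, state_changing: bool) -> bool:
    """A credential permits an operation only in its region and at its access level."""
    if credential.region != region:
        return False
    needed = Access.RW if state_changing else Access.R
    return credential.access >= needed


read_only = issue_credential("alice", "region-1", owner_access=Access.RW, requested=Access.R)
print(allows(read_only, "region-1", state_changing=False))  # True
print(allows(read_only, "region-1", state_changing=True))   # False
```

Using the read-only credential for day-to-day lookups means a mistyped state-changing command is rejected rather than executed.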
The SDK provides a programmatic way of interacting with PPEC APIs. It is typically used for automation and for monitoring scripts, with access determined by the roles of the client_ids used in the configuration.
In a nutshell
Provisioning is a unique ecosystem that helps developers improve their skills, not just in programming but also in systems, networking, observability and CI/CD as a whole. It covers a lot of breadth, with multiple vendors implementing new technologies and abstracting features. With new frameworks to adapt to constantly, PPEC is an ever-evolving proposition, and an exciting one at that.
If you’re interested in being a part of further improving the PPEC ecosystem or if you’re aligned with what PhonePe as a company does, we welcome you to apply for open positions at https://www.phonepe.com/careers/