Virtual Machine Provisioning and Management in PhonePe

Umed Jadhav, Pratik Anurag and Krishna Thatta Prasanth
23 December, 2024

This blog post provides a high-level overview of PPeC (PhonePe Cloud) Agent’s architecture, its interaction with other key components like the PPeC API and PPeC Proxy, and how it optimizes tasks such as VM creation, resource allocation, and dynamic disk management.

Evolution and Need for PPeC Agent

Initially, the team used solutions like Mesos, Zookeeper, and Beanstalk, combined with a simple orchestrator, to manage basic VM tasks. Although this setup worked for smaller use cases and scale, it introduced complexity and inefficiency as demand and usage grew. Leveraging lessons learned from this experience, PPeC Agent was designed to streamline processes and support evolving infrastructure needs.

Current Scale: PPeC Agent manages bare-metal compute resources totalling roughly 350,000 cores and 3 PB of memory.

Performance: The average time for creating a VM is approximately 7 seconds, with support for batched creation and multiple placement policies.

High-Level Architecture and Workflow

The diagram provides a high-level overview of a typical user interaction workflow for managing the lifecycle of virtual machines (VMs) through the PPeC Agent.

- The process begins with an authenticated user initiating tasks through the PPeC CLI, which communicates securely with the PPeC API.
- PPeC API handles candidate selection and resource allocation, ensuring optimal VM placement.
- PPeC Proxy validates the PPeC API requests, checking for authentication and input integrity before forwarding the tasks via a Unix Domain Socket to the PPeC Agent.
- The Agent, acting as the executor, performs the required actions on the bare-metal host, such as VM creation, disk management, and NUMA tuning.

This seamless pipeline ensures efficient and secure VM provisioning while maintaining system performance and consistency.

Overview of Components

1. PPeC API

PPeC API works as an orchestrator and plays a central role in managing the entire VM lifecycle, acting as the decision-making engine for resource allocation and candidate selection. It coordinates with various services to ensure optimal placement and efficient utilization of resources across bare-metal systems. The key responsibilities of PPeC API are:

Candidate Selection and Resource Allocation

PPeC API evaluates available resources across multiple bare-metal hosts to select the most suitable candidate for VM placement. It ensures that the chosen host meets the requested VM’s CPU, memory, and storage requirements while balancing the overall system load.
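
The selection step described above can be sketched as a filter followed by a ranking. This is a minimal illustration, not PPeC API's actual logic: the `Host` and `VMRequest` types, their fields, and the "most headroom" ranking are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    # Hypothetical view of one bare-metal host's free capacity.
    name: str
    free_cores: int
    free_memory_gb: int
    free_disk_gb: int


@dataclass(frozen=True)
class VMRequest:
    # Hypothetical shape of a VM request.
    cores: int
    memory_gb: int
    disk_gb: int


def fits(host: Host, req: VMRequest) -> bool:
    """True if the host can satisfy CPU, memory, and storage at once."""
    return (
        host.free_cores >= req.cores
        and host.free_memory_gb >= req.memory_gb
        and host.free_disk_gb >= req.disk_gb
    )


def select_candidate(hosts: List[Host], req: VMRequest) -> Optional[Host]:
    """Filter hosts that fit, then prefer the one with the most headroom.

    The ranking is an assumed load-spreading policy; the post mentions
    multiple placement policies without describing them.
    """
    candidates = [h for h in hosts if fits(h, req)]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda h: (h.free_cores - req.cores, h.free_memory_gb - req.memory_gb),
    )
```

Returning `None` when nothing fits lets the caller reject the request early instead of sending a provisioning call that is bound to fail.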

Integration with IPDB Services

For network configuration, the PPeC API interacts with IPDB services to assign IP addresses and manage network resources. This seamless integration automates network provisioning and reduces manual intervention.

Provisioning Requests

Once the resource and network configurations are finalized, the PPeC API sends provisioning requests to the PPeC Proxy, which validates them before forwarding them to the PPeC Agent for execution.

High-Availability and Resilience

PPeC API uses the data synced by the PPeC Agent to get a near-real-time view of resource allocation across all bare-metal hosts, which enables it to make accurate provisioning decisions and improves the accuracy of VM operations. Before performing an activity, the PPeC Agent refreshes its own view of the state to confirm that the selected host is still the right candidate.

2. PPeC Proxy

The PPeC Proxy functions as a high-throughput entry point, designed to enforce security, validate input, and facilitate seamless communication between various components of the PPeC system. Its role goes beyond simply forwarding requests—it ensures that each request adheres to strict access control, data integrity, and communication protocols, creating a robust foundation for VM lifecycle operations.
Below are its key responsibilities:

Client Authentication and Ringfencing
The PPeC Proxy verifies the authenticity of clients attempting to initiate operations. It does so using certificates and a security mechanism known as ringfencing, which restricts access to a predefined set of IP addresses.
By enforcing such IP restrictions, the PPeC Proxy not only minimizes exposure to untrusted traffic but also simplifies the auditing process, enabling traceability and enhancing security oversight.
In effect, only trusted clients within the allowed subnet can initiate interactions with the Proxy, reducing the attack surface and potential security risks.
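
The IP side of ringfencing amounts to an allowlist check. Below is a minimal sketch using Python's standard `ipaddress` module; the networks in `ALLOWED_NETWORKS` and the function name `is_ringfenced` are placeholders invented for illustration, and the certificate check that accompanies it is not shown.

```python
import ipaddress
from typing import Sequence, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical allowlist; real values would come from configuration.
ALLOWED_NETWORKS: Sequence[Network] = (
    ipaddress.ip_network("10.20.0.0/24"),
    ipaddress.ip_network("10.30.5.17/32"),
)


def is_ringfenced(client_ip: str, allowed: Sequence[Network] = ALLOWED_NETWORKS) -> bool:
    """Return True only if client_ip parses and falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Unparseable addresses are rejected rather than raising to the caller.
        return False
    return any(addr in net for net in allowed)
```

Failing closed on malformed input keeps the check from becoming a bypass when the peer address is missing or garbled.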

Request Validation and Data Type Enforcement
Before any request is forwarded to the PPeC Agent, the PPeC Proxy validates the request parameters and data types.
It ensures that each request conforms to the service contract established with the PPeC API, maintaining a stable and predictable interface for the PPeC API. This validation step safeguards the system against malformed or erroneous requests, preserving system integrity and consistency in data handling.
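
Contract enforcement of this kind can be sketched as strict field and type checking before anything is forwarded. The request shape here (`name`, `vcpus`, `memory_mb`) is hypothetical; the post does not describe the real service contract.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class CreateVMRequest:
    # Hypothetical request shape, for illustration only.
    name: str
    vcpus: int
    memory_mb: int


SCHEMA = {"name": str, "vcpus": int, "memory_mb": int}


def validate_create_vm(payload: Dict[str, Any]) -> CreateVMRequest:
    """Reject unknown or missing fields and wrong types before forwarding."""
    unknown = set(payload) - set(SCHEMA)
    missing = set(SCHEMA) - set(payload)
    if unknown or missing:
        raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    for field, expected in SCHEMA.items():
        value = payload[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    if payload["vcpus"] <= 0 or payload["memory_mb"] <= 0:
        raise ValueError("vcpus and memory_mb must be positive")
    return CreateVMRequest(**payload)
```

Returning a typed object rather than the raw dictionary means later code only ever handles requests that have already passed the contract.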

Communication via Unix Domain Socket (UDS)
The PPeC Proxy communicates with the PPeC Agent through a Unix Domain Socket (UDS), a mechanism optimized for fast and efficient communication between processes on the same host.
UDS provides several advantages over traditional network-based communication, including reduced latency and lower overhead.
Additionally, its use enhances security by limiting access to privileged processes, such as the Proxy itself and root-level services.
This ensures that only authorized components can interact with the PPeC Agent, further reinforcing the security model.
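
To make the mechanism concrete, here is a minimal, self-contained sketch of two endpoints talking over a Unix Domain Socket, with file permissions restricting who may connect. The socket path, the `0o600` mode, and the echo-style reply are assumptions for the demo, not the actual PPeC protocol.

```python
import os
import socket
import tempfile
import threading


def serve_once(sock_path: str, ready: threading.Event) -> None:
    """Stand-in for the agent: accept one connection and acknowledge it."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(sock_path)
        # Only the owning user may connect; a production setup might use
        # group ownership instead (assumption).
        os.chmod(sock_path, 0o600)
        server.listen(1)
        ready.set()
        conn, _ = server.accept()
        with conn:
            # A single recv is enough for this tiny demo message; real code
            # needs message framing because stream sockets have no boundaries.
            request = conn.recv(4096)
            conn.sendall(b"ack:" + request)


def send_request(sock_path: str, payload: bytes) -> bytes:
    """Stand-in for the proxy: connect to the socket file and send bytes."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(payload)
        return client.recv(4096)


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent.sock")
        ready = threading.Event()
        server = threading.Thread(target=serve_once, args=(path, ready))
        server.start()
        ready.wait(timeout=5)
        reply = send_request(path, b"create-vm")
        server.join(timeout=5)
        return reply
```

Because the endpoint is a file on the local filesystem rather than a TCP port, ordinary ownership and mode bits decide which processes can reach it.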

Payload Parsing and Response Transformation
One of the critical roles of the PPeC Proxy is to handle the conversion of communication formats between the external client and the internal agent. Incoming HTTP requests are parsed and converted into raw byte streams, which are then transmitted to the PPeC Agent over the UDS.
Upon receiving a response from the agent, the Proxy translates the raw byte data back into an HTTP response, which is then sent to the PPeC API.
This dual-stage parsing and formatting process ensures that both the client-facing and backend components can operate using their native protocols while maintaining a seamless data flow.
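
One way to picture the two-stage translation is a small framing layer: the body of an HTTP request is serialized into a length-prefixed byte frame for the socket, and the agent's reply frame is mapped back to an HTTP status and body. The post only says "raw byte streams", so the 4-byte length prefix, the JSON encoding, and the `ok` field are all assumptions of this sketch.

```python
import json
import struct
from typing import Any, Dict, Tuple

# Assumed wire format: 4-byte big-endian length, then a UTF-8 JSON body.
HEADER = struct.Struct("!I")


def encode_frame(message: Dict[str, Any]) -> bytes:
    """Serialize a parsed request (or reply) into a length-prefixed frame."""
    body = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return HEADER.pack(len(body)) + body


def decode_frame(frame: bytes) -> Dict[str, Any]:
    """Parse one complete frame back into a dictionary."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    (length,) = HEADER.unpack_from(frame)
    body = frame[HEADER.size:HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return json.loads(body.decode("utf-8"))


def to_http_response(frame: bytes) -> Tuple[int, str]:
    """Map an agent reply frame to an (HTTP status, JSON body) pair."""
    reply = decode_frame(frame)
    status = 200 if reply.get("ok") else 500
    return status, json.dumps(reply)
```

An explicit length prefix is one common way to recover message boundaries on a stream socket, which otherwise delivers bytes without any notion of where one message ends.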

By integrating these capabilities, the PPeC Proxy acts as a secure and efficient intermediary, balancing throughput, security, and data integrity across the entire VM provisioning and management lifecycle.

3. PPeC Agent

The PPeC Agent plays a critical role as the execution engine responsible for orchestrating a wide range of operations on bare-metal servers. From provisioning virtual machines (VMs) to optimizing system performance, the PPeC Agent ensures seamless automation and resource management, enabling efficient and predictable performance across the infrastructure. Below is a deeper dive into its core responsibilities and operational nuances:

VM Management: Lifecycle Operations and Resource Optimization
One of the primary tasks of the PPeC Agent is managing the lifecycle of VMs on bare-metal systems. This includes automating deployment, configuration, and ongoing resource adjustments to maintain system balance and avoid bottlenecks. Through the libvirt APIs, the PPeC Agent automates all critical VM actions, including creation, modification, and teardown.

Before initiating any operation, the PPeC Agent assesses resource availability and manages concurrency using its built-in resource optimization logic.
This validation step ensures that requested actions are feasible, enhances success rates, and promptly identifies any constraints that could hinder execution. Key tasks performed during VM management include:
- Deployment of VMs with specific operating systems, CPU cores, and memory configurations, ensuring that the overall resource pool remains balanced.
- Automation of essential tasks, such as handling cloud-init configurations, tuning NUMA (Non-Uniform Memory Access), and managing dynamic storage attachment or detachment.
- The PPeC Agent uses an in-memory lockfile to manage concurrency, tracking in-progress VM operations. The lockfile is split into two sections:
  - General Resource Allocation: Tracks aggregate CPU, memory, and disk usage across the system.
  - Detailed Resource Placement: Maintains a fine-grained view of allocated resources.

This separation reduces contention in the critical section, enabling the system to process concurrent requests more efficiently without sacrificing accuracy.
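
The idea of splitting bookkeeping into a coarse section and a fine-grained section can be sketched with two independent locks. The `ResourceLedger` class and its methods are hypothetical names; the actual structure of the Agent's lockfile is not described in the post.

```python
import threading
from typing import Dict, Iterable, List


class ResourceLedger:
    """Illustrative two-section ledger, each section guarded by its own lock."""

    def __init__(self, total_cores: int, total_memory_gb: int) -> None:
        self._aggregate_lock = threading.Lock()
        self._placement_lock = threading.Lock()
        # Section 1: aggregate counters, cheap to check and update.
        self._free = {"cores": total_cores, "memory_gb": total_memory_gb}
        # Section 2: fine-grained placement (VM name -> pinned core ids).
        self._placements: Dict[str, List[int]] = {}

    def reserve(self, cores: int, memory_gb: int) -> bool:
        """Atomically claim aggregate capacity; the critical section is short."""
        with self._aggregate_lock:
            if self._free["cores"] < cores or self._free["memory_gb"] < memory_gb:
                return False
            self._free["cores"] -= cores
            self._free["memory_gb"] -= memory_gb
            return True

    def record_placement(self, vm: str, core_ids: Iterable[int]) -> None:
        """Store detailed placement without blocking aggregate reservations."""
        with self._placement_lock:
            self._placements[vm] = list(core_ids)

    def snapshot(self) -> Dict[str, int]:
        """Return a copy of the aggregate counters."""
        with self._aggregate_lock:
            return dict(self._free)
```

Because a reservation only touches the aggregate counters, a request that cannot fit is rejected quickly, and slower placement bookkeeping never holds up other admissions.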

Dynamic Disk Attachment and Management
The PPeC Agent facilitates on-the-fly disk attachment and detachment for VMs, supporting two primary types of disks:
- PCI Passthrough NVMe Disks
  These provide direct access to physical NVMe drives, delivering near-native performance ideal for I/O-intensive workloads. The PPeC Agent employs a memory buffer mechanism to manage temporary memory ballooning during NVMe disk attachments. This approach ensures that large VMs stabilize before user interactions, maintaining overall system reliability and minimizing performance disruptions.
- QEMU-Based Storage Disks
  For more flexible storage options, the PPeC Agent leverages QEMU to create virtual storage images. The Agent optimizes disk preallocation to reduce the time needed for attaching large storage volumes, ensuring that storage operations are predictable and efficient.
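
As an illustration of how a preallocation choice might be expressed when creating a QEMU image, the sketch below builds (but does not run) a `qemu-img create` command line. The qcow2 format, the default `metadata` mode, and the helper name are assumptions; the post does not say which image format or preallocation mode the Agent uses.

```python
from typing import List

# Preallocation modes accepted by qemu-img for qcow2 images.
VALID_PREALLOCATION = {"off", "metadata", "falloc", "full"}


def qemu_img_create_argv(path: str, size_gib: int, preallocation: str = "metadata") -> List[str]:
    """Build an argv list for `qemu-img create`; the caller decides whether to run it."""
    if preallocation not in VALID_PREALLOCATION:
        raise ValueError(f"unsupported preallocation mode: {preallocation}")
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return [
        "qemu-img", "create",
        "-f", "qcow2",
        "-o", f"preallocation={preallocation}",
        path,
        f"{size_gib}G",  # qemu-img reads the G suffix as GiB
    ]
```

The trade-off is roughly that `full` writes out the whole image up front and is slowest to create, while `metadata` and `falloc` reserve structure or space without writing every block, which is why the mode matters for large volumes.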

On-Demand NUMA Tuning: Optimizing Memory Access for VMs
NUMA is an architecture where processors have local memory, enabling faster memory access compared to accessing remote memory from other processors. Proper NUMA alignment can significantly reduce latency and avoid cross-node communication bottlenecks.
During VM provisioning, the PPeC Agent carefully analyzes the NUMA topology and leverages libvirt’s NUMA placement capabilities to fine-tune resource allocation. This ensures that VMs are assigned memory and CPU resources that are physically close, thereby minimizing memory access delays. Key highlights include:
- Dynamic vCPU Pinning: The Agent dynamically pins virtual CPUs (vCPUs) to optimize NUMA performance and ensure efficient hardware utilization.
- Resource Allocation Strategy: The Agent follows a greedy algorithm that uses the following resource allocation rules:
  - No VM can have more vCPUs than the highest core count per socket.
  - VMs with an odd number of vCPUs are not allowed.
  - The total number of vCPUs allocated to all VMs must not exceed the total CPUs available on the baremetal, minus a reserved buffer.

By sorting VMs in descending order of core count and pinning them to available sockets, the algorithm ensures optimal utilization of CPU resources while avoiding memory latency issues.
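
The rules and the sorting step above can be sketched as a first-fit-decreasing placement. This is an illustrative reading of the description, not the Agent's code: the function name, the uniform `cores_per_socket` assumption, and returning a socket index (rather than individual pinned core ids) are simplifications.

```python
from typing import Dict, List


def place_vms(
    vms: Dict[str, int],
    sockets: int,
    cores_per_socket: int,
    reserved_cores: int,
) -> Dict[str, int]:
    """Map each VM name to a socket index, largest VMs first."""
    for name, vcpus in vms.items():
        # Rule: odd vCPU counts are not allowed.
        if vcpus <= 0 or vcpus % 2 != 0:
            raise ValueError(f"{name}: vCPU count must be a positive even number")
        # Rule: a VM may not exceed the core count of a socket.
        if vcpus > cores_per_socket:
            raise ValueError(f"{name}: {vcpus} vCPUs exceed cores per socket")
    # Rule: total vCPUs must fit within host CPUs minus the reserved buffer.
    budget = sockets * cores_per_socket - reserved_cores
    if sum(vms.values()) > budget:
        raise ValueError("total vCPUs exceed host capacity minus reserved buffer")

    free: List[int] = [cores_per_socket] * sockets
    placement: Dict[str, int] = {}
    # Descending core count; ties broken by name so the result is deterministic.
    for name, vcpus in sorted(vms.items(), key=lambda item: (-item[1], item[0])):
        for socket_id in range(sockets):
            if free[socket_id] >= vcpus:
                free[socket_id] -= vcpus
                placement[name] = socket_id
                break
        else:
            # Total capacity was sufficient but fragmented across sockets.
            raise ValueError(f"{name}: no single socket has {vcpus} free cores")
    return placement
```

For example, with two 12-core sockets and a 2-core buffer, VMs of 8, 6, 4, and 2 vCPUs land on sockets 0, 1, 0, and 1 respectively, so no VM straddles a socket boundary.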

Periodic Health Checks and Baremetal Resource Tracking
The PPeC Agent periodically collects health metrics from the bare-metal system and syncs this data with the PPeC API. These health checks include:
- CPU and Memory Layout: Capturing in-depth information about the current state of CPU and memory usage.
- VM Details: Reporting on all active VMs, including their configurations and resource utilization.
- Hardware and Disk Layout: Providing detailed insights into the hardware components and disk allocation.

This continuous monitoring allows the PPeC API to maintain an up-to-date view of the bare-metal infrastructure, enabling more informed decisions for resource allocation and candidate selection.
Additionally, the PPeC Agent tracks storage and compute resource allocation in real time.
By monitoring VM status and detecting discrepancies, it can issue alerts, helping administrators address potential issues before they escalate.

Operational Improvements and Enhanced Efficiency

By refining the interaction model between the PPeC API and the PPeC Agent, we have significantly streamlined operational processes. This redesign leverages insights gained from previous iterations, enabling a more efficient, reliable, and scalable system. The simplified interaction framework has not only reduced the complexity of communications but also minimized the manual overhead required for routine tasks, thereby substantially lowering our operational load.

A key outcome of these improvements is an almost 100% VM provisioning success rate, reflecting the robustness and accuracy of the updated workflows. The enhanced automation and validation mechanisms within the PPeC Agent ensure that VM creation requests are handled efficiently, with minimal risk of failure.

Additionally, the turnaround time for troubleshooting and resolving VM creation failures has seen significant reductions. This improvement is attributed to the introduction of advanced debugging tools accessible via the command-line interface (CLI). These tools allow engineers to quickly diagnose and address issues, enhancing overall system reliability and reducing downtime.

To maintain compliance and consistency across the infrastructure, the PPeC Agent performs daily reconciliation with the PPeC API. During this process, it cross-verifies the state of all VMs, identifying any instances created outside the approved workflows. When discrepancies are detected, the system automatically alerts relevant stakeholders, ensuring timely intervention and alignment with established governance protocols. This proactive monitoring mechanism not only reinforces security and compliance but also fosters operational consistency across the entire environment.
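
At its core, a reconciliation pass like this is a comparison of two sets of VM identifiers. The sketch below assumes each side can be reduced to a set of VM names; the function name and the two report categories are illustrative, not part of PPeC.

```python
from typing import Dict, List, Set


def reconcile(api_vms: Set[str], host_vms: Set[str]) -> Dict[str, List[str]]:
    """Compare the API's record of VMs with what is actually on the host.

    - "unapproved": present on the host but unknown to the API, for example
      a VM created outside the approved workflow.
    - "missing": recorded by the API but not found on the host.
    """
    return {
        "unapproved": sorted(host_vms - api_vms),
        "missing": sorted(api_vms - host_vms),
    }
```

Either non-empty list would be a natural trigger for an alert; sorting the names keeps successive reports stable and easy to diff.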
