Lisa Leung

Ongoing - API Migration for Robot Fleet Management

Managing a high-stakes initiative: mitigating privacy and maintenance risks, aligning stakeholders with the customer experience, and sequencing a reliable rollout

This project is in development and does not represent my employer.

To test my systems design and technical skills, I strategized a migration of legacy, on-premises services to be containerized in a multi-region cloud environment with 99.999% service availability.

At the time, research on robot fleet management showed that legacy APIs at the enterprise level struggled with:

  1. Resolutions to critical incidents. Immediate resolution was dependent on individuals, instead of systems. Long-term resolution would lead to lengthy re-designs and cripple product roadmaps.

  2. Thorough system testing by service. The use of older languages and paradigms resulted in a monolithic application design, resulting in difficulty with changes to the business logic and points.

  3. Service reliability and availability. Due to many services still remaining on-premises, any server outages would result in slow or failed API calls, with catastrophic consequences on the perceived reliability of the service.

  4. Latency. With several health checks and calls across multiple regions, a single API call would have a P50 of 1 second, compared to the industry standard of 200ms or less (in other words, there should be at least a 50% probability that an API call will require 1 second or more).

  5. Data integrity. Given the historical cost of data storage and messaging services, many services connected directly to the secondary sources of truth.

Problem

How might we increase service availability and reliability for critical customers who rely on the service as a source of truth?

(Anticipated) Outcome

  • Intermediate optimizations to the on-premises service: health checks, rules for failover, refactoring.

  • Long-term approach to meet customer needs: containerize the service and migrate to an active-active, multi-region cloud architecture; utilizing original sources of truth.

  • Realignment with all related applications, business stakeholders, and producers and consumers of the data for end-to-end experience.

AI Notice: this project utilized Claude for high-level architecture visualizations and API specifications.

Managing complexity

Stakeholder Management

Managing views from the perspective of different stakeholders to improve speed of decision-making and provide clarity on the impact of project risks.

Vision Alignment

Crafting project artifacts to communicating key decisions impacting customers and guiding future work.

Validating Architecture with Sources of Truth

Iterating on architecture based on findings at the code level; diving into technical documentation and schemas.

Technology Change Management

Understanding transferrable use cases for a new technology and building structures for long-term support.

Designing for future security and reliability

At a high level, the main challenges for the migration from on-prem to cloud would be:

  • Managing high throughput, particularly from the robot fleet. An estimated 1,440 health checks (1 per minute) would need to be made per minute per robot; one given robot may require over 10,000 calls to the core layer (business logic for fleet management).

  • Data streaming and latency, given the time required to retrieve and share geographical and sensor data.

Approximation of a current-state on-premises architecture.

High-level cloud architecture (multi-region, active-active).

Rough API specifications for IAM, basic health checks, and event logging.

Aligning application and engineering stakeholders around feasibility and customer outcomes

Desirability, Feasibility, Viability: based on this IDEO framework, I assessed all possible solutions.

  • Desirability: determine end-user and customer outcomes, refer to service-level agreements (SLAs) as needed

  • Feasibility: remove obstacles for robust implementation and set realistic expectations based on system dependencies

  • Viability: achieve organizational goals, align with financial benchmarks, and account for opportunity costs

Based on the organization's principles (customer trust and reliability), I focused the team's efforts on the following outcomes:

  • Increasing service reliability to address the impact of server outages (current state)

  • Increasing service availability to empower customers with 24/7 access

  • Improving application infrastructure for faster troubleshooting and testing

From comparable work, I generated the following OKR.

Objective

Successfully implement a cloud architecture for a legacy on-premises system

Key Results

  1. Reduce service downtime with up-to-date load balancers and 5 new metrics used in health monitoring systems

  2. Achieve 99.99% availability by reducing application startup times from over 30 seconds to less than 5 seconds

  3. Overhaul technical debt by converting the application to the newest (stable) version of its programming language and tools

  4. Reference all source systems of data

To manage the complex system and stakeholder needs, I created communication matrices, risk logs, presentations, technical documentation and specific communications tailored to applications, database, networking/infrastructure teams.

The table below illustrates my perspective on the various stakeholder groups and priorities that needed to be addressed.

Stakeholder

Priorities

Needs

C-Suite

Achievement of organizational goals

  • Risk management (org-level)

Leadership
(Director level and above)

Achievement of department-wide goals and ability to meet customer need

  • Expected use of organizational resources (time, cash, personnel)

  • Risk management (department/org-level)

Application Management (Self)

Achievement of application goals and ability to meet high-priority asks

  • Delivery of architecture and end-to-end functionality

  • Relationship management

  • Risk management (project/application-level)

IT Operations and Security Management

Managing high-level application operations and risks to the department

  • Impacts to application status

  • Exposure of PII/PCI or other proprietary data

  • Risk management (legal, data, compliance)

Architecture and Data Management

Ensuring the reliability and integrity of the organization's architecture

  • Impacts to architecture design and enterprise data usage

Technical Procurement and Contract Management

Managing relationships with vendors and risks to the organization's business strategy

  • Impacts to current/new relationships with vendors

Finance

Managing financial risk to the organization

  • Impacts to resource usage and cashflow

Networking and Infrastructure Management

Ensuring the security and robustness of the organization's systems

  • Exposure to any and all third-parties

  • Impacts to architecture design

Upstream and Downstream Applications

Achievement of application goals and ability to meet high-priority asks

  • Impacts to application work and operations

  • Risk management (project/application-level)

Service and API Users

Ability to utilize services/APIs to achieve goals

  • Impacts to Service Level Agreements (SLAs)

  • End user and (customer) experience

Crafting systems to support cross-functional development work

I drove collaboration and led decisions on architecture, migration planning through the following activities:

  • I crafted visualizations to communicate deep dives into the architecture

  • I evaluated data fields and business logic at the code level

  • I documented the current and future state schemas

  • I facilitated conversations across networking, IP, database, CI/CD, and application stakeholders

Updates to come.

Readings that guided my approach

๐Ÿ“ฃ Cold emails welcome!

Iโ€™m always happy to have a coffee chat โ˜•, answer burning questions ๐Ÿ”ฅ, or learn something new ๐Ÿ“–.

๐Ÿ“ฃ Cold emails welcome!

Iโ€™m always happy to have a coffee chat โ˜•, answer burning questions ๐Ÿ”ฅ, or learn something new ๐Ÿ“–.