Lisa Leung

Ongoing - API Migration for Robot Fleet Management

Managing a high-stakes initiative: mitigating privacy and maintenance risks, aligning stakeholders with the customer experience, and sequencing a reliable rollout

To put my systems design and technical knowledge to the test, I planned a migration of legacy, on-premises services to containers in a multi-region cloud environment, targeting 99.99% service availability.

At the time, research on robot fleet management showed that legacy APIs at the enterprise level struggled with:

  1. Resolutions to critical incidents. Immediate resolution depended on individuals rather than systems. Long-term resolution would require lengthy redesigns and cripple service roadmaps.

  2. Thorough system testing by service. The use of older languages and paradigms resulted in a monolithic application design, which made changes to the business logic and points difficult.

  3. Service reliability and availability. Because many services remained on-premises, any server outage would result in slow or failed API calls, with catastrophic consequences for the perceived reliability of the service.

  4. Latency. With several health checks and calls across multiple regions, a single API call would have a p50 of 1 second, compared to the industry standard of 200ms or less.

  5. Data integrity. Given the historical cost of data storage and messaging services, many services connected directly to the secondary sources of truth.
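To make the latency figure in item 4 concrete, here is a minimal sketch that computes a p50 (median) from per-call latencies and compares it with the 200 ms benchmark. The sample latencies and the `percentile` helper are hypothetical and for illustration only; they are not measurements from the project.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies (ms) for one endpoint; not real project data.
legacy_calls_ms = [850, 920, 1000, 1100, 1300, 2400]
TARGET_P50_MS = 200  # benchmark cited in the text

p50 = percentile(legacy_calls_ms, 50)
meets_target = p50 <= TARGET_P50_MS
print(f"p50 = {p50} ms, meets target: {meets_target}")
```

The nearest-rank method is only one of several percentile definitions. Monitoring tools often interpolate between samples, so reported values can differ slightly.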

Note: this project is in development and does not represent my employer.

Problem

How might we increase service availability and reliability for critical customers who rely on the service as a source of truth?

(Anticipated) Outcome

  • Intermediate optimizations to the on-premises service: health checks, rules for failover, refactoring.

  • Long-term approach to meet customer needs: containerize the service and migrate to an active-active, multi-region cloud architecture that uses the original sources of truth.
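As one way to picture the health checks and failover rules mentioned above, the sketch below routes a request to the first healthy region in a preference order. The region names, the `RegionStatus` fields, and the threshold of three consecutive failed checks are all assumptions made for illustration, not details of the actual service.

```python
from dataclasses import dataclass

MAX_FAILED_CHECKS = 3  # assumed threshold before a region is treated as down

@dataclass
class RegionStatus:
    name: str
    consecutive_failed_checks: int = 0

    def record_check(self, ok: bool) -> None:
        # A passing check resets the counter; a failing one increments it.
        self.consecutive_failed_checks = 0 if ok else self.consecutive_failed_checks + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failed_checks < MAX_FAILED_CHECKS

def pick_region(regions):
    """Return the first healthy region in preference order, or None if all are down."""
    for region in regions:
        if region.healthy:
            return region.name
    return None

# Hypothetical two-region setup.
primary = RegionStatus("region-a")
secondary = RegionStatus("region-b")
regions = [primary, secondary]

before_outage = pick_region(regions)
for _ in range(MAX_FAILED_CHECKS):
    primary.record_check(ok=False)  # simulate repeated failed health checks
during_outage = pick_region(regions)
primary.record_check(ok=True)       # region recovers
after_recovery = pick_region(regions)
print(before_outage, during_outage, after_recovery)
```

In an active-active design both regions would serve traffic at the same time. The preference order here only keeps the example small.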

Managing complexity

Stakeholder Management

Managing views from the perspective of different stakeholders to improve speed of decision-making and provide clarity on the impact of project risks.

Vision Alignment

Crafting project artifacts to communicate key decisions impacting customers and to guide future work.

Validating Architecture with Sources of Truth

Iterating on architecture based on findings at the code level; diving into technical documentation and schemas to deeply understand the current state and downstream impacts.

Technology Change Management

Understanding transferable use cases for a new technology and building structures for long-term support.

Establishing stakeholder-specific views

To manage the complex system and stakeholder needs, I created communication matrices, risk logs, presentations, technical documentation, and specific communications tailored to the application, database, and networking/infrastructure teams.

The table below illustrates my perspective on the various stakeholder groups and priorities that needed to be addressed.

| Stakeholder | Priorities | Needs | Method of Communication |
| --- | --- | --- | --- |
| CxO | Achievement of organizational goals | Risk management (org-level) | Weekly/Monthly reports from leadership |
| Leadership (Director level and above) | Achievement of department-wide goals and ability to meet customer needs | Expected use of organizational resources (time, cash, personnel); Risk management (department/org-level) | Weekly reports; Email on major decisions and risks after escalation to Application Management; Critical incident management |
| Application Management (Self) | Achievement of application goals and ability to meet high-priority asks | Delivery of architecture and end-to-end functionality; Relationship management; Risk management (project/application-level) | Any and all potential impacts to the scope of work; Any related incidents |
| IT Operations and Security Management | Managing high-level application operations and risks to the department | Impacts to application status; Exposure of PII/PCI or other proprietary data; Risk management (legal, data, compliance) | Email on changes to application status or risk to the organization; Any related incidents |
| Architecture and Data Management | Ensuring the reliability and integrity of the organization's architecture | Impacts to architecture design and enterprise data usage | Any and all potential impacts to the planned architecture; Any and all potential impacts to the data used (source) |
| Technical Procurement and Contract Management | Managing relationships with vendors and risks to the organization's business strategy | Impacts to current/new relationships with vendors | Email on acquisitions and updates for vendors/tools |
| Finance | Managing financial risk to the organization | Impacts to resource usage and cashflow | Email with financial updates after budget approval |
| Networking and Infrastructure Management | Ensuring the security and robustness of the organization's systems | Exposure to any and all third parties; Impacts to architecture design | Any and all potential impacts to the planned architecture; Any and all potential impacts to the databases; Any related incidents |
| Upstream/Downstream Applications | Achievement of application goals and ability to meet high-priority asks | Impacts to application work and operations; Risk management (project/application-level) | Relevant impacts to the planned architecture; Relevant impacts to their scope of work; Relevant impacts to data required |
| Service/API Users | Ability to utilize services/APIs to achieve goals | Impacts to Service Level Agreements (SLAs); End-user and customer experience | Major changes, outages, and defects; Modifications to endpoint(s) |

Aligning application and engineering stakeholders around feasibility and customer outcomes

Desirability, Feasibility, Viability: based on this IDEO framework, I assessed all possible solutions.

  • Desirability: determining end-user and customer outcomes, referring to service-level agreements (SLAs) as needed

  • Feasibility: removing obstacles to a robust implementation and setting realistic expectations based on system dependencies

  • Viability: achieving organizational goals, aligning with financial benchmarks, and accounting for opportunity costs

Based on the organization's principles (customer trust and reliability), I focused the team's efforts on the following outcomes:

  • Increasing service reliability to address the impact of server outages (current state)

  • Increasing service availability to empower customers with 24/7 access

  • Improving application infrastructure for faster troubleshooting and testing

From comparable work, I generated the following OKR:

Objective: Successfully implement a cloud architecture for a legacy on-premises system

  1. Reduce service downtime with up-to-date load balancers and 5 new metrics used in health monitoring systems

  2. Achieve 99.99% availability by reducing application startup times from over 30 seconds to less than 5 seconds

  3. Overhaul technical debt by converting the application to the newest (stable) version of its programming language and tools

  4. Reference all source systems of data
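The arithmetic behind key result 2 can be sketched as follows: a 99.99% target leaves a fixed yearly downtime budget, and shorter startup times let more restarts fit inside it. This sketch assumes a 365-day year and that each restart makes the service unavailable for the full startup time. Both are simplifications, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

def downtime_budget_seconds(availability_pct: float) -> float:
    """Seconds per year the service may be unavailable at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

def restarts_within_budget(availability_pct: float, startup_seconds: float) -> int:
    """How many full restarts fit in the yearly budget (simplified model)."""
    return int(downtime_budget_seconds(availability_pct) // startup_seconds)

budget = downtime_budget_seconds(99.99)
print(f"Yearly downtime budget at 99.99%: {budget / 60:.1f} minutes")
print("Restarts at 30 s startup:", restarts_within_budget(99.99, 30))
print("Restarts at 5 s startup:", restarts_within_budget(99.99, 5))
```

Under these assumptions the budget is roughly 52.6 minutes per year, so cutting startup time from 30 seconds to 5 seconds allows about six times as many restarts before the target is missed.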

Crafting systems to support cross-functional development work

I drove collaboration and led decisions on architecture and migration planning through the following activities:

  • I crafted visualizations to communicate deep dives into the architecture

  • I evaluated data fields and business logic at the code level

  • I documented the current and future state schemas

  • I facilitated conversations across networking, IP, database, CI/CD, and application stakeholders


Updates to come.

📣 Cold emails welcome!

I’m always happy to have a coffee chat ☕, answer burning questions 🔥, or learn something new 📖.