Ongoing - API Migration for Robot Fleet Management
Managing a high-stakes initiative: mitigating privacy and maintenance risks, aligning stakeholders with the customer experience, and sequencing a reliable rollout
To put my systems design and technical knowledge to the test, I planned a migration of legacy, on-premises services to containers in a multi-region cloud environment, targeting 99.99% service availability.
At the time, research on robot fleet management showed that legacy APIs at the enterprise level struggled with:
Resolutions to critical incidents. Immediate resolution was dependent on individuals, instead of systems. Long-term resolution would lead to lengthy re-designs and cripple service roadmaps.
Thorough system testing by service. The use of older languages and paradigms resulted in a monolithic application design, which made changes to the business logic and endpoints difficult.
Service reliability and availability. Because many services remained on-premises, any server outage would result in slow or failed API calls, with catastrophic consequences for the perceived reliability of the service.
Latency. With several health checks and calls across multiple regions, a single API call would have a p50 of 1 second, compared to the industry standard of 200ms or less.
Data integrity. Given the historical cost of data storage and messaging services, many services connected directly to the secondary sources of truth.
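To make the latency point concrete, here is a minimal sketch of how a p50 could be computed from raw call timings and compared against the 200ms benchmark. The function name, the nearest-rank method, and the sample values are my own illustrative assumptions, not measurements or code from the project.

```python
import math


def latency_percentile(samples_ms, pct):
    """Return the pct-th percentile of latency samples (nearest-rank method).

    Hypothetical helper for illustration; samples_ms is a list of call
    durations in milliseconds and pct is a value in (0, 100].
    """
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up timings for a single endpoint, in milliseconds.
samples = [120, 950, 1000, 1100, 3000]
p50 = latency_percentile(samples, 50)
meets_benchmark = p50 <= 200  # 200ms benchmark cited above
```

With these made-up samples the median call takes 1000ms, so the check against the 200ms benchmark fails.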
Note: this project is in development and does not represent my employer.
Problem
How might we increase service availability and reliability for critical customers who rely on the service as a source of truth?
(Anticipated) Outcome
Intermediate optimizations to the on-premises service: health checks, rules for failover, refactoring.
Long-term approach to meet customer needs: containerize the service and migrate to an active-active, multi-region cloud architecture that uses the original sources of truth.
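As a rough illustration of what "health checks and rules for failover" can mean in practice, the sketch below picks a target to route to from a set of health-check results. The `RegionHealth` type, the `pick_region` function, the region names, and the 200ms threshold are hypothetical placeholders of mine, not the project's actual rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionHealth:
    """Result of one health check (hypothetical shape, for illustration)."""
    name: str
    healthy: bool
    p50_ms: float


def pick_region(regions, max_p50_ms=200.0) -> Optional[str]:
    """Apply a simple failover rule and return the name of the chosen region.

    Rule (an assumption, not the project's policy):
    1. Never route to a region that failed its health check.
    2. Prefer healthy regions whose p50 latency is within max_p50_ms.
    3. If none qualify, fall back to the fastest healthy region.
    4. If nothing is healthy, return None so the caller can raise an alert.
    """
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    within_target = [r for r in healthy if r.p50_ms <= max_p50_ms]
    candidates = within_target or healthy
    return min(candidates, key=lambda r: r.p50_ms).name


checks = [
    RegionHealth("on-prem", healthy=False, p50_ms=50.0),
    RegionHealth("region-a", healthy=True, p50_ms=180.0),
    RegionHealth("region-b", healthy=True, p50_ms=900.0),
]
target = pick_region(checks)
```

Encoding the rule as a function means recovery depends on the system rather than on an individual, which is the incident-resolution gap described earlier.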
Managing complexity
Stakeholder Management
Managing views from the perspective of different stakeholders to improve speed of decision-making and provide clarity on the impact of project risks.
Vision Alignment
Crafting project artifacts to communicate key decisions impacting customers and to guide future work.
Validating Architecture with Sources of Truth
Iterating on architecture based on findings at the code level; diving into technical documentation and schemas to deeply understand the current state and downstream impacts.
Technology Change Management
Understanding transferable use cases for a new technology and building structures for long-term support.
Establishing stakeholder-specific views
To manage the complex system and stakeholder needs, I created communication matrices, risk logs, presentations, technical documentation, and specific communications tailored to the application, database, and networking/infrastructure teams.
The table below illustrates my perspective on the various stakeholder groups and priorities that needed to be addressed.
| Stakeholder | Priorities | Needs | Method of Communication |
|---|---|---|---|
| CxO | Achievement of organizational goals | | |
| Leadership (Director level and above) | Achievement of department-wide goals and ability to meet customer need | | |
| Application Management (Self) | Achievement of application goals and ability to meet high-priority asks | | |
| IT Operations and Security Management | Managing high-level application operations and risks to the department | | |
| Architecture and Data Management | Ensuring the reliability and integrity of the organization's architecture | | |
| Technical Procurement and Contract Management | Managing relationships with vendors and risks to the organization's business strategy | | |
| Finance | Managing financial risk to the organization | | |
| Networking and Infrastructure Management | Ensuring the security and robustness of the organization's systems | | |
| Upstream/Downstream Applications | Achievement of application goals and ability to meet high-priority asks | | |
| Service/API Users | Ability to utilize services/APIs to achieve goals | | |
Aligning application and engineering stakeholders around feasibility and customer outcomes
Desirability, Feasibility, Viability: based on this IDEO framework, I assessed all possible solutions.
Desirability: determining end-user and customer outcomes, referring to service-level agreements (SLAs) as needed
Feasibility: removing obstacles to a robust implementation and setting realistic expectations based on system dependencies
Viability: achieving organizational goals, aligning with financial benchmarks, and accounting for opportunity costs
Based on the organization's principles (customer trust and reliability), I focused the team's efforts on the following outcomes:
Increasing service reliability to address the impact of server outages (current state)
Increasing service availability to empower customers with 24/7 access
Improving application infrastructure for faster troubleshooting and testing
From comparable work, I generated the following OKR:
Objective: Successfully implement a cloud architecture for a legacy on-premises system
Reduce service downtime with up-to-date load balancers and 5 new metrics used in health monitoring systems
Achieve 99.99% availability by reducing application startup times from over 30 seconds to less than 5 seconds
Pay down technical debt by upgrading the application to the newest stable version of its programming language and tools
Reference all source systems of data
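To show why startup time matters for the 99.99% target, here is a back-of-the-envelope sketch of the downtime budget that target implies. It rests on a simplifying assumption of mine: every restart makes the service unavailable for the full startup time, as with a single instance and no overlapping replicas. The function names are hypothetical.

```python
def downtime_budget_seconds(availability_pct, period_days=365):
    """Seconds of allowed downtime over the period for a given availability target."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_pct / 100)


def restarts_within_budget(availability_pct, startup_seconds, period_days=365):
    """How many full restarts fit in the budget, assuming each restart
    costs startup_seconds of unavailability (simplifying assumption)."""
    budget = downtime_budget_seconds(availability_pct, period_days)
    return int(budget // startup_seconds)


budget_minutes = downtime_budget_seconds(99.99) / 60    # about 52.56 minutes per year
restarts_at_30s = restarts_within_budget(99.99, 30)     # current startup time
restarts_at_5s = restarts_within_budget(99.99, 5)       # target startup time
```

Under that assumption, a 99.99% target leaves roughly 52.56 minutes of downtime per year: about 105 restarts at 30 seconds each, versus 630 at 5 seconds. An active-active deployment changes the picture, since one region can serve traffic while another restarts.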
Crafting systems to support cross-functional development work
I drove collaboration and led decisions on architecture and migration planning through the following activities:
I crafted visualizations to communicate deep dives into the architecture
I evaluated data fields and business logic at the code level
I documented the current and future state schemas
I facilitated conversations across networking, IP, database, CI/CD, and application stakeholders
Updates to come.


