The Amazon payments service was the API gateway that enabled internal clients to use full-fledged payment-related features without worrying about the underlying instrumentation, such as integration with external third-party partners. The payments service redesign followed microservice design patterns, and payment features were implemented by orchestrating components and services that could be owned and developed by downstream teams. Payment features, including charge, disbursement, and refund, were exposed to clients as RESTful resources, which clients invoked through the API after authentication.
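As an illustration only, a payment feature exposed as a RESTful resource might look like the following JAX-RS sketch; the path, payload, and class names are assumptions for this example, not the actual service API.

```java
// Hypothetical JAX-RS sketch of a payment feature exposed as a RESTful
// resource; path, payload, and class names are illustrative assumptions.
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/payments/charge")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class ChargeResource {

    /** Simplified request payload for illustration only. */
    public static class ChargeRequest {
        public String accountId;
        public long amountCents;
        public String currency;
    }

    @POST
    public Response charge(ChargeRequest request) {
        // The caller has already been authenticated at the gateway layer;
        // the resource delegates to the charge orchestration behind it.
        return Response.ok("{\"status\":\"ACCEPTED\"}").build();
    }
}
```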
Problem Statement
As more and more teams onboarded to the service, a large amount of orchestration logic was added to handle client-specific requirements. The service gradually turned into a monolith in which component boundaries became blurred, and it bore the risk of becoming a single point of failure.
Design
Orchestration as a Service
Since payment features were essentially orchestrations of components and services, they could be refactored into an independent library and hosted separately. The payments service then only needed to invoke the new service to fulfill a specific feature, turning it into a lightweight proxy that could focus on other aspects of the system, such as authentication, throttling, monitoring, and logging.
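A minimal sketch of the proxy idea, assuming a hypothetical orchestration service HTTP endpoint and hypothetical class names (none of these come from the original system):

```java
// Minimal sketch of the payments service acting as a lightweight proxy.
// The endpoint, JSON payload, and class names are illustrative assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChargeProxy {

    private final HttpClient client = HttpClient.newHttpClient();
    private final URI orchestrationEndpoint;

    public ChargeProxy(URI orchestrationEndpoint) {
        this.orchestrationEndpoint = orchestrationEndpoint;
    }

    public String charge(String authToken, String chargeRequestJson) throws Exception {
        // Cross-cutting concerns (authentication, throttling, logging) stay in
        // the proxy; the feature logic lives in the orchestration service.
        authenticate(authToken);

        HttpRequest request = HttpRequest.newBuilder(orchestrationEndpoint)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chargeRequestJson))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    private void authenticate(String authToken) {
        // Placeholder: real authentication and throttling would happen here.
        if (authToken == null || authToken.isEmpty()) {
            throw new IllegalArgumentException("missing auth token");
        }
    }
}
```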
Common Functionality
Non-functional components such as monitoring and logging were still managed by the original payments service but were redesigned using the sidecar design pattern. This promoted separation of concerns by offloading non-core functionality from the main application, allowing it to focus on its primary responsibilities while delegating secondary tasks to the sidecar component.
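A minimal sketch of how the main application might hand metrics to a co-located sidecar agent over localhost; the port, line format, and class names are assumptions for this example:

```java
// Sketch of the sidecar idea: the main application hands metric events to a
// co-located agent over localhost and does no aggregation or shipping itself.
// The port and wire format are illustrative assumptions.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class SidecarMetricsEmitter implements AutoCloseable {

    private static final int SIDECAR_PORT = 8125; // hypothetical agent port
    private final DatagramSocket socket;
    private final InetAddress localhost;

    public SidecarMetricsEmitter() throws Exception {
        this.socket = new DatagramSocket();
        this.localhost = InetAddress.getLoopbackAddress();
    }

    /** Fire-and-forget: the sidecar handles batching, retries, and shipping. */
    public void count(String metricName, long value) {
        byte[] line = (metricName + ":" + value + "|c").getBytes(StandardCharsets.UTF_8);
        try {
            socket.send(new DatagramPacket(line, line.length, localhost, SIDECAR_PORT));
        } catch (Exception e) {
            // Metrics are best-effort; never let them impact the main flow.
        }
    }

    @Override
    public void close() {
        socket.close();
    }
}
```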
Integration with Orchestration Services
A feature-specific orchestration service might expose interfaces different from those of the original payments service. The adapter design pattern was applied to bridge the gap between the target interface and the adaptee, i.e., the orchestration service.
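A minimal sketch of the adapter pattern as described here, with hypothetical interface and class names standing in for the real ones:

```java
// Sketch of the adapter pattern: the client-facing (target) interface stays
// unchanged while the adapter translates calls to the orchestration service.
// All interface and class names are illustrative assumptions.
public class AdapterExample {

    /** Target interface that payments service clients already depend on. */
    interface ChargeOperation {
        String charge(String accountId, long amountCents);
    }

    /** Adaptee: the new orchestration service with a different interface. */
    static class ChargeOrchestrationClient {
        String executeChargeWorkflow(String chargeRequestJson) {
            return "{\"status\":\"SUCCESS\"}"; // placeholder response
        }
    }

    /** Adapter: bridges the target interface to the adaptee. */
    static class ChargeOrchestrationAdapter implements ChargeOperation {
        private final ChargeOrchestrationClient orchestration;

        ChargeOrchestrationAdapter(ChargeOrchestrationClient orchestration) {
            this.orchestration = orchestration;
        }

        @Override
        public String charge(String accountId, long amountCents) {
            // Translate the original call into the orchestration service's format.
            String request = String.format(
                    "{\"account\":\"%s\",\"amount\":%d}", accountId, amountCents);
            return orchestration.executeChargeWorkflow(request);
        }
    }

    public static void main(String[] args) {
        ChargeOperation charge =
                new ChargeOrchestrationAdapter(new ChargeOrchestrationClient());
        System.out.println(charge.charge("acct-123", 2500));
    }
}
```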
Migration
To migrate traffic from the payments service to the orchestration services, dynamic configuration built on top of a remote distributed cache was leveraged. The percentage of migrated traffic was slowly increased as long as no issues were discovered. After all traffic was migrated, the payments service became a lightweight microservice, and its scalability, reliability, and robustness improved greatly.
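A minimal sketch of percentage-based traffic routing driven by dynamic configuration; the configuration supplier stands in for the remote distributed cache, and all names are assumptions for this example:

```java
// Sketch of percentage-based traffic migration. The configuration supplier is
// assumed to be refreshed from a remote distributed cache; here it is a
// placeholder. All names are illustrative assumptions.
import java.util.function.Supplier;

public class MigrationRouter {

    private final Supplier<Integer> migratedPercentSupplier; // 0..100, from dynamic config

    public MigrationRouter(Supplier<Integer> migratedPercentSupplier) {
        this.migratedPercentSupplier = migratedPercentSupplier;
    }

    /** Returns true if this request should go to the new orchestration service. */
    public boolean routeToOrchestration(String requestId) {
        int migratedPercent = migratedPercentSupplier.get();
        // Stable bucketing: the same request id always lands in the same bucket,
        // so raising the percentage only moves more traffic to the new path.
        int bucket = Math.floorMod(requestId.hashCode(), 100);
        return bucket < migratedPercent;
    }

    public static void main(String[] args) {
        // Start at 5%; operators raise this value in the dynamic configuration
        // as long as no issues are discovered.
        MigrationRouter router = new MigrationRouter(() -> 5);
        System.out.println(router.routeToOrchestration("payment-42"));
    }
}
```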
Hardware Utilization Monitoring Framework
As systems become more complicated after additional non-functional components such as logging, metrics, and profiling are added, the availability of the main application components can be impacted. For example, inefficient serialization in logging can increase CPU utilization and degrade the performance of major functional components. While a system can be optimized as it is implemented, such problems are usually hard to detect in local testing due to insufficient traffic load and a lack of testing scenarios.
Dynamic Switch
To handle unexpected increases in hardware utilization, a framework was designed and implemented to turn off non-functional components that consumed a lot of resources. After the system recovered, the framework turned the non-functional components back on and logged records in backend storage for further investigation. The framework was configured as a scheduled job in the system: it ran periodically to monitor hardware utilization and ensure the system stayed healthy, up, and running.
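A minimal sketch of such a scheduled dynamic-switch job; thresholds, period, and names are assumptions for this example, not the actual framework:

```java
// Sketch of the dynamic-switch job: periodically check hardware utilization,
// disable resource-heavy non-functional components above a threshold, and
// re-enable them (recording the event) once the system recovers.
// Thresholds, period, and component names are illustrative assumptions.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

public class DynamicSwitchJob {

    private static final double CPU_HIGH_WATERMARK = 0.85;
    private static final double CPU_RECOVERY_WATERMARK = 0.60;

    private final DoubleSupplier cpuUtilization; // e.g., read from telemetry data
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile boolean nonFunctionalEnabled = true;

    public DynamicSwitchJob(DoubleSupplier cpuUtilization) {
        this.cpuUtilization = cpuUtilization;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(this::checkAndToggle, 0, 30, TimeUnit.SECONDS);
    }

    private void checkAndToggle() {
        double cpu = cpuUtilization.getAsDouble();
        if (nonFunctionalEnabled && cpu > CPU_HIGH_WATERMARK) {
            nonFunctionalEnabled = false;
            record("Disabled non-functional components, cpu=" + cpu);
        } else if (!nonFunctionalEnabled && cpu < CPU_RECOVERY_WATERMARK) {
            nonFunctionalEnabled = true;
            record("Re-enabled non-functional components, cpu=" + cpu);
        }
    }

    /** Whether logging/metrics/profiling hooks should currently run. */
    public boolean isNonFunctionalEnabled() {
        return nonFunctionalEnabled;
    }

    private void record(String event) {
        // Placeholder: the real framework persisted these records to backend
        // storage for later investigation.
        System.out.println(event);
    }
}
```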
Telemetry
Hardware utilization data was collected from the host using a Linux telemetry library. The telemetry daemon ran as an independent process and periodically logged and updated the utilization data on disk.
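The original text does not name the telemetry library, so the sketch below simply derives CPU utilization by sampling /proc/stat twice, as one illustrative way such data can be obtained on Linux:

```java
// Sketch of deriving per-host CPU utilization on Linux by sampling /proc/stat
// twice; this stands in for the actual telemetry library, whose name and
// interface are not given in the original text.
import java.nio.file.Files;
import java.nio.file.Paths;

public class CpuUtilizationSampler {

    /** Returns CPU utilization in [0, 1] measured over the given interval. */
    public static double sample(long intervalMillis) throws Exception {
        long[] first = readCpuCounters();
        Thread.sleep(intervalMillis);
        long[] second = readCpuCounters();

        long totalDelta = second[0] - first[0];
        long idleDelta = second[1] - first[1];
        return totalDelta == 0 ? 0.0 : 1.0 - (double) idleDelta / totalDelta;
    }

    /** Reads the aggregate "cpu" line of /proc/stat: returns {total, idle}. */
    private static long[] readCpuCounters() throws Exception {
        String cpuLine = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        String[] fields = cpuLine.trim().split("\\s+");
        long total = 0;
        for (int i = 1; i < fields.length; i++) {
            total += Long.parseLong(fields[i]);
        }
        long idle = Long.parseLong(fields[4]); // user, nice, system, then idle
        return new long[] {total, idle};
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("CPU utilization: %.1f%%%n", sample(1000) * 100);
    }
}
```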