by Kevin McGrath
Lambda has quickly become a core service in the AWS lineup. A utility knife for developers, Lambda can be as simple or complex as needed. It is the glue between AWS resources, a simple event-based function, or the workhorse for a full serverless application. However, with great power comes great responsibility. One of those responsibilities is to understand Lambda's at-least-once execution policy.
One often overlooked requirement for Lambda functions is that they must be idempotent. Every invocation has a chance of running multiple times, whether the execution succeeds or fails.
Invocations occur at least once in response to an event and functions must be idempotent to handle this.
For workloads that always produce the same artefacts in the same location, there may be no reason to worry about this requirement. However, for workloads that create new artefacts or resources on each execution, this becomes problematic. Multiple create statements or API calls with no correlation mechanism will result in an equal number of artefacts and resources upon completion. If those outputs also incur cost, not accounting for this “at least once” approach will go straight to the bottom line.
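The difference between the two kinds of workload can be sketched in a few lines of Python. This is only an illustration: the dictionaries stand in for real storage, and the function names, the `source` and `payload` event fields and the key layout are all hypothetical.

```python
import uuid


def write_fixed_location(storage, event):
    """Write to a key derived only from the event (hypothetical layout)."""
    key = f"reports/{event['source']}.json"
    storage[key] = event["payload"]  # a repeat run overwrites the same entry
    return key


def write_new_artefact(storage, event):
    """Write to a freshly generated key on every run."""
    key = f"reports/{uuid.uuid4().hex}.json"
    storage[key] = event["payload"]  # a repeat run adds another entry
    return key


# Hypothetical event, delivered twice to simulate a duplicate invocation.
event = {"source": "orders", "payload": "{}"}
fixed, fresh = {}, {}
for _ in range(2):
    write_fixed_location(fixed, event)
    write_new_artefact(fresh, event)

print(len(fixed), len(fresh))  # 1 2
```

The first function is idempotent by construction, because its output location depends only on the event. The second needs some correlation mechanism before a repeated delivery stops producing a second artefact.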
We first encountered this phenomenon when working with Spotinst to create spotinst-lambda, a project based on lambda-formation. The integration allows CloudFormation to manage Spotinst Elastigroups as it would any other AWS resource. Create, Update and Delete actions from CloudFormation invoke Lambda functions that communicate with the Spotinst API, ultimately returning status back to CloudFormation. Spotinst Elastigroups show up alongside all other AWS resources with the same look and feel as the rest of our AWS environment.
At scale, we noticed that seemingly random successful executions would create multiple Elastigroups. This would double or triple the infrastructure needed for a CloudFormation Stack. First, we exhaustively tested the code to rule out the lambda-formation stack as the source of the duplicates. Spotinst also confirmed from their side that the API was acting correctly. Finally, we isolated Lambda as the root cause. Spotinst quickly updated their API to correlate duplicate requests, which ultimately fixed the problem.
One extremely valuable resource in our debugging endeavours was the Serverless Forum Slack channel. Many helpful hands were willing to dig into architecture, logs and code as we figured out edge cases in the #debug-buffet channel.
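To make the idea of correlating duplicate requests concrete, here is a minimal sketch of a service that accepts a caller-supplied token and returns the original result when the same token arrives again. It is not the Spotinst API: `MockGroupApi`, `create_group` and `correlation_token` are hypothetical names, and the in-memory dictionaries stand in for whatever storage a real service would use.

```python
import uuid


class MockGroupApi:
    """Hypothetical API that de-duplicates create calls by token."""

    def __init__(self):
        self.groups = {}      # group_id -> spec
        self._by_token = {}   # correlation_token -> group_id

    def create_group(self, spec, correlation_token):
        # A token seen before means this is a repeat of an earlier request,
        # so return the existing group instead of creating another one.
        if correlation_token in self._by_token:
            return self._by_token[correlation_token]
        group_id = f"group-{uuid.uuid4().hex[:8]}"
        self.groups[group_id] = dict(spec)
        self._by_token[correlation_token] = group_id
        return group_id


api = MockGroupApi()
first = api.create_group({"capacity": 2}, correlation_token="request-1")
second = api.create_group({"capacity": 2}, correlation_token="request-1")
print(first == second, len(api.groups))  # True 1
```

The token has to be the same across duplicate invocations, so in practice it would be derived from something stable in the triggering event rather than generated inside the function.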
Although the solution to correlate API requests was not difficult to implement, we wanted to dig deeper into how often multiple executions were occurring. We created a test project that received CloudFormation events and sent API calls normally intended for Spotinst to AWS API Gateway. The API Gateway called mocked services to return successful results for Create and Delete actions. All invokes were then tracked with unique short IDs and atomic counters in DynamoDB.
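The tracking idea can be sketched as follows. This is an in-memory stand-in rather than our test project's code: a lock-protected counter plays the role of an atomic counter in DynamoDB, and `InvocationTracker`, `create_handler` and the `short_id` event field are hypothetical names chosen for illustration.

```python
import threading
from collections import Counter


class InvocationTracker:
    """In-memory stand-in for per-ID atomic counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def record(self, short_id):
        # Increment and read under one lock so concurrent calls never
        # lose an update, mimicking an atomic counter.
        with self._lock:
            self._counts[short_id] += 1
            return self._counts[short_id]

    def multiples(self):
        # IDs whose function ran more than once.
        with self._lock:
            return {k: v for k, v in self._counts.items() if v > 1}


tracker = InvocationTracker()


def create_handler(event, context=None):
    """Hypothetical handler that reports every invocation it sees."""
    count = tracker.record(event["short_id"])
    return {"short_id": event["short_id"], "invocation": count}


# Simulate three events, one of which is delivered twice.
for short_id in ["a1", "b2", "a1", "c3"]:
    create_handler({"short_id": short_id})

print(tracker.multiples())  # {'a1': 2}
```

Because each invocation reports itself, any ID with a count above one is a multiple execution, which is what makes the duplicates countable after the fact.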
With the mock service in place, we scripted CloudFormation to continually run five concurrent stack creations for an entire month. After a successful creation, the script would also delete the stack so that we stayed under any soft limits and kept the stack list as clean as possible.
The following is an example of a successful Lambda function running twice. Notice that the eventId is the same for both events. Without the function reporting itself to DynamoDB, the only way to differentiate the executions is by the ingestionTime and timestamp fields.
Dashboard of all create function executions, with count on the y-axis and 24-hour time intervals on the x-axis.
Over the course of the month, 491,699 create functions were executed, resulting in 112 multiple invokes. Of the 112, only three caused triple executions, and none occurred four or more times. While a 0.02% multiple execution rate is extremely low, it becomes instantly noticeable when the result is two or three times more infrastructure and spend than originally intended.
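The arithmetic behind those figures is easy to check. The snippet below assumes that the 109 multiple invokes that were not triples were all doubles, which follows from none occurring four or more times; the variable names are just for illustration.

```python
total_creates = 491_699
multiple_invokes = 112
triples = 3
doubles = multiple_invokes - triples  # remaining multiples ran exactly twice

rate = multiple_invokes / total_creates
one_in = round(total_creates / multiple_invokes)

# Each double adds one unintended execution, each triple adds two.
extra_executions = doubles * 1 + triples * 2

print(f"{rate:.4%}")     # 0.0228%
print(one_in)            # 4390
print(extra_executions)  # 115
```

In other words, roughly one create in every 4,390 ran more than once, which rounds to the 0.02% quoted above.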
Going forward, know that Lambda will eventually execute functions more than once. Unless you use correlation IDs, atomic counters or another de-duplication pattern, you could be creating multiple rows in a database, multiple resources or twice as many artefacts as intended. The documentation calls this behaviour out, but I’ve rarely seen it accounted for in the many examples I’ve run across to date.