Monitoring AWS Lambda can be a complex and potentially costly endeavor. Here’s what you need to know to stay on track and on budget.
Organizations are already shifting toward serverless cloud computing. In O’Reilly’s “Serverless Survey 2019,” around 40% of respondents said they had adopted serverless architecture, and Forrester’s “Global Business Technographics Developer Survey, 2019” found that 49% of companies are using it or planning to use it within the next 12 months.
AWS leads the way in this space: according to Serverless.com, 96% of respondents choose AWS Lambda over Azure Functions and Google Cloud Functions. AWS Lambda lets you design a lightweight, robust application architecture that can be built and deployed without much hassle. However, nothing comes for free. There are several potentially costly complexities when monitoring AWS Lambda. Fortunately, there is also an array of solutions.
Logging
By default, AWS Lambda’s stdout and stderr messages are sent to CloudWatch Logs, where every function gets its own log group. A log group is a collection of log streams (sequences of log events) that can be retained for as long as you want. Storing logs costs USD 0.03/GB per month, and data ingestion costs around USD 0.50/GB.
It is common practice to break up your Lambdas into single-responsibility functions. For example, if you need one Lambda to read a certain object and another to write it, you set up two separate functions. However, this also means two log groups.
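Keeping each function single-purpose also pays off in the logs: if every handler emits structured JSON on stdout, CloudWatch captures it as-is and the events become easy to filter later. A minimal sketch (the event shape and field names here are illustrative, not from any particular codebase):

```python
import json

def read_order_handler(event, context):
    """Single-responsibility Lambda: only reads one object.

    Anything printed to stdout lands in this function's own
    CloudWatch log group (/aws/lambda/<function-name>).
    """
    order_id = event.get("order_id", "unknown")
    # Structured JSON lines are far easier to query later than
    # free-form text messages.
    print(json.dumps({"level": "INFO",
                      "action": "read_order",
                      "order_id": order_id}))
    return {"statusCode": 200, "order_id": order_id}
```

A corresponding `write_order_handler` would live in a second function, and therefore a second log group.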
This creates a challenge when you want to analyze your application’s output logs, because there is no direct way to run a query across multiple log groups. In response, AWS introduced CloudWatch Logs Insights, which can query several log groups in a single request. While this brought many of the improvements users wanted and needed, queries are billed by the amount of log data scanned, so analyzing multiple log groups together comes at a higher cost. This prohibitive cost model encouraged many to look to external services.
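With Logs Insights, one query can span several log groups at once. A sketch of how that might look with boto3 (the log group names are hypothetical, and the actual API call is kept behind a guard so the helper runs without AWS credentials):

```python
import time

def build_insights_query(log_groups, query, minutes=60):
    """Assemble the parameters for CloudWatchLogs.start_query,
    which accepts a list of log groups in a single request."""
    now = int(time.time())
    return {
        "logGroupNames": list(log_groups),
        "startTime": now - minutes * 60,  # epoch seconds
        "endTime": now,
        "queryString": query,
    }

params = build_insights_query(
    ["/aws/lambda/read-order", "/aws/lambda/write-order"],  # hypothetical functions
    "fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
)

if __name__ == "__main__":
    import boto3  # needs AWS credentials; kept out of module import
    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    # Poll get_query_results(queryId=...) until the query completes;
    # note that each query is billed per GB of log data scanned.
    print(logs.get_query_results(queryId=query_id))
```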
Tracing
AWS X-Ray enables you to trace requests through the underlying components of your application and helps you dig out the root cause of distributed issues. It has quickly become an essential troubleshooting and optimization tool for engineers.
For example, cold start latency can be problematic for AWS Lambda functions with slow startup times. Although you can counter this by enabling Provisioned Concurrency, you will need to pay extra.
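Enabling Provisioned Concurrency is itself a small API call. A sketch of the parameters involved (the function name, alias and count are illustrative; the boto3 call is guarded so the helper runs locally):

```python
def provisioned_concurrency_params(function_name, qualifier, executions):
    """Parameters for Lambda.put_provisioned_concurrency_config.

    Provisioned Concurrency keeps the given number of execution
    environments initialized and warm, removing cold starts for
    them -- at an additional hourly cost.
    """
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,  # an alias or published version, not $LATEST
        "ProvisionedConcurrentExecutions": executions,
    }

params = provisioned_concurrency_params("read-order", "live", 5)  # hypothetical

if __name__ == "__main__":
    import boto3  # needs AWS credentials
    boto3.client("lambda").put_provisioned_concurrency_config(**params)
```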
If you have budget constraints and still want to resolve the issue, X-Ray can help you find and remove the bottleneck. It tells you, among other things, whether your application spends more time in initialization or in invocation. The X-Ray SDK is available for Go, Java, Node.js, Python and .NET.
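In Python, instrumenting a function is mostly a matter of decorating the code paths you care about. A sketch assuming the `aws-xray-sdk` package; the fallback no-op decorator is my addition so the same module also runs where the SDK is not installed:

```python
try:
    from aws_xray_sdk.core import xray_recorder, patch_all

    # Log instead of raising when no segment is active (e.g. local runs).
    xray_recorder.configure(context_missing="LOG_ERROR")
    patch_all()  # auto-instrument supported libraries such as boto3
    capture = xray_recorder.capture
except ImportError:
    # No-op stand-in so the handler still works without the SDK.
    def capture(name):
        def decorator(fn):
            return fn
        return decorator

@capture("load_config")
def load_config():
    # Work done at module scope runs during the Lambda init phase;
    # as its own subsegment it makes clear whether time is spent
    # in initialization or in invocation.
    return {"table": "orders"}  # hypothetical configuration

CONFIG = load_config()

@capture("handler")
def handler(event, context):
    return {"statusCode": 200, "table": CONFIG["table"]}
```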
Debugging
Traditional and serverless architectures are as different as night and day. Serverless requires changes across the software engineering spectrum, including how you debug.
A serverless application is a combination of various managed services and resources, such as Amazon API Gateway, CloudWatch and SNS. Distributing work across web services this way helps avoid a single point of failure (SPOF), but the design creates a problem: debugging or replicating an issue on a local machine is not always feasible, and debugging a production application is always a dangerous proposition.
Moreover, the abstraction of AWS Lambda makes debugging complicated, as you don’t have direct access to the underlying OS. Hunting down dispersed logs adds to this overhead, and as your serverless infrastructure scales, it becomes ever more difficult to identify the root cause of a problem.
To combat this, developers can use AWS-native tools such as CloudWatch, X-Ray or the AWS Serverless Application Model (SAM). Any of these tools on its own can be difficult to use; together, however, they form a powerful debugging suite.
Outside the AWS-native toolset, some tools provide further assistance with debugging. For example, the Serverless Framework, an open source tool for developing and deploying serverless applications on AWS, lets you tail a live function’s log from your terminal with the command serverless logs --function <name> --tail.
Monitoring
Robust monitoring ensures the smooth operation of an application and helps to detect issues quickly. In traditional infrastructure, metrics come from only a handful of sources, so collecting them together is less of a challenge. In a serverless architecture, the problem is much more complex.
AWS gives you the freedom to convert CloudWatch Logs into metrics. These metrics can be filtered, measured and, when a threshold is crossed, alerted on. You define a metric filter on the log group created by your serverless function, targeting its log streams. This lets you create a visual reference for your log events. Each custom metric you create costs $0.30 per metric per month for the first 10,000 metrics.
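Creating a metric filter can be scripted. A sketch, assuming the function logs JSON lines with a level field (the names and namespace are illustrative; the boto3 call is guarded so the parameter-building part runs without AWS credentials):

```python
def error_metric_filter(log_group, metric_namespace="MyApp"):
    """Parameters for CloudWatchLogs.put_metric_filter: count every
    JSON log event whose level field equals ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",
        # CloudWatch's JSON filter-pattern syntax.
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": metric_namespace,
                "metricValue": "1",   # add 1 per matching log event
                "defaultValue": 0.0,  # emit 0 when nothing matches
            }
        ],
    }

params = error_metric_filter("/aws/lambda/read-order")  # hypothetical function

if __name__ == "__main__":
    import boto3  # needs AWS credentials
    boto3.client("logs").put_metric_filter(**params)
```

The resulting ErrorCount metric can then be graphed or wired to a CloudWatch alarm like any other metric.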
Metric filters are a powerful tool, but they can also be a bit overwhelming to set up, and there are certain limitations as well. For instance, you can’t parameterize metric names, which forces you to create a separate filter for every custom metric you want to track. There is also a hard limit of 100 metric filters per log group, which can become restrictive as your application evolves.
Beyond this, AWS resources are tightly integrated with each other, which makes AWS-native monitoring tools a fine choice when you need to monitor individual AWS resources or Lambda functions. But to get an infrastructure-wide view, and to monitor third-party tools as well, a centralized logging approach is necessary to take full advantage of the log data.
Conclusion
There is no denying that serverless tooling quickly unlocks high availability, scalability and cost optimization. To make the most of this new paradigm, engineers need to build a robust and reliable monitoring system. The ability to manage log data streaming from many components, detect and trace issues on the fly, and deliver accurate insights that keep improving the application is crucial to remaining online and competitive.
While serverless solutions give access to many of the technical features we need, the dangers of vendor lock-in and the power of third-party providers must be considered if you want to scale your organization and its infrastructure smoothly.