End-User Monitoring in Production

On April 2, Zachary Henderson, Lead Solution Engineer at Catchpoint, spoke at our Test in Production Meetup on Twitch.

Zachary explained how proper RUM and synthetic data (monitoring in production) can be leveraged as a way to also test in production. He also shared that, for actionable end-user monitoring, you need a system that can ingest large amounts of raw data, slice and dice that raw data with any number of dimensions, and visualize it appropriately.

Watch Zachary's full talk. Join our next Test in Production Meetup on Twitch.


FULL TRANSCRIPT:

Zach Henderson: I'm very good. Thank you Yoz for having me. It's a pleasure to talk to you all today.

Yoz Grahame: Excellent. Thank you so much for joining us. And you are going to talk to us about end-user monitoring in production.

Zach Henderson: That's correct.

Yoz Grahame: Which is useful for all kinds of things, especially when you're testing in production. We will start you off in just a moment. Just a reminder that Zach will be presenting for roughly how long? About 15, 20 minutes, something like that.

Zach Henderson: Yeah, just about.

Yoz Grahame: Just about. And we have the Twitch chat live, and myself and Kim, our producer, will be fielding questions in the chat. But we'll also be taking questions with Zach after his talk on all matters of testing in production, and especially monitoring in production. So I will switch it over to your slides and take it away, Zach.

Zach Henderson: Great, thank you. Yeah. And so good day everyone. Happy Thursday. As Yoz was saying, Zach Henderson here. I'm a solution engineer at Catchpoint. We are a monitoring company. So today's talk is all about end-user monitoring. The idea of this group is testing in production, and there's a whole lot of things that go into testing in production. Monitoring is just one key part of that, usually after release or post-release. This is where you're trying to guarantee that the systems and services that you built are actually working for the people that you built them for. I know there's a whole lot of things to testing in production, so today's talk is just mostly on that monitoring piece. But if you have any questions about other pieces, I have my thoughts on them, and I'm happy to talk through at least what I know on those angles as well.

And so the general idea of end-user monitoring in production is the fact that it is in production. You're essentially monitoring the world, and the world is asking, Hey, can you tell me how your real users actually perceive your app or service's performance and its availability? Because if you understand how real users perceive it, that's basically the key to optimizing the right things, finding the right issues to fix and improving your user experience overall. And when we say production and monitoring of it, it's a very big world. Your app that you've developed, that you've worked hard on, will run on thousands of different devices, from smartphones, laptops, TVs and gaming consoles to different web browsers, all those things. And those devices that you're trying to get an understanding of are from around the globe, in different countries and different regions, and all have a different set of capabilities, resources and constraints. And then there's a whole variety in the network piece.

The internet is not just one big network; it's a bunch of little tiny networks talking to each other. And so the idea is, can you monitor that type of interaction for your users? It's just a lot of variables to consider. So in general, the idea is that those different things all work together for a user. When you have your app or system that you've developed, you've pushed it into the cloud, or multiple clouds, maybe you have multiple data centers, or you're co-located somewhere, for example. Your users want to access those apps, they want to access how you've delivered those apps. And so they'll come from their homes, their coffee shops, their offices, their environments, with different browsers, on different machines, from anywhere around the globe. And they'll access things through the Last Mile network that they pay for internet with, and those Last Mile networks carry traffic over the backbone of the internet.

And they'll talk to your DNS providers, your DNS servers, your content delivery networks, security vendors you have, even third party services that you integrate with the app. And for a user, they don't just see your app in the cloud or in the data center. They see this whole complex user journey. And so it's a lot of things to think about. It's overwhelming a lot of times just thinking, Hey, I'm building something. Is it actually going to work for the users? So I need to go ahead and monitor that. And so the idea is that you can actually boil it down to these four general concepts. These are kind of pillars, if you think about it, which you can build different strategies on for monitoring your end-user experience. So the first is availability.

Is your application simply up? If it's down, it's going to be down everywhere regardless, right? But then another idea is, is it reachable? Because your app can be 100% up, but if your users can't get to it, then to them it's down. The idea is that if they can't reach it, it's not available for them. So that's a really big concern that people have to think about. And then there's the idea of reliability: can they actually trust it? If you're developing something that they need to do their work, to see and talk to their friends and family members, all these different things you use the internet for, can they trust it to do what they expect whenever they use it? And then finally there's the idea of performance. If you're up, you're reachable and you have some trust built into those factors.

Is it fast enough to do what they need to do? And your app is going to be completely different than the app someone else is developing. But users expect speed, they expect quality of service, and they don't want to have to wait, because bad performance is really just the new downtime nowadays. And so to actually monitor your real users, the actual end-users themselves, there are really two methodologies. You can use Real User Monitoring, which is the idea of real-time data captured from the actual visitors to your app or service. That's the actual user experience being reported back. And there's the idea of synthetic monitoring, where real-time data is captured from robotic users. The idea is that these two methodologies can really complement each other. So you get a real user understanding of the system, and you get a really proactive, protective understanding of your system from this robotic traffic to your site or service, for example.

And so looking back at those different concepts. Availability: it's basically only going to be internal synthetic monitoring. You want to have a robot that's in your data center or your cloud environment that's constantly pinging and probing all the things that you rely on from that area, so you can prove that that robot is up and telling you that things are available. And usually you actually want that robot just a little bit outside of the environment that you're hosted in, so you can have out-of-band monitoring of availability. Reachability, a lot of people tend to think of as external synthetic monitoring: can a robot, 24/7, from different areas around the globe, on different networks, Last Mile, Backbone, wireless networks, actually reach different parts of my system?

Actually, there's some technology out there nowadays where we can use some supporting RUM data to answer this idea of reachability as well. And then reliability and performance, those are both going to be datasets that you can get from synthetic, internal and external, and real user technology as you get traffic visiting your site. So the idea here is, let's break it down, starting with availability. Availability is: is your application up, and can it respond to requests? It's a very simple thing. It's basically your app in the cloud or your data center asking, am I up? Am I available? Can I actually do what you've built me to do? So essentially that's just active synthetic probes, where you have some sort of robotic agent that's probing different parts of your system and saying, are you up or are you down? And again, it's from those data centers or cloud regions; the idea with availability is that if you're down there, you're going to be down everywhere.

So pure availability is mostly just about where you're hosted, where your application is living. And then just a quick example: for a web app, a very popular check is mostly just getting an HTTP 200 response code from a web server, maybe doing some sort of content validation. That's the majority of what people are doing, just simple availability checks that say, Hey, can this thing actually reply back in a healthy manner? But for your database or your load balancers or other parts of your system, there may be different types of checks you would do, just health checks of your application. And then the idea is that you do this both for the user-facing application, the thing that users are actually accessing, the app or the package that you're delivering, but also for all your dependent services.
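
A minimal sketch of the kind of availability check described here, confirming an HTTP 200 and doing a simple content validation; the target URLs and expected marker strings below are placeholders, not anything from the talk:

```typescript
// Minimal availability probe: HTTP 200 plus a content check.
// The URLs and expected marker strings are placeholders.

interface ProbeResult {
  up: boolean;
  status?: number;
  latencyMs: number;
  error?: string;
}

async function checkAvailability(url: string, expectedText: string): Promise<ProbeResult> {
  const start = Date.now();
  try {
    const res = await fetch(url, { redirect: "follow" });
    const body = await res.text();
    return {
      up: res.status === 200 && body.includes(expectedText), // status + content validation
      status: res.status,
      latencyMs: Date.now() - start,
    };
  } catch (err) {
    // Network-level failure (DNS, TCP, TLS) also counts as "down" from this probe's view.
    return { up: false, latencyMs: Date.now() - start, error: String(err) };
  }
}

// Example: run the same check against the user-facing app and its dependencies.
async function main() {
  const targets = [
    { url: "https://example.com/", marker: "Welcome" },
    { url: "https://api.example.com/health", marker: "ok" },
  ];
  for (const t of targets) {
    console.log(t.url, await checkAvailability(t.url, t.marker));
  }
}

main();
```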

Because if those dependent services are down, the user will probably see part of the application being down, and to them, it's not going to be a good experience. So the system you built has dependent services; make sure you're monitoring the availability of all those dependent services. And if you're building SLOs, SLAs and SLIs around the uptime of your services, really understand how those roll up to that user experience. This is where you get into SLA calculus; it's a little bit the idea that if my database has to be up 99.95% of the time, my users may have a bit of a different SLA, at 99% for example. So the calculus there is really important, but for most SLAs, SLOs and SLIs, it's purely based on this type of availability. So it's a lot of things that you can typically control in your environment.
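
To make the SLA calculus point concrete: when a user-facing request depends on several services in series, their availabilities multiply, so the composite number is lower than any individual SLA. A tiny sketch with made-up numbers:

```typescript
// Composite availability for services a request depends on in series:
// the probabilities multiply, so the user-facing SLO is lower than any
// single dependency's SLA. The numbers below are illustrative only.

const dependencies = [
  { name: "load balancer", availability: 0.9999 },
  { name: "app tier", availability: 0.9995 },
  { name: "database", availability: 0.9995 },
  { name: "DNS provider", availability: 0.9999 },
];

const composite = dependencies.reduce((acc, d) => acc * d.availability, 1);

console.log(`Composite availability: ${(composite * 100).toFixed(3)}%`);
// => roughly 99.880%, even though every individual SLA is 99.95% or better.
```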

And so a few things to consider: it's binary in nature. Either you're up or you're down, that's kind of the general feel there. Each service and dependency will have its own availability, but the user is basically asking, Hey, is this service available to me? And that's a really critical thing. And then usually, for a lot of people, it's actually the easiest to monitor, understand and conceptualize. You can really understand if your service is up or down and you can really get a sense of monitoring that. And so in regards to availability, synthetic traffic, that robotic traffic, is really the only way you can actually measure this uptime. If you rely on your users, what you're basically doing is finding out you're down when you get a bunch of messages on Twitter or a bunch of calls to your support channels. And you can't really rely on traffic volumes, because you have no idea what's causing those dips in traffic.

You only know that perhaps your site or service could be down, for example. And then remember, your app can still be 100% available, it can be completely up, but users can still complain about something not working quite right with the app. Hence the idea of reachability. Can a user from their home or office traverse the actual internet and their local network, talk to your DNS provider, talk to your CDN provider, any security providers you have, any third party services you may have embedded, and get through all those things to actually access the app you have running in the cloud or in your data center? And so this idea of reachability is a great big concept. A lot of people are trying to get their heads around how we can make the internet more reliable to guarantee reachability of our services for our customers.

Because that's really the end game. You want to have your app available and reachable to your actual users. And so a few things to consider when you talk about reachability: that's basically, is my origin reachable? Is my DNS server reachable? Is my CDN provider reachable? If those services aren't reachable on the internet, a lot of times you're going to be relying on data that's in your application or code to help you answer, is my system reachable? And then basically the idea is that for reachability, a lot of times network conditions matter. So can you actually see, are you reachable from different ISPs? For example, my reachability to Zoom or Facebook or Amazon from Boston on Comcast is completely different from someone else's reachability when they're coming from San Francisco or Seattle on Verizon or some other Last Mile provider.

So where you stand on the internet really changes your view of how things are reachable. And then a general sense too is that not only do you need to monitor the reachability of your code or the systems that you rely on, but third parties as well. So if you rely on data from Facebook, or rely on ad calls that make your services monetarily worth it for the business, or maybe you have APIs you rely on to pull data in: if they're not reachable to you, they're essentially down. So really understand that, because that type of stuff can cascade and impact your actual users. And then a lot of times people tend to think that reachability is something they can't control. I can't do anything about Boston and Comcast having an issue. But the idea is that 95% of your users don't really know that.

They see your site, they see themselves trying to get to your site, and they're going to complain to you that they actually can't reach it. They're going to say, Hey, why can't I get to your service? And they're going to hold you accountable. Because if you think about it, the next thing they do is go to google.com and say, Hey, I can get to Google. Everything's fine where I am, but for some reason, your system, they can't actually reach. You're basically being compared to the likes of Google, Amazon, Apple and Facebook, for example. And these reachability issues, you can't always fix them, but you'll still be held accountable, have to communicate about them, and basically waste time trying to figure out, is it us? Is it the network? Is it some sort of third party that's causing issues for our customers?

And usually, again, it's highly network dependent and infrastructure dependent. So think of your DNS resolution issues. Think BGP route leaks on the internet, poorly performing internet service providers, all these things that you're trying to work with. If you don't have a handle on or visibility into these types of components of your system, they can really cause some issues. But it's not only the network; it can also be your application, your infrastructure, the environments you're hosted in, the firewalls you rely on, the load balancers and things like that. If any of those components aren't working, essentially your app could be 100% running but users can't actually get to it. And also, a really key thing that I mentioned earlier with reachability is that a lot of people think it's only synthetic data that can answer this, because you have, like, Hey, a robot trying 24/7 to reach my service.

But there's actually a nice set of work that Google is doing with network error logging, where you can actually tie in your domain and have Chrome report back anytime a real user who is trying to access your site, service or domain gets a DNS error or a TCP timeout error. So there's actually been a lot of work recently in getting real user data coming into your site, your systems, your logging, whatever you're using, to say, can actual users get to my system? Is my DNS reachable? Is it failing for real users? Are connections being closed, TCP connections dropped or aborted, and things like that? We could talk more about this in general, but I do think there is some cool stuff you can do with real user traffic to define: am I reachable out there on the internet?
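
For reference, opting a domain into Chrome's Network Error Logging boils down to sending the NEL and Report-To response headers. Here is a minimal sketch of a server doing that; the collector URL and the exact values are illustrative assumptions, so check the NEL spec before relying on them:

```typescript
// Sketch: opting a domain into Chrome's Network Error Logging (NEL) so real
// user failures (DNS errors, TCP timeouts, aborted connections) get reported
// back to a collector. Header values and the collector URL are illustrative.

import { createServer } from "node:http";

const REPORT_TO = JSON.stringify({
  group: "network-errors",
  max_age: 2592000, // remember this reporting endpoint for 30 days
  endpoints: [{ url: "https://reports.example.com/nel" }], // placeholder collector
});

const NEL = JSON.stringify({
  report_to: "network-errors",
  max_age: 2592000,
  failure_fraction: 1.0, // report every failed request; sample this down in production
});

createServer((_req, res) => {
  // The browser remembers this policy and reports later failures to the
  // collector, including ones this server never sees (e.g. DNS resolution errors).
  res.setHeader("Report-To", REPORT_TO);
  res.setHeader("NEL", NEL);
  res.end("ok");
}).listen(8080);
```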

And so once you're available, once you're reachable, you want to answer: am I fast enough for my users? That's where performance comes in, one of the other pillars of end-user monitoring. And a general sense here is, first off, what does performance actually mean to your business? To your users? And the metrics you use matter, because those metrics are what you're going to be explaining to your user base about what's going on. And so a lot of people have done work in their environments to say, Hey, what does it actually mean for performance to happen for our users? So you'll see metrics like first paint times, first meaningful paint times, time to interactive, the idea of what a user actually sees when they try to load your site, your service, your system. These kinds of components, these user-centric metrics, are really powerful for you to explain and understand what it actually means for your site or service to be fast for your users.
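
A minimal sketch of pulling a couple of those user-centric metrics (first paint and first contentful paint) out of the browser's performance APIs; time to interactive has no native entry and usually comes from a library, so it is left out, and the beacon endpoint is a placeholder:

```typescript
// Sketch: reading user-centric paint metrics from the browser's Performance
// APIs as part of a RUM snippet. 'first-paint' and 'first-contentful-paint'
// come from 'paint' entries; time to interactive has no native entry and
// usually comes from a library, so it is omitted here.

function reportPaintMetrics(beaconUrl: string): void {
  const observer = new PerformanceObserver((list) => {
    const metrics: Record<string, number> = {};
    for (const entry of list.getEntries()) {
      // entry.name is 'first-paint' or 'first-contentful-paint'
      metrics[entry.name] = Math.round(entry.startTime);
    }
    // Ship the measurements without blocking the page.
    navigator.sendBeacon(beaconUrl, JSON.stringify(metrics));
  });
  // buffered: true also delivers paint entries that fired before we subscribed.
  observer.observe({ type: "paint", buffered: true });
}

// Usage (beacon endpoint is a placeholder):
reportPaintMetrics("https://rum.example.com/collect");
```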

Basically, if you look at it this way, you can say, Hey, is my service happening? Is it useful? Is it usable? And if you take a look at these high level user-centric KPIs, you can then understand, Hey, is it the network? Is it the code? Is it some sort of third party issue? Is it a [inaudible 00:15:09] browser version that they're running on? All those kinds of underlying variables you can answer, as long as you have these KPIs to really understand that user journey first, rather than looking at just basic response times or latency on the network or things like that. And so of the two concepts we talked about here, real user data will be the actual performance of your users. You can't really argue with real user browsers nowadays reporting back the latency and performance and experience of your actual visitors.

It's all standardized, you can all agree on it, and it's coming from actual visitors. Whereas synthetic data is going to be that proactive protection and that consistent 24/7 baseline of your performance that you can use to get alerts off of, that you can compare your real user traffic against, and that you can use to make continuous improvements and know exactly how your system or service is getting better. If you do synthetic in a way where you try to get as close as possible to real user data, you can really combine that synthetic robotic performance data to see the impact it would have on your real user traffic. And so performance has a lot of variables involved. That user journey that we talked about before is pretty complex.

You have your browser, the version of your browser, your OS, the continents, regions, networks, providers, the pages that people are visiting. There's a whole ton of variables to consider. So when you're looking at a tool to analyze that data, really make sure that you can ask those questions: how is my performance by this browser, by this version, by this network, by this region, without being limited to a set of predefined views initially. So the idea of asking questions on that data, and making sure that data is accurate and of high quality, is really important. Because if you think about it, you're going to use this type of end-user data to make informed decisions on: should we invest in a new CDN in that area, should we invest in maybe a backup DNS provider because we're seeing DNS latency causing first paint and interactive issues on our site?
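
A minimal sketch of what that slice-and-dice looks like over raw RUM beacons, grouping the same data by whichever dimension you want to ask about; the field names and sample values are made up for illustration:

```typescript
// Sketch: slicing raw RUM beacons by an arbitrary dimension (browser, region,
// ISP, page...) and summarizing each group, rather than relying on one
// predefined view. Field names and sample data are illustrative.

interface RumBeacon {
  browser: string;
  region: string;
  isp: string;
  page: string;
  loadMs: number;
}

function sliceBy(
  beacons: RumBeacon[],
  dimension: keyof Omit<RumBeacon, "loadMs">,
): Map<string, { count: number; medianLoadMs: number }> {
  const groups = new Map<string, number[]>();
  for (const b of beacons) {
    const key = b[dimension];
    (groups.get(key) ?? groups.set(key, []).get(key)!).push(b.loadMs);
  }
  const summary = new Map<string, { count: number; medianLoadMs: number }>();
  for (const [key, values] of groups) {
    const sorted = [...values].sort((a, b) => a - b);
    summary.set(key, {
      count: sorted.length,
      medianLoadMs: sorted[Math.floor(sorted.length / 2)],
    });
  }
  return summary;
}

// Ask the same question along different dimensions of the same raw data.
const sample: RumBeacon[] = [
  { browser: "Chrome", region: "US-East", isp: "Comcast", page: "/", loadMs: 2100 },
  { browser: "Safari", region: "US-West", isp: "Verizon", page: "/", loadMs: 4800 },
  { browser: "Chrome", region: "US-West", isp: "Verizon", page: "/checkout", loadMs: 5200 },
];
console.log(sliceBy(sample, "region"));
console.log(sliceBy(sample, "isp"));
```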

Or perhaps we need to roll back that code that we pushed out and thought we had tested pretty well, and maybe refactor what we're thinking about in terms of delivering that new feature or function to the product. So having accurate data and quality data is really important; these are really important decisions you're making about the future of your application and system. But also, if you think about it, the analytics and visualization of that data is almost equally important. If you show someone a number that says, Hey, we load in five seconds on average, and then you show them a line chart, they can say, Hey, that's pretty useful information, but they may not know what it actually means to load, for example. So things like filmstrips and screenshots and other data visualizations really drive that point home, especially as you communicate to different stakeholders, different user groups and things like that. Because they have to make their own decisions on how to allocate budget, how to prioritize different features, and how to change priorities when you're developing, and things like that.

So making sure you have the right platform to analyze and visualize that data is also going to be really impactful for how you make use of this type of performance information. And then a really key thing about performance: averages can be very misleading. So let's go through a quick example here. We can all agree that each data set we have here is pretty unique. You have a nice dinosaur dataset, you have a nice circle, a bullseye, horizontal lines, stars, things like that. And you can imagine that these are data sets coming in from actual visitors to your site or service. They won't ever look like a dinosaur, but the idea is that you can, at first glance, see that there's some sort of delta, some change, in this data. But then they tell you that each data set here has the same average in the X direction, the same average in the Y direction, the same standard deviation in the X direction and the same standard deviation in the Y direction.

So if you're purely looking at numbers like means, standard deviations and things of that nature, you're kind of missing the point: underneath all that there is some sort of pattern in that data that you may not know about. Your star could be completely different than your dinosaur, so to speak. And so the idea is that in performance, especially in monitoring end-user performance, raw data really matters, because you can draw some really powerful conclusions that may be completely incorrect if you rely on just a calculation or some sort of high-level view of that data, unless you have that raw data to look at and really understand these underlying patterns. And so, walking through a bit of an example here: if I said, "Hey, my site on average loads in nine seconds, for our industry that's pretty good."

We compare and contrast against our competitors. But you have no idea what that actually means to your users. So histograms are really important in a performance analysis, where you can say, Hey, that average or that mean is basically right here, but we have a lot of traffic that's faster than that, and a whole lot of traffic that's slower than that, and that nine seconds does not actually mean anything to any user. In fact, with your average speed or page load time, you can actually have an example where no traffic is actually at that average and you have two very divergent or bi-modal peaks in your data. And even things like time series can say, Hey, yeah, we have an average of our page load time, and we see a spike here, but you don't really know how many people are really slow or how many people are really fast, for example.
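
To make the bi-modal point concrete, here is a small sketch that buckets raw load times into a histogram and computes percentiles; the sample data is synthetic, built to have a fast cluster and a slow cluster whose mean lands near nine seconds even though almost no request is actually near the average:

```typescript
// Sketch: why an average hides bi-modal performance. Bucket raw page load
// times into a histogram and compute percentiles instead of relying on the
// mean. The sample data is made up to show clusters around 3s and 15s.

const loadTimesMs: number[] = [
  ...Array.from({ length: 50 }, () => 2500 + Math.random() * 1000),  // fast cohort
  ...Array.from({ length: 50 }, () => 14000 + Math.random() * 2000), // slow cohort
];

function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

function histogram(values: number[], bucketMs: number): Map<number, number> {
  const buckets = new Map<number, number>();
  for (const v of values) {
    const bucket = Math.floor(v / bucketMs) * bucketMs;
    buckets.set(bucket, (buckets.get(bucket) ?? 0) + 1);
  }
  return new Map([...buckets.entries()].sort((a, b) => a[0] - b[0]));
}

const mean = loadTimesMs.reduce((a, b) => a + b, 0) / loadTimesMs.length;
console.log(`mean ~${Math.round(mean)}ms`); // ~9000ms: almost no user actually sees this
console.log(`p50  ~${Math.round(percentile(loadTimesMs, 50))}ms`);
console.log(`p95  ~${Math.round(percentile(loadTimesMs, 95))}ms`);
for (const [bucket, count] of histogram(loadTimesMs, 2000)) {
  console.log(`${bucket}-${bucket + 2000}ms  ${"#".repeat(count)}`);
}
```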

And so underneath that, taking the histogram view and looking at it in terms of heat maps really tells you where a lot of your traffic tends to lie over time, and when there's a spike you can say, Hey, actually all of our users had poor page load times, or maybe only a certain subset did over that time. So data visualization, having tools that can bucket this data and look not just at numbers or trends but at the underlying distribution of that performance data, can be really impactful and can really help you understand your performance, not just report a number for it, for example. And then reliability is the hardest part, in my opinion: this idea of making sure your availability, your reachability and your performance are consistent.

Because if you think about it, reliability for a user, of the actual thing you're delivering on your site to your user base, is basically an emergent property of many unrelated, independent systems. If I visit a site or service and there's a missing widget on the page, that's an unreliable service, even though everything else on the page can be completely reliable or available. So don't define your own reliability for your user base; your users have to define that reliability for you. And if you think about it, if the reliability you're analyzing has been determined by your user base, then you're basically driving really good customer success, because you're making sure that the users accessing your service are happy in a way that you can actually measure and monitor.

You're not doing it the other way around, where you say, Hey, it doesn't really matter, our site's available 99.9% of the time, but for you, our particular user, it was down two hours during the day, for example. So really listen to your user base, have them tell you what the site and service reliability should be, and then work backwards from there. And the idea here is that long-term trends also matter quite a bit, especially with the metrics and the dimensions of the data that we've talked about. So to show you a quick example of that, let me show you this data set here. This is actually some data from us doing a Google search. The idea is, Hey, Google has been pretty consistent in how long it takes to deliver everything when you do a Google search, for example. You can say, Hey, actually they're pretty reliable, they're pretty consistent.

Now, if I were to tell you that this is basically a month's worth of data here, you'd think, well, Google is doing a really interesting job in keeping their search times pretty consistent. But then think about the long-term trend of this. Let's zoom out a bit and look at about a year's worth of data. What you start to see is that there's actually a bit of a delta here in terms of how long it takes that search to run in Google, for example. And that dataset I was showing you before was just November, right here. When you actually look at it and say, Hey, show me this trend over a full year, you start to see that they actually got a bit slower in October and November, and then they improved it and got better in January, February and March, for example.

The idea is that long-term trends matter, and having the data quality behind it to really ask and see over this type of long timeframe will be really impactful, because if you just look at the very small windows of data as they come in, or let's say even over a month's time, you're going to miss those long-term reliability concerns that could actually impact your business and make it really hard for users to trust your system overall. And so the hard part here is that you need a monitoring methodology, and people and processes in place, with the rigor of a data scientist. Take a long-term trend, separate it out by the things you know you want to separate it out by, maybe by region or country or ISP, but also be able to ask questions on that data that you may not know about yet as well.

So the idea of observability comes into play here, where, as you get this data coming in from real users and synthetic, you have to be able to ask questions of that data: the ones that you can think of, and the ones that you can't think of. And so the idea here is, when you have a monitoring system in place, just make sure that you can leverage it in ways that you can't think of yet. Which is a real catch-22: how do I know what to think of? So when you ask questions of the data, think about data quality and data integrity, and make sure you can ask the right questions, so you're not being misled by some sort of tool and making poor decisions off of it, for example. And so here's a quick example in action. At Catchpoint, we actually drink our own champagne, so to speak, and we monitor all of the SaaS tools that we leverage.

So we're kind of an O365 shop. And what you start to see here is a trend that says, Hey, there's a bit of a dip in availability for some O365 tools like SharePoint, Outlook and OneDrive, for example, and even a spike in performance during that timeframe. This is an actual example that impacted some of my colleagues. You can even see that this performance spike here was only happening for a certain subset of values; there were some performance ranges that were actually pretty consistent with where they had been over time. That's the idea of a heat map. And what's really important is that you want to ask questions on that data. And so what we did is say, "Hey, let's see if it's actually localized to a particular region or office or city."

And so again, you see we have that raw data, that scatterplot, and can say, Hey, it looks like it's just happening from New York, for example, as opposed to where I'm based in Boston or my colleagues in Bangalore. Some sort of very localized issue here accessing O365 from New York. So you have some hints of a problem here: it's happening in New York, it's O365. But then you want to see what it actually means for this to be slow for my colleagues and the customers we work with. And what you start to see is, Hey, actually it means nothing's being displayed on the screen, there's a very long wait time for things actually rendering on the page, and even things like high interactive times. So you have a sense now of, yeah, it actually is impacting actual people.

This is actually how a user would feel that the site is slow, and there's some real issue going on here that we should probably take a look at, try to see if we can improve, and talk to Microsoft about. And so, again, the idea that analytics and visualization matter. We're a bit of a small company, about 250 employees, but we use maybe over 100 SaaS apps across engineering and product and sales and all these tools. So we actually monitor all those systems. And what you start to see is that, Hey, there's actually something happening where everything from the perspective of New York and our office was seeing a spike in performance. And then you start to say, "Hey, it's something in New York, and it seems to be impacting a bunch of different tools that we're monitoring."

But what you start to see here is that, Hey, there's some sort of network component that's different between the failures and the successes. So in our New York office, we have a main ISP, Pilot, and we have a backup ISP provider, AT&T, for example. We get a real sense here that, Hey, when traffic went through AT&T for just a quick moment, everything got poor performance, and all of a sudden there are some sort of issues going on. So you get a sense, Hey, there's some sort of network issue going on here that's impacting the user and the user experience. And so as we drill down further, we say, Hey, network data is usually best visualized when you have some sort of path or flow analysis. You actually start to see, Hey, when traffic went through Pilot on the way through other providers to reach Microsoft, for example, there is no packet loss.

But when it went through AT&T, you see packets being dropped, you see high round trip times. So you have a sense of flow and of how the networks work that power all of the applications running on top of them, for example. So even taking that further and saying, Hey, for that timeframe where we saw an issue for users, can we guarantee that during that exact time window we were always on AT&T? This is where things like time series and, again, that raw data come in and say, yeah, actually here on the third hop of the network path trying to reach Microsoft, you see a very sudden shift, where we were using Pilot as one provider with one performance value on the network side, and then we switched to AT&T and were seeing different performance.

Basically, the actual story is that we had to switch over because they were replacing the cable in our local office; they basically had to upgrade it. So we had to switch over to a backup provider, and this actually really impacted people during that timeframe. Then you start to ask, Hey, were actual people impacted by this? This is the real user component. What you start to see is that we actually have an understanding from each user, in our offices and on our endpoints, so we see actual traffic coming in to O365 from our offices. And you start to see that during that same timeframe, all traffic actually died off trying to reach O365. So if you think about what that means, if you're relying purely on real user data to report the issue, you would have no traffic coming in, because you can see it just stops here and then basically starts right back up here.

So the idea is that synthetic data was the only way to actually catch this issue that could have impacted users, but you can also use real user data to answer network problems and things like that. So we were actually able to show that right at the time of the AT&T switch-over, when everyone had been going through Pilot, no traffic was coming in until we got through that quick upgrade we had to do. And you can take this further: let's say some individual was complaining or something like that. You can look down at their individual device, the actual applications, what pages and systems they're visiting, for example. The idea here is that when you're monitoring in production, sometimes you need that high level view of everything, overall uptime and performance across your entire user base. Sometimes you have to actually go down and say, Hey, someone's complaining, let me take a look at their individual session, their individual data, and see exactly how their experience was.

And so that's kind of the idea of what this stuff does in action, combining real user and synthetic. I hope you get a general sense here. I'm happy to answer any questions and talk through any different scenarios. I've been at Catchpoint for quite a while, so for any tools or ideas you have around monitoring in general, I'm happy to help, and I appreciate your time.

Yoz Grahame: There we go. I was muted in place. Sorry, I've been rambling for ages. Thank you so much for that.

Zach Henderson: Oh, you're welcome.

Yoz Grahame: That was a great overview of loads of different kinds of monitoring. We're not done with you yet. I have some questions for you, and we'll also be taking questions from the viewers. So if any viewers have any questions, please post them in the chat. We have about 20 minutes left for taking questions. So the first one for me that springs to mind, and actually I see in the chat my colleague Dawn is asking the same thing, is that for somebody who is new to this, looking at these incredibly impressive graphs and all the different things you're monitoring, it seems incredibly daunting to set up. So if somebody's getting started with monitoring on their own systems, or trying to improve, where should they start? Where's the best value?

Zach Henderson: Yeah, that's a great point. Let me go back here, just to those four pillars. So the easiest place to start is with availability. That's basically, as we were discussing, are you up and available? And a lot of times that's where you have these health checks, these very simple components to say, is this app up and running and available? And then from there you can work towards reachability, which is basically driving either synthetic traffic or understanding that real user data we talked about. That can also be fairly simple as long as you understand what you want to monitor. So a lot of times with external synthetics, you want to think about uptime, or reachability: monitoring from as many places as possible, or from as many locations as possible, as frequently as you can, and then letting that data define baselines for you and using that to improve your strategy over time.

What we tend to recommend a lot of times is that you start with availability and reachability, and then you use that information and build on it to say, okay, we have a good baseline of what users actually think our uptime and reachability are. Can we now use that data to talk about performance? And that's where a lot of people mature, and this idea of monitoring maturity comes into place, where you can start asking some really powerful questions: is my performance getting faster or slower than it has been the past day, the past week? What sort of KPIs should we look for that really indicate user performance and availability? And one really simple thing to do is to rely on RUM at first. RUM has matured a lot, to where it's actually fairly simple to get some sort of instrumentation into your web apps and your clients. And so that's a fairly easy way to get some data in. You have that traffic come in, and then you can use that information to inform what you should do from a synthetic angle, to say, Hey, where should we actually protect our performance and availability? So letting your users tell you what to do first, and then having the monitors behind that build off of your user base, is a really good way to start.

Yoz Grahame: For RUM, for setting up real user monitoring, is it the kind of thing where you basically include a script tag in your site somewhere and the monitoring system can already start grabbing a whole ton of data just from that script tag?

Zach Henderson: Yep, exactly. And the idea is that most modern browsers, actually all modern browsers, support the APIs for this: the Resource Timing, Navigation Timing and User Timing APIs. And so that data's highly accessible, and there are a lot of great libraries out there to access the information. The harder part with the RUM piece is not the data capturing. It's making sure you have a system behind it to query, analyze and store that data in the ways we talked about, with raw data analytics and the questions you can or can't think of at the time. That's usually where people tend to reach out and say, Hey, is there some system out there that has been doing this and can do this for us?
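
A minimal sketch of that data-capture side, reading the Navigation Timing entry and beaconing it to a collector; the collector URL is a placeholder, and the storage and query layer described above as the hard part is not shown:

```typescript
// Sketch: the "easy part" of RUM - capturing Navigation Timing data in the
// browser and beaconing it to a collector. The collector URL is a placeholder;
// storage, querying and visualization live on the backend and are not shown.

window.addEventListener("load", () => {
  // Give the browser a moment to finalize the navigation entry after load.
  setTimeout(() => {
    const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];
    if (!nav) return;

    const payload = {
      page: location.pathname,
      dnsMs: Math.round(nav.domainLookupEnd - nav.domainLookupStart),
      tcpMs: Math.round(nav.connectEnd - nav.connectStart),
      ttfbMs: Math.round(nav.responseStart - nav.requestStart),
      domContentLoadedMs: Math.round(nav.domContentLoadedEventEnd),
      loadMs: Math.round(nav.loadEventEnd),
    };

    navigator.sendBeacon("https://rum.example.com/collect", JSON.stringify(payload));
  }, 0);
});
```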

Yoz Grahame: Exactly. And that's what Catchpoint, like other systems, does: gather all the data and present interesting ways of analyzing, querying and displaying it. Yeah. That's fantastic. So that's standard in web browsers. How about for mobile apps?

Zach Henderson: Yeah, that's a great point. So with mobile apps and those thicker clients or native environments, you can't really have a standard web technology that's going to send you data like you would in a normal web browser. So with real user monitoring there, you have to be really consistent and concise about what you want to monitor. A lot of times the best way to start is to have your developers, as they develop these apps, really think about the kinds of things you want to measure and set up. You can basically develop your own libraries to make sure that in your app and your code you have some sort of custom timers built in. And if you do it the right way, you can actually do it in a way that doesn't add much overhead to those native apps.

And again, old Android devices, for example, can be very resource constrained, with low memory and CPU. So if you take a very focused approach as you develop these applications and think about some sort of telemetry up front, you can get that data coming in just like you would with RUM data in real user browsers. But then also, a lot of times what we see nowadays with native apps is that it's not the app itself. As long as you have a powerful enough device or endpoint to actually run the app, a lot of times it's the APIs and underlying services that it's calling that cause the delays and the spinning wheels that cause issues in the app. So understand how your app actually communicates with your backend, with your content delivery networks, all those API functionalities. If you can replicate those in a synthetic manner, that can be really impactful and really help you understand: are the APIs that this app relies on actually fast, reliable and quick enough for what we need to do?
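
A rough sketch of that custom-timer pattern, written here in TypeScript purely for illustration (the same shape applies in Swift or Kotlin for native clients); the class, the metric name and the telemetry endpoint are all hypothetical:

```typescript
// Sketch of the custom-timer pattern for thick clients, in TypeScript for
// illustration. Names and the telemetry endpoint are hypothetical.

class Timer {
  private start = Date.now();
  constructor(private readonly name: string) {}

  stop(): { name: string; durationMs: number } {
    return { name: this.name, durationMs: Date.now() - this.start };
  }
}

async function postTelemetry(metric: { name: string; durationMs: number }): Promise<void> {
  // Fire-and-forget so telemetry never blocks the user-facing flow.
  fetch("https://telemetry.example.com/metrics", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...metric, ts: Date.now() }),
  }).catch(() => { /* drop telemetry errors silently */ });
}

// Usage: time the dependent API call, since backend/API latency is often what
// causes the spinners rather than the app itself.
async function loadAccountScreen(): Promise<void> {
  const t = new Timer("account_screen_api");
  await fetch("https://api.example.com/account"); // the dependent API call
  await postTelemetry(t.stop());
}

loadAccountScreen();
```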

Yoz Grahame: Great. Okay. So I mean it sounds like more work than with the browser, but are there existing libraries out there for native mobile apps that make it considerably easier to just grab a whole ton of metrics for you? I assume you need to wire them in more places.

Zach Henderson: Yeah, a lot of libraries have it. I know in Swift, and I think in Java, you can get timers built in and stuff like that very simply, as long as you think about what you want to monitor. For a developer, if they can build the app, they can build some sort of very simple timer into that app very quickly.

Yoz Grahame: Cool. With LaunchDarkly, one of the reasons the name of this Meetup is Test in Production is that we like to be able to partition new releases to an incredibly small testing audience, usually QA, and then do a canary rollout, which is a small percentage of real site users. So how would you go about limiting the analysis or the monitoring, or differentiating in the monitoring, between those getting the new version and those getting the previous existing version?

Zach Henderson: Yeah, that's a great question. So with the idea of end-user monitoring, as you know, there are a lot of factors that we talk about: OS, region, country, time of day, the devices people access it on. Those are factors that real user technologies can capture automatically just by doing an IP lookup against a database like MaxMind or something like that. But then you want to start to think about, when you do all these canary rollouts and releases, is there some sort of way, like a response header, some sort of JavaScript variable or flag or system, that I can tell on the client side if I'm on that canary version or on that newer version? And can my RUM system pick that up and query by it? So the idea is it's as easy as simply passing some sort of feature flag that's available on the client side that your RUM library can intercept and grab.

And then filter by that type of trace ID or A/B test variable or canary flag, for example, and just have that data filtered to that particular value. And then all the underlying components will help you answer: am I fast, am I slow, is there any change in my paint times, my render times, my interactive times, all those kinds of things. And then also take that and apply it to your synthetic data, because the synthetic data is going to be basically a robot that's doing the same thing 24/7. And if there's a big change in that robotic traffic, you know that perhaps some users are actually going to be impacted, because that robot's impacted; there's a much more likely chance.
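
A minimal sketch of that tagging idea: attach the release variant to each RUM beacon so the data can be sliced by canary versus current. How the variant reaches the client (a response header, a JavaScript global, a feature-flag evaluation) is up to you; here a server-set global is assumed and all names are hypothetical:

```typescript
// Sketch: tagging RUM beacons with the release variant so performance data can
// be sliced by canary vs. current release. A server-set global is assumed and
// all names, including the collector URL, are hypothetical.

function beaconWithVariant(beaconUrl: string): void {
  window.addEventListener("load", () => {
    // Hypothetical global set by the server or the flagging SDK during render.
    const variant =
      (window as unknown as { __RELEASE_VARIANT__?: string }).__RELEASE_VARIANT__ ?? "control";

    const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];
    const payload = {
      variant, // canary flag / trace ID to slice the data by
      page: location.pathname,
      ttfbMs: nav ? Math.round(nav.responseStart - nav.requestStart) : null,
      loadMs: nav ? Math.round(nav.loadEventEnd) : null,
    };

    navigator.sendBeacon(beaconUrl, JSON.stringify(payload));
  });
}

// Usage (collector URL is a placeholder):
beaconWithVariant("https://rum.example.com/collect");
```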

Yoz Grahame: Yeah, that's fine [inaudible 00:41:06] as well. When you're doing this, if you do have feature flags or some kind of flag targeting going on, so that you're differentiating between different versions, you want to make sure that you have synthetic coverage for both versions going on at the same time. Which is kind of easy to forget when you're setting that up, and can probably make things more confusing if you're not aware of it.

Zach Henderson: And actually, what we see is that a lot of times it is easy to forget, but there are ways. We have continuous testing, continuous integration, we have pipelines for development. A lot of people nowadays are actually asking, can we control your synthetics or real user monitoring via an API or some sort of integration piece, so that as we build new things, we can have the monitors built to actually match what we're building? So automation is a big piece here. I think of it as: as you develop and test, you're automating those checks, so maybe automate the monitoring you're doing as you push out to production as well.

Yoz Grahame: Yeah. It's about updating your definition of done, really, for any new feature. It's having the synthetic monitoring in there as part of the usual set of tests and all the processes you would normally do when rolling something out. But I suppose the good thing is that while it can sound like a lot of work to get something out the door, that's balanced by all the work you don't have to do later to suddenly fix things that come up out of nowhere.

Zach Henderson: Yup. And it's a great way for the IT org in general to support the business going forward, and to be not just a cost center, but also an enabler of customer success and more profits and more revenue and a better experience for customers overall.

Yoz Grahame: Absolutely. It helps to be able to demonstrate that. I mean, that's actually a good question: how do you demonstrate the value rather than just the cost? Because normally it's hard to evaluate the cost of problems that you don't have. How do you tend to do that? Is it just kind of looking at an industry average and then at the difference, or are there other ways that you recommend for trying to evaluate cost savings?

Zach Henderson: That's a great question. The idea is that no system that you build is ever going to be 100% reliable, or 100% reachable or available or things of that nature. So just know that with the things you build, once you put them out in the wild, there are going to be edge cases that you don't know of, there are going to be things that you can't explain that you only see in production, and things like that. Knowing that, that's why you monitor: because you want to find the things that you don't know about that are impacting your users. In a very simple sense, that's almost enough ROI right there. It's to say, Hey, we know we're building the system.

Zach Henderson: The system brings in revenue, whether it's through ads, or someone's purchasing something, or someone's paying you a subscription to access your service. We need to protect that money for the business. And to do that, we have to make sure that any type of monitoring we do follows these four things: can we ask questions on the data, the ones we know about and the ones we don't know about? Can we trust the data? Can we report on it and visualize it in a way that stakeholders in the business can understand? And can our customer support teams use it to communicate effectively with our customer base? That whole journey, that wheel of developing, testing and monitoring, and building features for customers, is how you propel this to really impact your business and say, Hey, this is actually something that's worth it, and it will actually make us the leader in our space going forward.

Yoz Grahame: Yeah. Especially what you're talking about in terms of the conversation between customer service and customers. One of the really interesting things that came out of your talk is showing that customer service teams can have direct access to what a particular user is experiencing. You can see the highs and lows of their connection. As you said about reachability, the internet is a long chain of requests and hops, and the weakest link in there is usually the one causing the problems. So being able to identify what that is makes a huge difference. I think many of us started in customer service. I certainly did for [inaudible 00:46:10]. And being clueless, going, I know it's working for me, it's not working for you, we're going through all your settings, and I can't tell what's happening. Everything that you can do to make that more certain and provide more information to a customer service rep translates directly to increased customer satisfaction. Which is great. So we talked about edge cases. And my favorite topic with this kind of thing is always: what are the most interesting bugs? What are the weirdest bugs? What's some of the strangest stuff that you've seen working on this?

Zach Henderson: Yeah, that's a great question. So let me actually bring up an example here that we tend to talk a lot about. Let's see. So a lot of companies today rely on a content delivery network to essentially get content closer to the user to improve performance, lower the attack surface for DDoS attacks, and also just take load off of the origins and things like that. A lot of times those systems are great; they work well for what they're built for. But in certain markets and areas they perform completely differently depending on where a user comes from. Again, where you stand on the internet changes how you view it. So a lot of times, one of the things that we help people understand is: is my CDN provider mapping users as we expect, for example?

And so there's a lot of work being done in this space by each of the CDN providers to say, Hey, are we actually delivering traffic as we expect to, because that's how they get held accountable for their SLAs. So let's take a look here and see: Hey, from Boston on AT&T, Akamai, a popular CDN provider, can actually serve your user from Ashburn, Virginia, from Billerica, Mass, from New York City, from Pittsburgh, from Somerville. And a lot of times you're clueless to this if you look at just RUM data or just the data that's running in your data center. And you can start to see, Hey, actually most of the time they serve users from New York or from the [inaudible 00:48:25], for example.

The times when the time to first byte is very high, you can say, Hey, actually, that one time they went to Pittsburgh, a user would have a very bad experience because the network round trip times are high and stuff of that nature. And what's really powerful is that as customers start to see this type of thing, you start to think, maybe I should have multiple CDNs, maybe I should try to steer traffic to a pop or a region or a provider depending on conditions that we get from our Real User Monitoring or from our synthetic monitoring. And so you start to see, Cloudflare only has one pop. Is that an advantage or a disadvantage?

Does that improve performance? Does that improve your network round trip times, for example? Or with CloudFront or other providers out there, are there ways to understand how these systems work? Because again, the internet's a very distributed place, and the thing people think about is: I'm relying on a CDN to distribute my content; is it actually effectively mapping it to my user base, and can I relate that back to the real user traffic I'm seeing and the other analytics that happen on the site? So that's something that we tend to see quite a bit. And it's not a question of which provider is better than the others; it's more about what's best for our business to guarantee the performance and reliability that we need for our users. Did I lose you guys?

Kim Harrison: Hey, this is Kim. Yes, I think we lost Yoz for a second there. Hopefully he'll be back in a moment. We'll give him 30 seconds.

Zach Henderson: Sure. And I actually do have another example I'll bring up. A lot of times, since we're all working from home remotely, we've been really stressing being on time for meetings and things like that. So in my role, I join a lot of meetings and I want to make sure I'm on time. But about a week ago, I was seeing that the clock on my local Mac was actually about two or three minutes slow. And I was like, huh, that's very strange. I was actually joining meetings two or three minutes later than my colleagues and the people I talk to on a day to day basis. That wasn't too big of a hiccup, it was just eye-opening. And so knowing that, and knowing that time is not based off of HTTP traffic, it's not based off of DNS traffic or things of that nature.

It's its own protocol, the NTP protocol, one of the oldest protocols on the internet, actually. The idea is that synthetics can apply to more than just your web traffic or your API traffic. They can apply to things like the time servers that you rely on to sync your servers, or in this case, to sync my PC so I can join meetings on time. And I have a fun example here. I was like, Hey, I'm a monitoring person, I'm kind of a geek, so I did things like point Catchpoint towards time.apple.com and measure the round trip times to it from different cities around the US, and the offset that the synthetic probes were seeing from what time.apple.com said the time was.

I saw during that time window, Hey, from Boston there's a bigger offset for the synthetic agent as opposed to from Vegas or Chicago or Miami or Philadelphia. And so it's like, Hey, now I know, for example, that the time protocol I relied on was maybe having some sort of issue in the Boston region, and I could actually investigate: is it the network, is it some sort of local environment? So these are some fun edge cases that, once you think about how the internet works, can not only improve my day to day to make sure I'm in meetings on time, but also give a general understanding of how you actually deliver for your customer base overall.
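
A rough sketch of the clock-offset idea, simplified well below real NTP: it compares the local clock to the HTTP Date header of a well-known endpoint, which only has one-second resolution and includes network latency, but is enough to spot a clock that is minutes off. The URL is just an example:

```typescript
// Rough clock-offset check. This is NOT real NTP: it compares the local clock
// to the HTTP Date header of a well-known endpoint, which only has one-second
// resolution and includes network latency, but it can flag a clock that is
// minutes off. The endpoint URL is just an example.

async function estimateClockOffsetMs(url: string): Promise<number> {
  const before = Date.now();
  const res = await fetch(url, { method: "HEAD" });
  const after = Date.now();

  const dateHeader = res.headers.get("date");
  if (!dateHeader) throw new Error("no Date header on response");

  const serverTime = new Date(dateHeader).getTime();
  // Assume the server timestamp corresponds roughly to the midpoint of the request.
  const localMidpoint = (before + after) / 2;
  return serverTime - localMidpoint;
}

estimateClockOffsetMs("https://www.google.com/").then((offset) =>
  console.log(`local clock offset is roughly ${Math.round(offset / 1000)}s`)
);
```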

Yoz Grahame: Yeah. That's fantastic. Sorry, my connection disappeared for a couple of minutes there, but thank you for carrying on without me. So one thing I was wondering about that, actually, is for those figures, when it's in the red, does it give you a degree of confidence in, like, the number of users who are seeing those first time [inaudible 00:53:23] issues? In terms of, is it just one user who may have been on a poor cell phone connection or something, or is it something actually happening for a large number of users?

Zach Henderson: That's a great point. I mean, every user is important, but at times you don't want to alert off of maybe just one user having a bad experience. You may want to know about it and be able to drill down into it, but how you operationalize real user data and synthetic data can be really powerful. So what you want to see is, across my overall user base, is that one condition a very common condition, or is it an outlier? And that's where things like those histograms and those heat maps can be really powerful. You can start to see the trend of, is my user base shifting right or left in terms of performance? Let me pin it up here real quick.

Yeah. So you may say, Hey, that one user's really poor performance is up here, it's in the red. Whereas if you start to see more of the traffic shift towards that, that's where you start to say, Hey, this is actually something that we should really take a look into, because that traffic's being shifted towards that long tail, essentially. And this is where things like alert profiles, alert granularity, having the option to really specify different types of alerts off of different types of metrics, can be really critical and really help you get ahead of that. And especially if you're relying on RUM data or APM data or some sort of passive dataset for alerting, basically you're saying, Hey, alert me if 10,000 visitors have this type of experience. That's essentially 10,000 people that you might have made agitated and who may not like your service anymore.
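
A minimal sketch of that alerting idea: page only when a meaningful share of traffic in a window crosses the threshold, not on a single outlier. The thresholds and the data source are illustrative:

```typescript
// Sketch: don't page on one slow visitor; page when a meaningful share of
// traffic in a time window crosses the threshold. Thresholds and the data
// source are illustrative.

interface Beacon { timestamp: number; loadMs: number }

function shouldAlert(
  beacons: Beacon[],
  windowMs: number,
  slowThresholdMs: number,
  minSlowShare: number,   // e.g. 0.2 = alert if 20%+ of visitors are slow
  minSampleSize: number,  // don't alert on tiny samples
): boolean {
  const cutoff = Date.now() - windowMs;
  const recent = beacons.filter((b) => b.timestamp >= cutoff);
  if (recent.length < minSampleSize) return false;

  const slow = recent.filter((b) => b.loadMs > slowThresholdMs).length;
  return slow / recent.length >= minSlowShare;
}

// Example: alert if at least 20% of the last 10 minutes of traffic (minimum
// 50 beacons) loaded slower than 8 seconds.
const incoming: Beacon[] = []; // would be fed by the RUM collector
console.log(shouldAlert(incoming, 10 * 60 * 1000, 8000, 0.2, 50));
```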

Whereas if you think about synthetic, that's where the protective piece comes in. This is where you can say, Hey, for this very important user journey that we have identified in our analytics as being really important to our business, can we proactively get ahead of that and say: if this happens for two robots in a 10 minute window, from this time to this time, and it's this type of change in performance, we should actually take a look at that and investigate it. It may or may not actually impact real people, but most of the time it will help you get ahead of issues that would actually impact real users.

Yoz Grahame: That's vital. Thank you so much. Sorry again for disappearing back there, but thank you so much for carrying on without me. This has been a fantastically helpful look at Real User Monitoring and synthetic monitoring. It was eye-opening to me, especially the reachability aspects; the fact that you can show where in the chain of requests things are going bad is incredibly helpful. And being able to partition that according to when you're doing canary rollouts or other kinds of targeted deploys is super helpful.

Thank you so much for joining us.

Zach Henderson: You're welcome.

Yoz Grahame: That's it from us today. Now that we've delivered two of these, we are making a regular habit of it. So please join us next Thursday at 10:00 AM Pacific time, which is 6:00 PM UTC. Thank you so much, Zach Henderson, for teaching us about front-end monitoring. Zach is the Lead Solution Engineer at Catchpoint. My name is Yoz Grahame and I wave my hands about feature flagging for LaunchDarkly. Thank you all for watching us. Thank you, Zach, for joining us, and have an excellent week. Farewell.

Zach Henderson: Take care, everyone. Thank you all.

 
