skip to content
Home

What the H*ck is Observability?

with Shelby Spees

Modern apps have a lot of moving parts. How can we tell what is happening in our applications and why? In this episode, Shelby Spees teaches us how to use Honeycomb to debug complex systems in production.

Transcript

Captions provided by White Coat Captioning (https://whitecoatcaptioning.com/). Communication Access Realtime Translation (CART) is provided in order to facilitate communication accessibility and may not be a totally verbatim record of the proceedings.

JASON: Hello, everyone. And welcome to another episode of "Learn with Jason." Today on the show, we have Shelby Spees. Shelby, thank you so much for coming on the show.

SHELBY: Thank you for having me. I'm real excited to be here.

JASON: I'm real excited for today. I was telling you before the show usually when I do an episode of "Learn with Jason," I'm somewhat familiar with the material, so I'm just kind of pretending not to know what's going on. This episode though I literally have no idea what's going on. This is a completely new subject area for me. You can tell me anything, and I'm going to believe you.
(Laughter)

SHELBY: I'll be sure to confuse you extra hard then.

JASON: Good. The chat will definitely appreciate that. For those of us who are not familiar with you or your work, do you want to give us a little bit of a background?

SHELBY: Sure. My name is Shelby. I've been working at Honeycomb since March. Before that, I was working on the application side of software development. My job as a developer advocate at Honeycomb is to have people have healthier experiences and healthier relationships with their production systems and make it easier to deliver business value while building software. Observability is one tool in our tool belt to be able to do so. I'm really excited to talk about that today.

JASON: I'm really excited to talk about it because it is something if you follow Charity Majors on Twitter, she has a lot of amazing threads talking about how much of a superpower adding observability is.
What hasn't clicked for me is it is very obvious that observability is not the same as just logging. But I think where the disconnect for me is I'm not 100% sure why it's not the same thing.

SHELBY: Yeah. The textbook definition of observability for software is can you ask me questions with the same data. The thing with logging and with traditional metrics is it's a very static data structure that you can't slice and dice in new ways. Your metrics and logs, if you have tried to debug and do benchmarking on anything application level, you have to be very it is like hard coding the exact axes of data you want to include. You want to send lots and lots of fields with your structured data. You can use structured logs to do this. There are a lot of integrations with observability. My favorite way is to serve framework level integrations that know, for example, what an HTTP request looks like.
You need tooling that allows you to slice and dice that data later on in novel ways. It's not just about the data structure. It's also tooling that you can ask novel questions and start to interrogate your systems and learn new things, even for things you would never think to log for.

JASON: Interesting. Okay.

SHELBY: Good observability tooling supports sending big, wide structure. Charity talks about this a lot. Just send everything. Later on, you don't know what might become relevant. Your tooling allows you to query by all these novel fields that you would never an arbitrary example is I had no idea users whose favorite color is purple were having a slower experience. Something that is so arbitrary, right?
The structured data that you're sending also supports tracing because you can just build traces out of the same data structure. You just add a start time, a duration, and a parent child relationship. You get that nice waterfall view, if you've ever seen that in tracing tools.

JASON: Oh, okay. The way that would work what we're talking about here is instead of the trace being something that comes out of the code, the trace comes out of having sent the data. This would go across application boundaries even.

SHELBY: Yep.

JASON: Like your frontend and API. You could theoretically get a trace all the way through?

SHELBY: Yes. When you make the API call to your backend, you send along some of the trace data so when that gets sent over to your observability tooling, it can reconstruct the entire distributed trace.

JASON: That's incredible.

SHELBY: Yeah.

JASON: The teams that I've worked on definitely did not do that. We got to play this game of bug whack a mole where we would find a bug in the frontend and then go to the back end and be like, did we get an error over here? Let's look at the logs. Is the time stamp about the same? Let's see if we can reproduce this. It felt very, I don't know, guess and hope for the best. That's cool.

SHELBY: Yeah. That's exactly the debugging approach that most software teams have to take. You're sending sort of disparate data. You can't account for like this was one request between the frontend and the backend. Maybe if you're sending really, really rich logs, it's possible, but then you start to have that make these cost tradeoffs. Do we want to keep all these super robust logs? It is really expensive to index and parse all of them. My log aggregation service is charging me so much money versus having data that is just contextual and upfront. You don't have to do all the string parsing because you already have the key value pairs that say, oh, we know what user this was or we know what the request path was or we know what the API key was, if that's something you're going to send. Then you can start to query by these different fields across all of your traffic. Tracing is great for the very deep, okay, I know this is the part of the code that's the problem. I want to dig deep to see what exactly was happening at that point in time. All your context fields are really useful for that, but what if you have questions that are like, where in the app is it slow? Which endpoint is the slowest? Or who is experience the worst latency or the most errors or things like that? That's a question you want to ask across all of your traffic over time. Your context fields allow you to slice and dice in aggregate as well as in these very narrow specific requests. That's what observability is to me. It's being able to seamlessly move between these big broad questions and then narrow down to, okay, this is weird. How often is this weird thing happening? You can zoom back out and test those big broad questions again.

JASON: That's really cool. This sounds like a dream, right? If everybody had observability in their systems, it would be so much simpler to say, hey, we're having a weird production issue. Let's figure it out. Then you could just figure it out.
What do you think has prevented people from kind of adopting this approach? On the face of it, it sounds like it would be really challenging to get right.

SHELBY: Yes. There's two things that are challenging about it. One is the instrumentation side where having good observability sort of requires instrumenting your code. It's been you already instrument your code. People instrument their code with logs, log lines, and print statements and your Console.log. That's a form of instrumentation. But we haven't standardized guidelines for instrumentability. There's a framework protocol and organization around creating this vendor neutral standard for adding telemetry to your systems, adding instrumentation to produce telemetry that works across all these different vendors.
Before, you're using one vendor, and you install their agent to one alongside your app. Or if you want to use their proprietary package to instrument your code, you have to sort of weave that into every part of your code base. It becomes very expensive to add that. Later on, you decide you've got a new contract with a different vendor. It's a very it can be a time consuming process, and it can be expensive on the development side.

JASON: For sure.

SHELBY: I'm really excited about open telemetry. Right now, it is in beta. In a few years, it is going to be the standard everyone is going to expect. Framework authors are going to build open telemetry compatible instrumentation into their frameworks so you can just flip a switch and send it to your vendor.

JASON: This is the project that's at opentelemetry.io.

SHELBY: Yes.

JASON: I'm going to drop that into the chat for people to take a look if they want. This all sounds amazing. What you're doing at Honeycomb is trying to make this more accessible to teams, right? Are you using open telemetry under the hood?

SHELBY: We're not using open telemetry. When we built Honeycomb, open telemetry wasn't around yet.

JASON: Oh, okay.

SHELBY: There wasn't a vendor neutral approach to instrumentation that we could use internally, so we built our own beelines to support the instrumenting of code that we thought would really shine when you send your data to Honeycomb.
What's different about Honeycomb is our backend supports high dimensionality and high cardinality. High dimensionality means you have a data object or structure that has a lot of fields. This isn't what you would send to Honeycomb, but you would normally have an email and a first name and last name and stuff. But if you add user favorite color and user eye color and everywhere you've lived in the last 15 years and how many shirts do you own, what is each shirt that you own, how many colored socks do you own? You start to have all these different fields that maybe it is really important for your service to know exactly what pair of socks your user is wearing today. Those are all the dimensions, all the fields you include with your data.
High cardinality. Cardinality is how many possible values can a field have. For your user email, that could be just anything, right?

JASON: Yeah.

SHELBY: There's millions and billions of emails out there in the world. Something like a build ID is continuously incrementing. You have infinitely probably some limit, but potentially infinite builds that you can have out there. To be able to filter by or query by a specific build ID, it's very, very hard. A lot of traditional monitoring and logging tools don't support high cardinality on the backend.

JASON: Oh, okay.

SHELBY: Conversely, Honeycomb was built specifically to support this.

JASON: Gotcha.

SHELBY: And we had to build our own people say don't build your own database. We call it a distributed columns store. It's not a database. It's a distributed columns store to support querying for high cardinality fields. That's sort of the difference. That's the thing you'll see when we get to I trust we're going to get to the point where you'll be sending data into Honeycomb.

JASON: I believe in us.

SHELBY: Yeah. That's why it's important to think about observability in the unique way when it comes to software. It is more than I just have metrics. I have logs. It is can I interact with the data I'm sending and ask novel questions and learn new things without having to deploy new instrumentation, without having to go in and add new lines of code. That allows you to and this is something our customers do. It's something we do internally. There's some incident or some issue. We go back from a week ago or we go back to last night. The on call person was able to mitigate the issue, but they weren't able to debug it entirely at 2:00 in the morning. The next day, with a full night of sleep and some coffee, we can go back and look at our telemetry data. It has all the context there from the issue back to the morning. We don't even have to worry about trying to reproduce it quite yet. We have everything there from when it happened the first time versus I don't know if you've ever had that happen before where people are complaining about your service, and you're like, well, I think this is the problem. Then you deploy something to try and monitor that.

JASON: Yes. Yes. It's my favorite.

SHELBY: That's exactly what observability protects you from. It saves you from all of that. It like to call it armchair debugging when people are sitting at their desks going, it could be this part of the app or maybe this and that. Then they add some new metrics and add some new log lines and turn up the debug level. It's like totally just a completely different part of the system, right?

SHELBY: Yeah. I'm really excited to just it's like an order of magnitude, more productivity for engineers.

JASON: That's really nice.

SHELBY: It's really exciting and satisfying for us to engage with our systems in production not just when there's issues and incidents, but just learning how users actually behave. There's just so much potential there. I'm really excited because I think this is going to be how we work, how high productivity engineering teams work over the next decade.

JASON: That's really exciting. We haven't even started writing code yet and you're getting a lot of love in the chat. There's love for lights in your background. I'm very jealous of that. That is super cool.

SHELBY: Thank you.

JASON: Taylor says, I always knew open telemetry was good, but all sorts of things just clicked in my brain after listening to Shelby and I'm an even bigger fan. Another says, I should not have used up my brain today. I'm with you. This is a lot, but this is why I was excited to do this episode because I think what we just talked about, like what you just explained is a lot. There's so many moving parts. There's so much to get your head around, but that's what you're working to kind of abstract away with Honeycomb so I, as the developer, don't need to know all of that. It helps for me to understand it, but ultimately what I'm able to do now and maybe we should just switch over and start doing this, but I'm just able to install this thing and I get the benefit without having to become an expert in instrumenting for observability.

SHELBY: Yep. That was exactly my experience. I started following Charity on Twitter in early 2019. I was just so overwhelmed by all this terminology, all these new concepts. I don't have a master's in computer science or whatever. I don't understand this stuff. On my last team, they were Honeycomb users before I started working at Honeycomb. I remember being sort of overwhelmed in the beginning. I don't know what steps to take to do this. Part of my job is making things easier so you don't have to be an expert in all this stuff, and you can just start getting the benefits right away. Let's dive in.

JASON: Let's do it. I'm going to switch over to pairing view. We're looking at a couple of things right off the bat. This is the open telemetry site. I'll drop it in the chat again for good measure. I'll drop in Shelby's Twitter. Make sure you go follow her on Twitter.
We're going to be looking at Honeycomb, and we're going to be adding it into stream blitz, which is my little overlay package that we're looking at now that is actually a little bit broken. I saw people doing boots in the chat. I don't know why it's not working yet. That's why we're doing observability. I don't know what I broke, but we'll find it.
Let's do a shout out to our sponsors. The show today is being live captioned by White Coat Captioning. This is so great. I'm so happy to have that. That is available at lwj.dev/live. That is brought to you by Fauna, sanity, Auth0, and Netlify. Head over to lwj.dev/live to get that in your browser.
I have the Honeycomb site open. Let me drop that in the chat. I have stream blitz. Let's open this code. Is this big enough for everyone to read? Here's some code. Shelby, is that legible for you.

SHELBY: Yeah, that looks good.

JASON: Okay, great. This is our little server. The way that this thing works is it is effectively a GraphQL API that uses TMI JS under the hood to send activity from Twitch. It sends me the chat. If we run commands, it will send the command data. I believe I have another command in here that will get data or that might be built into the UI. That all runs. Then I have the new version of the scenes running here, which is a Jamstack site built in Toast, which is Chris Biscardi's Jamstack framework, which is very fun. That's a thing I've been working on for a while.
The way this works under the hood is we have the server running, which I can get actually, I guess I should ask you what's the easiest way to get started here if we want to kind of dig in.

SHELBY: I'll share the docs with you.

JASON: Okay.

SHELBY: For the node beeline. Beeline is our cute term for our SDKs that hook up with your app and generate telemetry data and send it over to Honeycomb itself.
What you're going to do today is add a package to your node app.

JASON: Okay.

SHELBY: You're going to add an API key to the configuration in your app.JS file. Then you're going to run your app and maybe hit some endpoints and watch data come in. Somewhere in there, before the API key, you need to sign up for Honeycomb.

JASON: Okay.

SHELBY: We have a free tier. You can sign up and use it for free and do some dev and not even worry about that.

JASON: I'm installing the beeline. While we're waiting for that, I'm going to sign up with Google. Let me pull this off screen. I'm just logging in with my Google account if I can find that. There it is. What? Oh, no. Hold on. Got to get Google authenticator out. Hold, please. (Laughter). This one. Okay. I have signed up for Honeycomb. We will call this stream blitz. No, wait. I have to create a new team. I'm going to continue. I'm probably going to have to roll that after the

SHELBY: Yeah.

JASON: stream.

SHELBY: It is easy to change in your team settings.

JASON: In my team settings. Those are team settings.

SHELBY: Uh huh.

JASON: If I just edit, I'll be able to

SHELBY: Delete it from there.

JASON: Oh, I guess I need to create a new one first. I'm going to create a new key that I won't show anybody. We'll call this stream blitz. Save it. I've copied that one, and I'm deleting the other one.
You hackers. You dirty hackers.

JASON: That's exactly what you are. I've just disabled that key and it is gone. Now I'm going to go back to, I guess, the dashboard. That continues to show me the thing.

SHELBY: Now it's just waiting for your data to come in. You can click around, and there's some examples. There's "here's what this page does" sort of thing. You can click the data in button. It will have a spinner, like we're waiting for your data.

JASON: I think that showed me the API key.

SHELBY: Yeah, don't share that. Yeah.

JASON: Let's take a look at that. I can send my Honeycomb API key. Is that what you call it in your docs here?

SHELBY: Yeah.

JASON: I'll call it whatever I want.

SHELBY: Uh huh.

JASON: Okay. There is my Honeycomb API key. It is saved and now available to us. That means that I can do the next step, which is here.

SHELBY: Yeah.

JASON: Top of my app JS. We can make that work. Let me copy this. I'm going to go to my server. Here's my index. Can I put it after dot in dot env?

SHELBY: Yes.

JASON: I don't think I had a dataset yet.

SHELBY: Yeah. When you send your first events, it will automatically create your dataset. You can sort of just call it whatever you want. Usually just the service name. Yeah, exactly.

JASON: Is that okay to tell me which part of the stream blitz stack it is coming from? Is that how you would do it?

SHELBY: There are a few different ways to organize your datasets. If you want to separate them by service, I would do it that way. If you want to separate them by environment let's say you have a development dataset and a production dataset. A lot of teams do that, especially if you're using distributed tracing.

JASON: Okay.

SHELBY: That's actually a thing. Right now, we're in the process of making our onboarding docs a lot more straightforward and give a little bit more guidance around the naming and stuff. Yeah.

JASON: That makes sense. Right now, the service is just for me, so I do everything in production, so I'm only going to have one dataset.

SHELBY: Cool.

JASON: The service name is the one I chose when I created my team.

SHELBY: Uh huh.

JASON: Oh, hey, Chris. Thanks.

SHELBY: This would probably be stream blitz server. That's where you would separate your frontend and backend.

JASON: These are both arbitrary. I can choose whatever I want them to be called, and the API key is what ties it to the actual service?

SHELBY: Yes. You don't actually need to tell it your team name because it gets it from the API key.

JASON: Saying that out loud, that, of course, totally makes sense. (Laughter). Okay.

SHELBY: I love doing this because this is exactly the kind of friction that new users experience and I sort of take for granted. I go, of course, you would name your dataset that kind of thing. Naming is important. As developers, we really care about names we give things.

JASON: Yeah.

SHELBY: It's worth discussing.

JASON: Okay. I've got this, and I'm just going to save it. Should I just run this and we'll start getting data?

SHELBY: Run it, and I think you have to hit your root path or something.

JASON: I'm going to run this.

SHELBY: We can see if it shows up.

JASON: Oh, wait. Am I in the right place? I need to run this server. I think I ran the wrong thing. Oh, yeah. I need to run dev. Npm run dev. What don't you like? Oh, I'm in the wrong okay. How about now? We've got local host 9999. Then I go to the right browser window. I should be able to open up the server here, and we should be able to do a subscription. That subscription will be for a message, and we can do the message name. Let's grab the author. Oh, it has required fields.

SHELBY: Even that should show up.

JASON: Field name. Oh.

SHELBY: Let's see.

JASON: Channel. How about this? Is this going to work? There we go. Now, if y'all chat, it should show up here. This should be the stream blitz service running. If I collapse this error, there it is. This is the running service. What you're saying is now I should be able to actually see data in Honeycomb?

SHELBY: Uh huh.

JASON: Okay. My backend is here.

SHELBY: Go ahead and click on datasets. Hmm.

JASON: Let me pull this off for one second just to make sure that I have the right data. Waiting for data. Okay. Did I do something I clearly did something wrong, but it's unclear what thing. I've also done all sorts of weird things with this app, so it is possible. My app is here, but then I use Apollo server to it gets applied as middleware to the Apollo server, the app does. We're actually doing a server dot listen instead of a standard thing. Maybe I'm out of the beeline happy path here.

SHELBY: Huh. That's certainly possible. If we want to just get data to show, we can create a custom event just somewhere that's going to definitely run when you restart your app.

JASON: Okay.

SHELBY: It would probably look like that. Yeah, beeline. Start span.

JASON: Okay. Then to get that in I need to do a beeline. All right. Up here, can I just do this? Or do I need to include it again?

SHELBY: I think that should be fine.

JASON: Okay.

SHELBY: Yes.

JASON: Then once I have that, I can do this beeline withSpan. Let's have it run. We'll say, hello chat. We can do any arbitrary whatever?

SHELBY: Uh huh.

JASON: Okay. We can do something like return boop. Just return some text. Now that's running. It rebooted for us. Theoretically speaking, that should do the thing. Let me reload the page out here. Did you do anything? My redirect is not doing anything. Is it getting data? Hold on. I'm wondering if it is not loading the environment very well. Honeycomb API key. Did I typo this? Honeycomb API key. It is running.

SHELBY: Uh huh.

JASON: So, that's good, but it's not sending to Honeycomb. It's not doing that because why. Don't know. Am I using the right API key?

SHELBY: Did you copy the one you disabled?

JASON: It's entirely possible that I did. Let's stop and restart. Maybe I'm sending it to the old key. Let's try this again. It says we're waiting for data. Okay. Here's what we're going to do. I think I'm just going to roll this API key at the end. Let's bring this over so we can see what's going on. I'm going to make sure that the API key is coming out at all. We'll just log that. Process dot m dot home API key. Let's check here. We do get an API key, but it is still not showing us data.
Does this return anything if I try to log out the result or anything?

SHELBY: Yes. Although, I'm not a node developer. I was hoping I could lean on you for that part. I'm pretty sure the beelines will return

JASON: It returns the actual thing, like what we returned. That's good.

SHELBY: You might have to start and end the trace. Because node behaves differently from Ruby or Java, some of the beelines you have to when you create custom events, you have to tell it to finish the trace or finish the span. You might be opening the span and not closing it.

JASON: Oh, okay. All right. Let's try that.

SHELBY: Start span. Oh, yeah, you have to do beeline dot finish span, I think.

JASON: Then if I do a beeline dot finish. Start span and then dot finish span. Oh, dot finish span. Oh, okay. I get it. I get it. I'm starting a span. We'll just log that. Then I can finish the span.

SHELBY: For those in the audience who don't know what spans or traces are, spans are the you might need to start a trace, finish a trace, and then are you doing a withSpan or start span?

JASON: I did a start span based on the example that was here. I missed the part where it said finish it, so that's my fault. We've got a span. We've started and finished. Did we get data? Oh, no.

SHELBY: No. I think the span needs to be inside a trace.

JASON: Okay. Trace equals beeline start trace. Do I just start it, or do I have to put something in it?

SHELBY: You can add arbitrary trace level fields, but I don't think it's required. This will make a lot more sense once we see the visualization.

JASON: Finish trace. Trace. Right?

SHELBY: I love this. Yeah.

JASON: Finish trace. Okay. Now we've started a trace. We did a thing. We finished the trace, and now do we get data? Okay. Do I need to

SHELBY: Someone in the chat said your subscription events don't end.

JASON: The trace part though

SHELBY: The trace part starts and ends.

JASON: I can stop it and see if it makes a difference.

SHELBY: Yeah. Should have studied the node a little bit more. This morning, I tested this out on an example app and it worked great.
(Laughter)

JASON: The instrumentation name. Oh, do I need to tell it dataset is optional. If you do not specify dataset, it will default to Node.js. Enabled instrumentations. Oh, do I need to do this?

SHELBY: I think they're enabled by default.

JASON: If you want to disable automatic instrumentation

SHELBY: Yeah.

JASON: Okay.

SHELBY: For the app I did this morning and I might just have you try that because I don't know your GraphQL setup enough to help you, but I did the quick install and I got data in.

JASON: Okay. Let's go ahead and do that just to make sure that we've got something here.

SHELBY: Yeah.

JASON: Where would I find that app?

SHELBY: It's the real-world node express it's not on Honeycomb. It's real world node express. This one.

JASON: This one?

SHELBY: Yes.

JASON: Okay.

SHELBY: That might be a fork of it.

JASON: Thinkster?

SHELBY: Yeah.

JASON: That's the one you want?

SHELBY: Yeah.

JASON: Sorry. I'm between two computers. The chat is not available on the one I'm working on. Okay. Let's get this thing cloned and downloaded. I'm going to do that over here. We'll go back this way, and we're going to get clone. We'll get into it with the node express. Then just open it up and take a look. We have an express app. Looks like we've got quite a bit going on. There's Mongoose and MongoDB in here. Hopefully, it doesn't think I have MongoDB. We'll find out. I probably need to read the instructions. How about that?

SHELBY: Yeah.

JASON: Npm install.

SHELBY: Yeah. Then npm run dev.

JASON: I think I have MongoDB. We'll find out soon, I guess.

SHELBY: Worst case scenario, we do have a play demo dataset at Honeycomb, but I always like being able to poke at the code and look at the tool.

JASON: Okay. I do have that. I probably need to start this Mongo container.

SHELBY: I don't think you need to start that. As long as you have Mongo running locally.

JASON: Okay.

SHELBY: Probably the container has some interesting data in it, like some test data.

JASON: Oh, okay.

SHELBY: This will get data into Honeycomb. There. It works. Let's add Honeycomb to this the way we added it to your other app.

JASON: I'm going to grab this. We can drop it right at the top. I'm doing it like it says in the docs. Okay. This is a Honeycomb beeline. We've got our right key in here. We have our dataset service name.

SHELBY: It's going to know about your write key because it is local to your entire environment, right?

JASON: It's definitely not. We've already blown this environment variable.

SHELBY: Yeah.

JASON: I'm going to have to roll it anyway, so I'm just going to put it straight into the code. We'll deal with that part later. Here's our API key in place. That's all running. If we look at which module is not found? Oh, I have to install Honeycomb. Okay. Let's npm install Honeycomb. Then we'll start it again. Okay. Listening on port 3000. Let's go hit port 3000. Oops. Not whatever that route it was.

SHELBY: Yeah, slash API slash something, but that should show up in Honeycomb. If that doesn't show up in Honeycomb, I'd be really surprised.

JASON: Okay. There's a whole route.

SHELBY: It's an

JASON: Because we didn't stand up the MongoDB Docker image, it's not here, but that's totally fine because it is not for our purposes.

SHELBY: May need to refresh.

JASON: Maybe I set up my API key wrong because I was clicking buttons and not thinking about it very hard. Let me create one.

SHELBY: Oh, you disabled creating data.

JASON: Wait. Did I?

SHELBY: I don't think it said create datasets.

JASON: Oh, what? Okay.

SHELBY: Oh, my gosh.

JASON: Everything was fine. It's just me not doing things right. Okay. Maybe we should just try this one again and let's see if this happens. This is good news. This is because I don't know what I'm doing and not because something was actually wrong. (Laughter). Did we get data? Let's go look. Look at it happen. Oh, my God. Okay. We did it, everyone.

SHELBY: This is so much better. Yes.

JASON: Honestly, what we just did is we proved why I'm not allowed within 1,000 miles of DevOps.
(Laughter)

JASON: People are like, software reliability, get this guy out of the room.

SHELBY: I wore my dev oops engineer shirt today. I don't know if you can see it there.

JASON: Oh, that's a great shirt.

SHELBY: Yeah. Firebase sent it to me. I got it from one of their events.

JASON: Excellent. I am so happy that that was just me doing something wrong and not some bizarre bug. This came out of stream blitz and not out of that demo app that we made. We should be able to look at real data here. I'm going to stop that Thinkster app. We'll just hang out on this one. Now that I have some data, what would I do I guess, where do I start?

SHELBY: Yeah. In this view, this is the dataset overview for your stream blitz prod dataset. At the top, you can see the visualized group. That's your query builder. Below, those are the suggested queries for this dataset. Since you don't have a lot of data yet, we're not really giving you interesting queries yet to try, but here we can see overall distribution of latency, how slow are things you have set. Go ahead and click on that where it says heat map, the graph. Yeah. You should probably also close the sidebar, the dark gray sidebar, just to make it a little bit easier for people.

JASON: How do I do that?

SHELBY: If you scroll down it might be because you're zoomed in so much.

JASON: Oh, there it is. Yeah, I do have the site, but I'm on a small resolution too.

SHELBY: You see on the very right hand side those three tiny little boxes? Those are the events that you sent. For the heat map, it is sort of when something was sent for this period of time and how slow that was.

JASON: Okay.

SHELBY: I also like to do in the top in the visualize box, I'm going to have you type in something.

JASON: Type in where? Sorry.

SHELBY: Where it says visualize in the query builder. Scroll up.

JASON: Here?

SHELBY: Yes.

JASON: Okay.

SHELBY: Then you're going to type or just click on count. That's just the number of events. You can click run or do a shift enter, whatever is easier for you. Now you should see two graphs.

JASON: Oh, cool. If I run this, now chat events and stuff should go through. I'm curious if the chat events are going to get traced. That'll be really interesting if it just works. Chat, where are you when I need you? Here we go. Yes, you should absolutely be spamming chat. That is correct. Okay. We've got some data coming in. If I rerun the query, let's see what happened. We got a bunch of events.

SHELBY: Cool. Can you scroll all the way up and click where it says "last two hours"? Then just click on "last ten minutes." Let's just click on one of those. Any of those circle dot things. Perfect. That takes us through to a trace. This is a trace with a top level root span and node child spans. This tells me the way your app is set up doesn't use our auto instrumentation, which is probably the case. I've had this happen with the express app that I was doing, but it was because I was hitting the root path instead of the API slash articles or whatever. The articles looked great, and then the 404 from the root path was not really interesting. You scrolled down the fields in the sidebar. Those were all the context fields I was talking about earlier. We're already sending them with your HTTP requests. Things like user agent, things like request path, which are handy to query by status code.

JASON: I'm already thinking of things that would be super useful. For example, we can see this was sent over HTTP. We can see this was sent on local host. If I'm debugging production data, I would immediately strip all of that. If somebody is running a local server and trying to hit our APIs and getting weird bugs, they're not using our tools anymore, so I can kind of ignore that in my, oh, what a weird edge case. If you're hacking our system, whatever. Get out of our data. (Laughter).

SHELBY: That's actually a thing that comes up with a few of our users who have pen testers. They'll see the pen testers are just sending junk data. It will show up in Honeycomb as part of the fields. Okay. We need to take away that person's access or whatever. It's really clear that's what's happening. Yeah.
We can go back to the query view, query results. This is the trace view. That's the query results view. We can rerun and just see. A lot has happened since then.

JASON: This is the kind of stuff where I need to instrument this for real because I think I get odd time outs and all those good things. Yeah.

SHELBY: We can start adding deeper traces and custom fields.

JASON: Yeah, I would love to look at how we could do that, especially when we get into more custom data. For example, we have this TMI JS chat bot. It does something like we listen to connection states and stuff like that. You end up with like one client per connection. The way that I'm doing this now, I think it works. Not super clear if I'm duplicating this and if I'm creating a million TMI clients. The flakiness points to the fact I'm doing something wrong, but it's been very difficult to figure out what or why it's being flakey. This is being demonstrated right now on the overlays where Nikki's comment is there. I don't know why that one showed up and none of the other ones showed up. It's completely unclear why that happened and what's happening. (Laughter).

SHELBY: One thing I want to make sure first is that the top level trace is being generated for this part of the app.

JASON: Sure.

SHELBY: Do you know how to hit that and make sure it shows up?

JASON: This one definitely should have already happened.

SHELBY: Oh, okay.

JASON: Yeah. Where would we look for it? Can I do that through

SHELBY: If you scroll back up, I would just do "where" or even a "group by".

JASON: "Group by."

SHELBY: Looking through your fields, it might be this is the auto instrumentation. It would be hopefully something that's already been set.

JASON: Oh, yeah. I just did arbitrary stuff.

SHELBY: Oh, okay. Those are the events I thought it was also getting the chat events as well.

JASON: I think what it is getting with the chat events is whenever a request comes in through the let's see how it is going to work. When a command comes in, it will send off a request to load any commands that are registered with it. And then it will execute that command, which it sends out. I think we might be getting some GraphQL is getting data in and out through the subscription. I think that might be getting somewhat auto instrumented, but I'm not 100% sure.

SHELBY: Okay.

JASON: This is the part where I've kind of melted my brain because we're doing WebSockets and express and GraphQL and an IRC engine basically.

SHELBY: One way to check that is to see the way I would check that is to see if in Honeycomb all your events have that hello field.

JASON: Okay.

SHELBY: I would just group by field one and then hello.

JASON: Okay.

SHELBY: I'm not even sure if it is sending that part. It didn't like that. Go ahead and remove the custom trace you added.

JASON: Remove custom trace.

SHELBY: Yeah, from your code.

JASON: Okay. It's out of there.

SHELBY: Yeah. We'll see if new events show up from the auto instrumentation.

JASON: Okay.

SHELBY: That'll tell me whether we need to do a whole lot of custom stuff or just a little bit of custom stuff.

JASON: We have restarted this. I'm reloading the page. Let's run just a regular query and see what happens. I'm going to get the channel, and we'll get the description and the status and the stream. Just trying to get some of the data here. I need to actually include one. This is data about the stream itself. Here's where we are right now. That should have given us some data, so let's run it. Here are

SHELBY: There. Got some new stuff.

JASON: Looks like a bunch of new things happened.

SHELBY: Let's do a group by request dot path. Request dot query should be your GraphQL query. Okay. That's fine. Those are old analyses.

JASON: Maybe I need to run it again.

SHELBY: Yeah. Rerun the query. Yeah, go ahead and click on one of those, any of those events in the graph. We'll just look at the sidebar and see what fields show up.

JASON: GraphQL.

SHELBY: It is sending data because that was from a minute ago.

JASON: It looks like the query, I think, is maybe not getting auto picked up because I think this would just fail if that was the case.

SHELBY: Yeah.

JASON: Yeah, okay. Run that again. Go back here and run that query again. These happened at what time?

SHELBY: Just now.

JASON: Just now. This trace had multiple spans without parents.

SHELBY: Beautiful. Oh, that's perfect. What this tells us is that all of this happened in the same request, in the same trace with the same trace ID, but there's something a level above that that's not getting included in the trace.

JASON: Okay.

SHELBY: Unless it is doing something asynchronous. Sometimes when it is asynchronous and you click on the trace too quickly in Honeycomb, it just hasn't gotten all the data yet. There might be some earlier

JASON: Definitely possible that that's the case. It's not complaining anymore.

SHELBY: Oh, beautiful. Yeah. There you go. There's a little bit more interesting trace.

JASON: Yeah. This ends up going off to Twitch. That makes sense because we're hitting the API for that. We're hitting yeah, good. Look at it go.

SHELBY: Yeah. Can you think of you know where this happens in the code more or less. Can you think of any custom fields you would want to add to answer questions that you might have about this part of the app?

JASON: Yeah, we can do that with where's Twitch chat? Where do I actually send this? I need to get my Twitch data. I'm doing that in my schema probably. That happens here, so I would want to nope, not in there. Get Twitch. Okay. In here, we can track by which user. That would make sense because then we can see who is requesting what.

SHELBY: Uh huh. Let's see how to do that with the node beeline. It would probably look like for adding context to an existing span, beeline dot add context.

JASON: Beeline dot add context.

SHELBY: Yeah. Let's try that and see.

JASON: Okay. Up at the top, I grab beeline. I guess we can put that here. Then I get beeline. Down with my get Twitch, I would want to do beeline dot add context. We can say Twitch username is username. That seems right.

SHELBY: Yeah.

JASON: Oh, no. I did it wrong. Beeline dot custom context dot add.

SHELBY: No, I think add context is right. The other is a trace level field. You can do either.

JASON: Oh, I got it.

SHELBY: Yeah, cool.

JASON: Okay. I've added context. Now we should get a Twitch username. Let's make sure I'll stop and restart it just in case. Let's go out here, and we'll rerun this channel. That did it. Then in here I'm going to go back this way, run the query again. Here are our new ones, I think.

SHELBY: Yeah, those are new.

JASON: Yes. Then we get our Twitch username. Twitch username. Aha! Beautiful.

SHELBY: Awesome.

JASON: Oh, wait.

SHELBY: A group by, even, group by Twitch username.

JASON: Group by Twitch username. This might have been working before, and I just needed to refresh the page.

SHELBY: Yeah.

JASON: Yeah, look at it go. This is really cool. If someone else hits this for a different channel and I think we can do this by looking at Twitch dot TV. We can see who is live right now. Let's look at this one and make sure it is muted. We will make a request for this channel. Okay. Then we can come back in here and rerun. Now we've got data for both. That's really cool.

SHELBY: Uh huh.

JASON: That's really handy from my standpoint because I'm already thinking of things that are going to be really useful for me because I have different parts of this app that I've built. It would be interesting to see what are the ones that are actually where are things failing because clearly something is failing with different parts of this, which is why my boop drop doesn't work anymore. I built a thing where every time someone uses the boop emoji it is supposed to fall from the top. That's started failing. Eventually, I've got to be able to fix it. As you were talking about before, maybe it is this. Let's deploy it and see what happens. I can already see how much this is going to help me.

SHELBY: Yeah. I'm really excited that the auto tracing worked for you because that is the experience with beelines and open telemetry with popular frameworks like spring and rails and express and GraphQL. It builds the traces for you so you can just go in and add fields like the one you just added that's relevant to the context of that part of the code.
What I love about beeline dot add context or add field is it is not more complicated than adding a print statement or something. You're already thinking about what your code is doing, so you already know, oh, this is probably going to be important. Yeah. That's really exciting for me. I'm glad to see you're already it's clicking and the fireworks are going off from different parts of your app.

JASON: Check this out. Here's a thing that I know is going to bother me. Can I put in arbitrary I guess I would put in whatever message.

SHELBY: Yeah.

JASON: Then I can say new channel created for or I guess connection. Drop this in. Whatever. We can do all sorts of things here, but I'm thinking already. This is an issue I've got, which is this is only supposed to run once, but I'm not sure if that's true. I can come out here and go to my subscription and then run this. Theoretically speaking, now when anyone chats, chat now that we've got somebody chatting, thank you for bailing me out on that we can come back and run this query. Now I want to group on message. Oh, wait. Has it not shown up yet? These are interesting too because I need to figure out they come in totally without other information, which leads me to believe these are the TMI events. What about this one, this long one?

SHELBY: You can also go to the traces tab in your query results. Then scroll down.

JASON: Ooh.

SHELBY: It will show you the shape of the traces and the ten slowest traces in your time window.

JASON: Okay.

SHELBY: By clicking on the little icon or the trace ID link, it's

JASON: Yeah. These ones are all like one thing.

SHELBY: Except for the top, which is span. There you're probably going to have to go in and add some custom spans to wrap around that part of your code.

JASON: Yeah.

SHELBY: But you've created a custom span before.

JASON: Yeah.

SHELBY: That should be pretty straightforward.

JASON: It's still really interesting. The fact that this is possible at all is really interesting to me. If I want to do a custom span now, I can go down to this part here. Adding spans, and it was this.

SHELBY: Yeah. Start span.

JASON: Go down in here.

SHELBY: What it usually looks like is the start span is before that part of your code and the finish span is after. It wraps around the entire block of code.

JASON: Got it. Okay. Then I can put this here. Then finish the span. This kind of does its whole thing. This isn't a great example, but we can put it in anyways. Do I have to start a trace?

SHELBY: No. If you saw the top level span for this part of the code and you just didn't see the nested spans, then it should already be inside of a trace.

JASON: Okay. Okay. I'm going to refresh this page. I'm going to run the query, and we're going to look for message. Still not there. Is this running? I think I might have just not started it. Okay. Let's run the query again. Now we've got some of these, so I'm going to refresh. See if I can get message.

SHELBY: Those are new.

JASON: Still doesn't like my it's also entirely possible that this part of the code is just straight up doing something weird because the whole thing happens this is the part I get confused by because I'm going from point A to point C back to point B and then looping around to point A again.

SHELBY: Yeah. I would say if we're not sure, then let's just start a trace around this span. We can do trace because it might not be picked up by the beeline automatically.

JASON: Okay. I'm starting my trace. Then beeline finish trace.

SHELBY: Your code isn't inside your span.

JASON: Oh, does it have to be inside the span?

SHELBY: Yes.

JASON: Oh. Okay. Okay. That whole thing is hmm. How am I going to do that? Let's see. Let's do this. This is just going to return whatever. The trace is basically like a pass through.

SHELBY: Uh huh.

JASON: Whatever I return before is what it will return if I wrap the thing with it?

SHELBY: If you look in the beeline docs I just want to make sure we're on the same page the adding spans to a trace or starting new traces section.

JASON: Okay.

SHELBY: We start a trace, and then we start a span. You see how a little bit below that yeah, add the fields. You're doing some file writing in your example. That's just where your code would live or the withSpan for asynchronous stuff.

JASON: Okay. I think I need withSpan because this is asynchronous. We're going to go beeline withSpan. Then I just return this. In here, I can I don't know how that is going to work, but we will cross that bridge when we get to it, so let's get this part out of here. Get this part out of here. Instead we need to finish this around the width filter, which is here. Width filter. We'll call this subscribe. Then I need to figure out how to get this part. Here's our variables. Then we create our chat bot. Good. Then we have our payload and our variables. Good. Then I did this in the wrong place. What did I get wrong now? Oh, boy. Payload and variables. Unexpected token, Twitch message. Okay. I'm going to get here eventually. We've got

SHELBY: Does it make sense why you wrapped the span around your code though? That's what gives you the start time and the duration for that chunk of code.

JASON: Yes, it does. It does make sense, and I'm just trying to figure out

SHELBY: Where the curly braces go and the parentheses.

JASON: I did it. That should have worked. If I look at this, we should see that it's done that. Assuming all went well, then I should be able to go here and run this again. Let me reload the page so we get a fresh one. Hey, that worked. If we come back in here and we rerun this, we see all sorts of new traces. Let's do I'm going to reload. I'm looking for GraphQL I know that you're here. Where did you go? Let's look at the traces down here. All of these are like the same thing. These are requests. These aren't my GraphQL things. I am unclear on why hmm. It might be GraphQL subscriptions do not play well with this, but I'm still super confused as to how this could return and function and not log anything. Do I have to end the span with a withSpan?

SHELBY: I don't know. I don't think so. I'm so used to living in Java and Ruby and Python where I'm rarely doing this asynchronous stuff.

JASON: Yeah. It might be I'm not remembering how node works.

SHELBY: I think trace withSpan is running.

JASON: It says withSpan, object, and then a function that returns. Okay. I still need to start a trace.

SHELBY: I think you might need to start a trace still.

JASON: Okay. So, let's do that. How do I do that? The async stuff is hard.

SHELBY: Uh huh.

JASON: I'm returning this object, which has the trace inside of it, and that's being exported. Maybe then, when I get my resolvers create resolvers. I'm calling that here. Maybe I beeline start trace and beeline finish trace. And I need to capture that. Whoops. This one.

SHELBY: The other thing I'm wondering is if the way you're adding the custom context, the field that you add, if you need that to happen alongside the code instead of at the beginning. Of your span.

JASON: Say that one more time.

SHELBY: Your beeline withSpan. I think your GraphQL operation field I'm not sure it goes there.

JASON: Should I add it as context you mean?

SHELBY: Yes.

JASON: Beeline add context.

SHELBY: Uh huh. Yeah. Let's try it out and see if that shows up.

JASON: Beeline is not defined.

SHELBY: Oh, that's over there.

JASON: And we're running. Refresh. Here we go. Start the subscription. You got a fan in the chat. What's up, Java grunt. Let's run this query again. We've got things happening. Let's see if we can get GraphQL. Oh, it did find span field this time.

SHELBY: Interesting.

JASON: Oh, wait. It found it, but it didn't use it.

SHELBY: Right. I think I have to do a remedial course in the node beeline.

JASON: Getting into subscriptions may have been the wrong thing to try right out of the gate just because this is definitely the most confusing part of building servers, when you get into any kind of realtime stuff. Probably this is just my fault for doing way too much stuff here. Maybe what I can do is move this trace down. Yeah, maybe that'll help because then it will run all of this stuff.

SHELBY: Yeah. You want to start your traces maybe I should have said that earlier. You want to start your traces at the earliest possible point when the code is running.

JASON: Oh, okay.

SHELBY: You want to cover all that ground.

JASON: Maybe what I'll do then is put the trace up here.

SHELBY: And then also your root span, you want it to wrap around that early point in the code because each span contains its children.

JASON: Oh, okay. Okay. Let me get this out, and then we'll get this out. I missed the thing. What did I miss? This part. There we go. All right. We'll take that part out. Instead of doing it this way, we will maybe just do it this way. The whole app now runs inside of this main span. I will finish my trace outside of it. All of that is async. I don't know why that is async. I don't know. Let's try it and see what happens. Oh, my goodness. We are so over time. We've got to wrap this on up. Let's do this. We have done this call. Now I want to see if GraphQL operation got what we're finding is that I have a very complicated thing to instrument here, but what we found that was really nice is for the happy path of doing standard GraphQL queries this works so well out of the box with no issues. We can just say Twitch username equals JLengstorf. We can run this query.

SHELBY: May need to expand it out, yeah.

JASON: Is it not quoted?

SHELBY: Yeah, remove the quotes maybe.

JASON: There it is. No quotes.

SHELBY: Yeah.

JASON: Then we can this just works, and we get this beautiful trace where everything immediately shows us what happened, where it went, what it took, and all I had to do to make that work was install the beeline. Everything else that we did today was me trying to instrument GraphQL subscriptions, which is async on top of WebSockets on top of all sorts of things.
With that, Shelby, where should people go if they want to learn more?

SHELBY: Follow me on Twitter. The Honeycomb website has lots of resources, especially the blog. I help run the Honeycomb blog. We'll be posting more soon about introductions to observability. You've inspired me now. I'm going to sit down and really learn how to instrument asynchronous node apps. I might even just do a stream where I try to do exactly what you're doing today on similar application or even on your code.

JASON: You do that on Twitch?

SHELBY: Yes. Just my first and last name, Shelby Spees.

JASON: Excellent. My shout out command is broken. Or wait. Is it? Did I just do it wrong? There it goes. At least something I built works.
(Laughter)

JASON: Shelby, thank you so, so much. One more quick shout out to our sponsors, Netlify, Fauna, Sanity, Auth0 for making it possible for White Coat Captioning to be able to do the captions for us. Check out the schedule. We have Monica Powell to teach us about web mention, which is really, really exciting. Next week, we have Ben. Next week is the Jamstack conf. If you have not gotten a ticket for that, it's free. It's fun. It's going to be amazing. It's going to be a good time. Make sure you get a ticket for that. Yeah, go get that.
Chat, as always, thank you for hanging out with us. Shelby, thank you so, so much. Chat, stay tuned. We're going to raid. We'll see you next time.