DOS 2021 Coaching Program Week 3 Kurt Andersen

Published: Mar 04, 2024 Duration: 00:25:27 Category: Science & Technology

All right, well, welcome to week three of the DevOps Online Summit coaching program. We have with us Kurt Andersen from Blameless. Kurt, maybe give a quick synopsis. I know you've been on the summit and we've talked to you a couple of times, but just so people know about you and about Blameless.

Sure. I'm kind of the head of strategy for Blameless. We're focused on building an end-to-end SRE platform, and I'm delighted to talk about that, but I think you probably want to talk about other things. Previously I was with LinkedIn, and I have done reliability engineering for other organizations prior to the term becoming more widespread, we'll put it that way.

So, Fee, do you have any questions for Kurt, or any topics you want to bring up?

Nope, not really. Yeah, because, Kurt, you do a lot of work, so talk to us a little bit about... I know you mentioned you do strategy, and you've been in the industry for a while, so you've seen the SRE role evolve. Talk to us a little bit about how that's gone through your career.

And, I guess, Fee, do you do SRE, or are you more of a DevOps role, kind of coming at it from the DevOps side? I'm just curious.

So what's the difference between SRE and DevOps? I think this links back into our conversation last week, Tom, with Andy.

Yeah, well, that's certainly a loaded area. [Laughter] I'm glad to see you jump right into the meat of things. It's mostly a matter of emphasis, I'd say. The two constructs are compatible, but they approach the space from different sides. DevOps tends to approach things from the developer's side, moving toward production. SRE starts with the customer that's using the product, the platform, the site, whatever it is, and moves back through production and asks: what do we need to do in order to make that customer experience awesome? We roll that up into "reliable," but that's an oversimplification. Obviously the two mesh in the middle, because everybody's part of the story. As we say at Blameless, reliability is a team sport: you need all people pulling their weight to get it over the finish line and give that awesome experience to the customer.

So what are, like, one or two things that you do day to day as part of the SRE role?

Well, as much as I'd like it to be otherwise, responding to incidents is a pretty common aspect. But then, to counterbalance that, it's a matter of thinking about how your system is used by the customer, how it serves the customer, and how to architect it in a way that mitigates problems. Ideally you structure your system so that you have as few incidents as possible, while at the same time recognizing that incidents are inevitable, so you want to have the skills in place to respond to incidents quickly and effectively. My CEO was showing a GIF earlier today of the five-second pit stop. I don't know if you've seen it, but it is truly amazing how this group of twenty-ish people swarms the Formula One car, and in five seconds they've changed all four tires, refueled it, and it's out of the pit. That's the kind of solving incidents quickly and efficiently I mean. It takes a lot of coordination and a lot of practice for all those people to get everything dialed in to the fraction of a second. We're not talking about exactly the same thing with IT incidents, but it's still about people having facility with the tools and with each other.
You need a collaborative, psychologically safe environment where you can respond to the incidents that are going to happen, while at the same time doing everything you can to prevent them in the first place by architecting your system well.

Yeah. So you just talked about tools, right? What kinds of tools help the SREs at Blameless to, like, resolve an incident in a few seconds?

Well, at Blameless, what we have chosen to do is focus on what we call response orchestration. A lot of people use Slack, so we started with Slack as a chatbot, because a lot of people just spend their days in Slack or email. We focused on the chatbot as the place where you can kick off the incident, declare the incident, and then coordinate the work of all the responders. We build in the incident command framework, as it's been adapted into the IT space from emergency response. You've got an incident commander who helps direct the crew, and then you can build out greater or lesser complexity depending on the magnitude of the incident: you can dispatch individual teams into their own avenues of work to investigate and remediate, or you can just have a simple three people together who address the issue and are done with it. And then we capture all of that as it goes through the process of resolution, so that it feeds forward into your retrospective process, rather than you having to go back and piece it together, cutting and pasting from a bunch of different places, to build your timeline and talk about what you can learn from the process you went through. So that's the starting point, tooling-wise, of what Blameless did.

And then, just recently, at the beginning of this month... oh, that was just last week. Time flies when you're having fun. No, I guess it was two weeks ago. We announced the GA of our SLO product. For SLOs you start with the user journeys: what somebody's trying to do on your product. Maybe buy a toy for their cat, or buy toilet paper if it's a pandemic, right? And you identify indicators that say whether or not they did that successfully. Ultimately you want your indicator to strongly correlate with "the customer is happy" versus "the customer is not happy." Then, based on that indicator, you can say, okay, what fraction of the time are customers happy? You measure that, you observe what's going on, and you look for ways to improve the happiness score to an appropriate level of reliability. Because 100 percent isn't the right answer: it's going to cost you an infinite amount of money to have 100 percent reliability. So you want to balance the trade-offs of what it's going to cost to achieve a certain level of reliability.

That's true, yeah. So, sort of... I lied that we don't have SRE. We do have Slack channels, and we do have what we call production support and admin support, and we leverage Slack as well. Slack is our main focus for everything, right? If a customer reports a problem, it goes through Slack. And we love Slack so much because, you know, if six months from now there's a problem and somebody says, oh, I think this happened before, we can just go back to Slack, and there we go, the whole conversation is there.

Yep. So that's nice.
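To make the "what fraction of the time are customers happy" arithmetic Kurt describes concrete, here is a minimal sketch in Python. The 99.9% target, the 30-day window, and the request counts are illustrative numbers, not figures from the conversation:

```python
# Minimal error-budget arithmetic for a request-based SLI/SLO.
# All numbers are illustrative: a 99.9% availability target over
# a 30-day window, measured as good_requests / total_requests.

SLO_TARGET = 0.999           # fraction of requests that must be "good"
total_requests = 10_000_000  # observed over the 30-day window
good_requests = 9_992_500    # requests where the customer was "happy"

sli = good_requests / total_requests       # observed reliability
error_budget = 1.0 - SLO_TARGET            # allowed failure fraction
budget_spent = (1.0 - sli) / error_budget  # share of the budget consumed

print(f"SLI:          {sli:.4%}")          # 99.9250%
print(f"Error budget: {error_budget:.4%}") # 0.1000%
print(f"Budget spent: {budget_spent:.1%}") # 75.0%
```

The point of the last number is the trade-off Kurt mentions: once most of the budget is spent, the team slows risky changes rather than chasing 100%.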
And then there are other tools. I mean, we're trying to be an SRE platform; we're not trying to replace everybody's monitoring tools, their observability tools, or their ticketing tools, but we want to integrate. So we integrate with Jira, and we integrate with monitoring tools, because some of the measures you use in your SLIs come from existing monitoring tools. We're just trying to put it all into the context of understanding the system. Then it bubbles up: you can do it on demand, or build charts of different areas of your product and how the reliability is looking for those areas, and you can go to whatever level of the administrative hierarchy you need to, for the point of view that they care about.

So, about debugging issues in the production environment: do you have any advice for us SRE people? Advice on, you know, techniques and stuff like that?

Do you use feature flags?

I do. It's funny you're asking. Yeah, we do have feature flags, and I'm, you know, on the DevOps side. We continuously deliver features into production, and if everything's not ready, the flag is turned off; then, whenever it's ready, we turn the flag on.

Yeah. I think, honestly, good DevOps practices really help. If you've got a strong deployment pipeline where you can deploy and undeploy, or roll back, or you do blue-green deployments, different strategies apply to accomplish the same idea: if you put something out there that's bad, you want to minimize the time period in which it affects people. Feature flags do that; deployment agility, I'll call it, does that. And then good monitoring. Make it easy for your developers to instrument their code and stream the metrics to the right place without them having to think about it every time. Instrumenting code should be something developers don't have to think about, except to the degree of: hey, I'm doing something unique, how do I effectively expose that? But once they decide the measure they want to kick out, or the logs they want to kick out (use structured logs if you can, that's another big thing), then collect that data. Different teams do that in different organizations. But those are the kinds of underlying good practices and good skills that help everybody.

Yeah. So we do have blue-green deployment. We're on AWS, by the way. We use CodeDeploy, and for the pipeline we use CodePipeline, and we're fully CI/CD, so when the pipeline is done, that's the code that gets deployed into production right away.

But how quickly can you roll it back if it's bad?

If it's bad, there are a few things we can do. We have like four or five types of apps that we leverage, right? For AWS Lambda, there's a versioning system for Lambda, but we're confident in the build, so rollback means we just rebuild with the previous version. For the applications running on EC2 instances, rollback means I redeploy the previous CodeDeploy revision. And then for some other things, you know, Node modules and stuff like that, it's just redeploying, rebuilding. So really, we're not going to have, like, a very strict rollback to a previous version; we just move forward.
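For the Lambda case, here is a hedged sketch of what version-based rollback can look like, assuming deploys publish numbered versions and route traffic through an alias. The function name "checkout" and the "prod" alias are hypothetical; Fee described rebuilding the previous version rather than repointing an alias, so this is one alternative approach, not their setup:

```python
# Sketch: rolling a Lambda alias back to the previously published version.
# Assumes deploys publish numbered versions and serve traffic via a "prod"
# alias; function name and alias are hypothetical.
import boto3

lam = boto3.client("lambda")

def rollback(function_name: str, alias: str = "prod") -> str:
    # Published versions are numbered strings ("1", "2", ...); "$LATEST" is unpublished.
    versions = sorted(
        (
            v["Version"]
            for page in lam.get_paginator("list_versions_by_function").paginate(
                FunctionName=function_name
            )
            for v in page["Versions"]
            if v["Version"] != "$LATEST"
        ),
        key=int,
    )
    current = lam.get_alias(FunctionName=function_name, Name=alias)["FunctionVersion"]
    idx = versions.index(current)
    if idx == 0:
        raise RuntimeError("no earlier published version to roll back to")
    previous = versions[idx - 1]
    # Repoint the alias; traffic shifts without rebuilding or redeploying code.
    lam.update_alias(FunctionName=function_name, Name=alias, FunctionVersion=previous)
    return previous

# Usage: rollback("checkout")  # repoints the prod alias one version back
```

The design point is the one Kurt raises: alias repointing makes "how quickly can you roll it back" a seconds-scale operation instead of a rebuild.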
Okay, so how long does it take, though? I was reading a Twitter thread the other day that was making the argument that you should strive for strictly less than 15 minutes from code commit to production, and ideally even more like 10 or less.

Well, we're shooting for that 10 or 15 minutes, but right now the whole pipeline takes about two hours, and we're working on it. Because, you know, when the project has just started, of course you have fewer tests than when the project is more mature. So we're looking into a tool called Launchable. What we identified is that most of the time, when your end-to-end tests run, like 80 or 85 percent of them are wasted, right? You change one line of code, you run the full suite of tests, and 85 to 90 percent of the tests are wasted. Launchable runs machine learning on your previous test results and then suggests: you changed one line of code, so here's, you know, the 10 percent of the tests you can run to be confident that your code will work. So we're trying to use that. But right now our pipeline, from the time the PR gets merged on GitHub until it's deployed to production, takes about two hours.

Okay. I hadn't heard about Launchable before.

Yeah, we had a long talk with them, and Tom probably remembers: the guy who created Launchable is also the same guy who created Jenkins.

Oh, okay.

So, what tools do you use for monitoring and instrumenting the code? We talked about this last time, with Dynatrace and Datadog. I wonder whether you use any other, different tools for monitoring.

Well, the big ones, at least the ones we integrate with out of the box, are New Relic and Datadog. Some older, or more established, enterprise companies might use something like AppDynamics. They have their own quirks, I'll say; each of those has their own quirks. And then there are the newer observability-platform folks, like Lightstep, which just got acquired by ServiceNow, and Honeycomb. And there's... OpenTelemetry is the framework, but there's another open-source solution whose name I can't remember.

Yeah, OpenTelemetry. I work with the whole ecosystem, with Grafana.

Yeah, OpenTelemetry is sort of the communications protocol, but there's another competitor to Lightstep that's an open-source solution, and I can't remember the name of it right now.

Okay. So this forum is really good for getting to know, you know, the tools people are using. I learned a lot from this.

I'm glad.

Yeah. So I want to come back, Kurt. Fee started us off with a tough question, talking about SRE and DevOps and how they're different. As you look at those roles, and I appreciate how you framed it, with DevOps coming from the developer side and SRE looking more at the customer experience, the customer journey: is there a different mindset? It would seem like there would probably be a different mindset for each, and probably some different skills, if we compare those two roles. I just wanted to ask you that and see what you think.
Or maybe there's more overlap than I'm anticipating.

Well, I think it... I'm trying to figure out the right way to phrase this.

I hope I'm not trying to get you in trouble, Kurt. [Laughter]

No, no, that's okay. It depends on how narrow or broad your perspective is, or how short- or long-term your perspective is, depending on how you want to phrase it. If you look at some of the work that, for instance, Gene Kim promotes around DevOps and value streams, again, they're coming from the developer side and looking at how you get value to the customer, in that directional sense. They're concerned with more than just good software-development-lifecycle practices, more than just continuous integration and delivery practices, but they don't necessarily put a lot of emphasis on getting the value into the hands of the customer. They do have the perspective that that's important. But a lot of DevOps practitioners are concerned with getting the code from the developer into production; basically they focus on the CI/CD pipelines and kind of bound themselves in that space.

SREs, in turn, can sometimes get into what I call an anti-pattern: a "guardians of production" mindset, where the customer is king and we dare not interrupt anything for the customer. It tends to be a little more rare, but that could simply be because SRE hasn't been around as long. And again, it depends on the breadth of perspective: if they're looking at baking reliability in from the point in time when the product is conceived and architected, and they're working with the development team all through the development cycle, then they're less likely to end up in that sort of guardians-of-production mindset.

Yeah, I can see how those anti-patterns can play out as we change where our focus is.

Yeah, exactly. So I guess that's how I view it. Yes, there are some different perspectives. I do think SREs tend to be systems thinkers; the stronger SREs tend to be more system-wide thinkers. At the same time, they can dive deep, and they tend to be really good at troubleshooting, because they tend to be generalists also. It makes it really hard to write a job description: "hey, I want an awesome troubleshooter who's a generalist and can think in system terms" doesn't make a compelling job description. And how do you even evaluate people for something like that? There's not a lot of great training in the field, whether formal or informal education. When I was involved with the intern program at LinkedIn (well, signing up interns, I guess, hiring them for the season), we had some challenges trying to craft the job descriptions in a way that made sense, so that we could find the people who would actually enjoy and excel in the SRE field. It was easier to write a software-developer-intern sort of description.

Yeah. So it's hard when, in the job description, you say, okay, I need someone to debug, you know, somebody else's programs, right? [Laughter]

I read an amazing article some time ago, and I don't remember who wrote it, but it was about this group of people, and all they do is spend their time refactoring old software programs.
They basically get called in as specialists: hey, everybody who knew anything about this program has left; can you fix it for us? And the people who worked for this company really enjoyed doing that, but it's also a pretty specialized skill.

Yep. Well, one quick last question, I guess. You talked about instrumented code, and we talked a lot about different tools. At a high level, are there some general guidelines for instrumenting code, or is it more determined by: oh, we chose tool X, so how we instrument our code is going to be dictated by that tool?

The good thing about the OpenTelemetry spec that's come out is that it's helped solve some of that narrow focus. If you go back to the Zipkin and/or Jaeger days, what you did for one wouldn't work for the other, so you really had to decide what your consuming system was before you did the instrumentation. Now, with OpenTelemetry, you can pretty much use libraries that are mostly off the shelf, I'll say, to emit the relevant details, and figure out separately the question of how you're going to consume them and surface them to people. So I do think it's improved. I think having an open spec... I've been involved with the IETF and open standards for a long time. OpenTelemetry is not an IETF spec, I will say that much, but I do believe strongly in the value of having common protocols, because they enhance interoperability.

Yeah. Well, Kurt, I really appreciate you sharing a little of your time and shedding some light on some of these questions that Fee and I had, and I appreciate you joining us again.

Yeah, thanks for having me. Nice chatting with you, gentlemen.

Thank you. Bye, guys.

Thank you. Bye-bye.
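To illustrate the decoupling Kurt describes, here is a minimal sketch using the OpenTelemetry Python API. The tracer name, span name, and attributes are hypothetical; the point is that this code only emits telemetry, while the choice of consuming backend (Lightstep, Honeycomb, Jaeger, etc.) is made separately through SDK and exporter configuration:

```python
# Sketch: vendor-neutral instrumentation with the OpenTelemetry Python API.
# This code only *emits* spans; which backend consumes them is decided
# separately via SDK/exporter configuration, not here.
from opentelemetry import trace

# Instrumentation scope name is hypothetical.
tracer = trace.get_tracer("shop.checkout")

def checkout(cart_id: str, items: int) -> None:
    # Span and attribute names are illustrative, not from the talk.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)
        span.set_attribute("cart.items", items)
        # ... business logic; exceptions raised here are recorded on the
        # span by default when the context manager exits.
```

This is the "decide the consumer separately" property he contrasts with the Zipkin/Jaeger days: the instrumented code above stays the same whichever backend is wired up.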
