On this episode, Badri Varadarajan, EVP of product at CloudFix, and Rahul Subramaniam, AWS superfan and CTO of ESW Capital of CloudFix, dive deep into AWS Cost Optimization best practices for how to apply cost optimization principles when designing, configuring, and maintaining workloads in AWS Cloud environments.
--------
"If you keep doing cost optimization, first it's good hygiene, second, you build up your cost optimization muscle. So organizationally, when it becomes a real problem, you can hit the ground running and take commensurate proportional measures, as opposed to just going from not worrying about it at all, to it being the only thing you worry about." - Badri Varadarajan, EVP of product, CloudFix
--------
Time Stamps
* (01:12) How Rahul and Badri got started with AWS
* (07:08) Recognizing the importance of cost optimization
* (17:13) Simpler ways to save and get results
* (20:59) How to balance cost vs ROI
* (27:32) Organizational playbooks on how to structure cost optimization
* (36:34) Key factors to making cost optimization projects successful
* (39:47) Takeaways about cost optimization
--------
Sponsor
This podcast is presented by CloudFix
CloudFix automates easy, no-risk AWS cost savings, so your team can focus on changing the world.
--------
Links
Connect with Badri Varadarajan on LinkedIn
Connect with Rahul Subramaniam on LinkedIn
Connect with Dionn Schaffner on LinkedIn
Dionn Schaffner:
Welcome to the podcast. So excited to have Rahul here today, as well as Badri. Badri, how are you doing today?
Badri Varadarajan:
Very well. Sunny in California.
Dionn Schaffner:
Awesome. Fabulous. I just have to ask, what made you decide to focus your entire career on AWS? I mean, how do you get involved in this? Rahul, tell us how it started. I mean, you could have been a rocket scientist, you could have been in some deep dark lab somewhere. Why AWS?
Rahul Subramaniam:
You're close. I mean, I did major in physics in my undergrad, so astrophysics would probably have been it, but before… Very early on in 2007-2008, I was basically grappling with a whole lot of infrastructure issues, and that's when I discovered AWS. And the fact that infrastructure was being made available literally with a simple API call, just blew my mind. And the more I dived into it, the more I used it, the more I interacted with AWS. I actually got into a position where I was breaking almost every AWS service as they were releasing it, which had them call me pretty much every week about something that I broke, and just interacting with the amazingly smart folks over there over the years just got me hooked. I was then part of every new service they created. I was involved in trying everything out that they came up with. And they just became an integral part of how we did business. So I think our entire business got very deeply integrated with AWS, as well.
Dionn Schaffner:
So you were the troublemaker is what you're saying. You were the one causing all the trouble on the outside, so they're like, "You know what, we got to get this guy in closer. So, let's bring him into the fold a little bit."
Rahul Subramaniam:
I think I'd like to phrase it as me being that advanced tester and we collaboratively made the services good to the benefit of both parties, so it was a win-win.
Dionn Schaffner:
That's great. Badri, how about you? How did you end up investing in AWS and cost optimization as what you are living and breathing all of the day long?
Badri Varadarajan:
I mostly blame Rahul for that. AWS was a bit of an acquired taste for me. I spent the initial part of my career really working on the network edge, building infrastructure and algorithms that got deployed and run where you are, be it for connectivity or computer vision analysis and so on. And so, I figured, "If you can't beat them, join them."
Dionn Schaffner:
Okay. But let's talk a little bit more about cost optimization. There was a time before cost optimization, there was a dark time, we don't speak of it much, there was lots of work. Rahul, tell us what managing 45,000 plus accounts looked like without some way to automate cost and performance. Take us through the dark times.
Rahul Subramaniam:
Cost optimization, for me, with AWS, actually started way before we had 45,000 accounts. In fact, it started when we had one account, and I'm talking about early 2007. At that time, shockingly, AWS didn't even have IAM that everyone is familiar with, which basically allows you to manage all your users, access control and stuff around this. Back then you had one account. So, one username, one password, and your entire organization would have to be given that username and password to operate on it.
Dionn Schaffner:
Yikes.
Rahul Subramaniam:
Okay? And can you imagine that world back in 2007?
Dionn Schaffner:
Ugh.
Rahul Subramaniam:
And we had over a thousand engineers that we wanted to enable on AWS. So we had this big nightmare of a scenario where we had to give away our master username and password, which had all of our credit… Back then it was all credit card only. So, it had our corporate credit card punched in over there, and 1000 people could basically do whatever they wanted on one account. There was no tagging, there was nothing. It was basically… The setup was primed for absolute chaos.
Rahul Subramaniam:
And so, in 2007, I wrote our first system, which acted like IAM. And it did two things. One, it allowed users to log into a portal where they could request whatever resources they wanted. And it basically acted as a proxy to our one single AWS account. But the second thing that we found very soon was, people were just launching instances willy-nilly and never turning them off. And our credit card actually ran out of limits so fast that we would have to refund the card multiple times in a month, which is just absolutely crazy.
Dionn Schaffner:
Wow.
Rahul Subramaniam:
So, one of the first cost optimization measures that we put in place was, we asked every person to put in their eight-hour shift into the portal and we would automatically turn off or hibernate those machines that they had launched when it was not their peak working hours.
Dionn Schaffner:
Uh-huh.
Rahul Subramaniam:
If they wanted, they could go back and turn it on if they were logging in in some odd hours. But just that cut our AWS costs by 66% because you're only using it for 8 hours out of the 24 hours, right?
Dionn Schaffner:
66%?
Rahul Subramaniam:
Yeah.
Dionn Schaffner:
Wow.
Rahul Subramaniam:
That was the first cost optimization system I wrote back in 2007 and it's been a constant journey ever since. If you're spending money on a system, there's always room to optimize those costs.
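The 66% figure follows from simple arithmetic: a machine that runs 8 hours out of 24 is off for two thirds of the day. The sketch below is a rough illustration only. It assumes cost scales linearly with running hours and ignores storage that is still billed while a machine is stopped; the function names are hypothetical and are not taken from the system described in the conversation.

```python
def schedule_savings_fraction(active_hours_per_day: float) -> float:
    """Fraction of always-on compute cost avoided by running only during active hours.

    Assumes billing is proportional to running hours (an illustrative simplification).
    """
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    return 1 - active_hours_per_day / 24


def is_within_shift(hour: int, shift_start: int, shift_length: int = 8) -> bool:
    """Return True if `hour` (0-23) falls inside a shift that may wrap past midnight."""
    return (hour - shift_start) % 24 < shift_length


# An 8-hour shift leaves the machine off for 16 of 24 hours.
print(f"{schedule_savings_fraction(8):.1%}")
```

A scheduler built on this idea would check `is_within_shift` for each user's declared shift and stop or hibernate their machines outside it, while still letting them turn a machine back on manually.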
Dionn Schaffner:
I like that because as we look at the cost optimization, we talk about that, it's like, is it just for big business? Is it something that even a small startup should be concerned with? But clearly, even if it's just one account, that's a great piece of knowledge to understand that across the spectrum, there's opportunities for everybody to reach some benefits from cost optimization. What do you think is the top thing customers, clients should know about that time in the dark ages? What do you wish everyone knew that you suffered through the hard way but you know now?
Rahul Subramaniam:
I think the thing that we learned very quickly was the foundations of why we made the big bet on AWS. And that is that they are innovating at a pace that is just so remarkable. Everything that we thought of as a gap was very quickly closed in a matter of a year or two. So if you were ever making the long-term bet, you wanted to bet on AWS, or you would bet on AWS because they have a track record of constantly working on customer problems and turning all these standard utility functions into amazing services that are just commodity services you can use with simple API, and whatever gaps are there, you can be assured that if you bring it to their notice, it gets solved. So with that, it becomes a no-brainer to make the long-term bet on AWS. And I think people are still doubtful today, but having lived through this over the last 14, 15 years, I have the experience to feel very confident about that long-term bet.
Dionn Schaffner:
And maybe Badri, you can help us answer this question. Back to cost optimization, some businesses just aren't paying attention to it yet, right? They say, "It's not a priority. We've got other strategic initiatives going on." What do you say back to those folks, and how do you get them to recognize the importance of cost optimization?
Badri Varadarajan:
In a way, I'm kind of sympathetic to that idea that cost optimization isn't a problem till it is, unfortunately. There is this great article about Dropbox, I think, where as long as the revenue was growing and the market was rewarding growth in valuations, they didn't worry a jot about cost optimization, and arguably rightly so. But then the problem is once that curve flattens, you don't want to go into panic and sort of get into whiplash and say, "Cost isn't the problem," and suddenly, the next quarter, it is the number one thing: "We don't care. We shut down all our innovation projects." I think you want sort of a wholesome way to think about cost optimization. If you keep doing cost optimization, as a matter of course, first, it's good hygiene. Second, you build up your cost optimization muscles organizationally. And when it becomes a real problem, then you can sort of hit the ground running and take proportional measures as opposed to just going from not worrying about it at all to it being the only thing you worry about.
Dionn Schaffner:
I like that. "Building the cost optimization muscles within the organization," love that. So when the time comes, you can flex big, you're ready. And to follow up with that, one of the other things we hear then though is, people are like, "Hey, yeah, we are working on cost optimization. We have an internal team. We've got some internal tools." How do you balance that challenge?
Badri Varadarajan:
One way to approach that is to ensure that you're not just doing cost optimization by listing a bunch of tasks. You want to sort of go towards a goal. What you get from doing this over time is just the proof that such a thing is possible, right? I mean, you want to do the four-minute mile. You want to understand that costs can actually be reduced in an organic way. And our framework to think of it is, if you're just starting with your cost optimization journey, you can get to 50 to 60% cost reduction. Every now and then, folks like Rahul work some magic and get 66% by doing one thing, but that's the exception, not the rule. I mean, you will get to 10% by doing something simple. And then the next 20% ends up being a little bit more complicated. And then the last 30% involves a bunch of sprints, which go deep, but it's healthy for you to know, starting out, that it is possible. It's not a fool's errand. You will get there if you do it in a systematic way and choose your projects well.
Dionn Schaffner:
Well, how do you know when you're doing it wrong? How do you know if you're not choosing your projects well? What does that look like?
Badri Varadarajan:
What it looks like is quarter after quarter of potential savings. I mean, it's very easy for you to either hire a vendor or do it yourself and get an impressive looking report that says your cost can be lowered by 60%. The problem is that that's not realizable. All it really does is, every quarter, you feel bad about what you did not achieve last year.
Dionn Schaffner:
There's the money left on the table. Dang.
Badri Varadarajan:
That's right. I'm now reading this book called Switch, about organizational change. You don't actually want to just paint a big picture and not take the first step. It's healthy for you to sort of feed your reptilian brain by booking small victories. If you have a grand plan and never do anything, you're probably doing it wrong. You should be able to do organic incremental improvements.
Dionn Schaffner:
Rahul, I feel like you probably have some war stories about this.
Rahul Subramaniam:
Yeah. Early on when AWS had just started, I think, like Badri said, it was possible to make a few changes that would get big returns because there were a lot of gaps in the services, and the infrastructure, and the API, and stuff like that. But over the last 15 years, AWS has built so much maturity around how they build up their services that getting those big savings by doing one thing is just incredibly hard. Early in the days of our cost optimization efforts, we ran into a bunch of scenarios, for example, migrating databases or moving over applications to serverless. So, a large number of applications that we acquired were primarily on-premise monolithic applications. And we tried to switch them all over to services like Lambda when Lambda first came around.
Rahul Subramaniam:
Lambda is not designed to deal with monolithic applications like the ones that we had. Right? And that meant that we were embarking on this major surgery on our applications, trying to replicate the same function in a microservices pattern. And suddenly, we had so much chaos that we just didn't know how to manage it. And we had a number of failed exercises like that where either, over a period of time, we completed maybe 20 or 30% of the functionality as we moved over to microservices based on Lambda. Or it was just a non-starter because there are certain things that were being done in a certain way that customers were comfortable with, and you would have to change the entire mechanics of it as it moved to the serverless world. Right? So, we just couldn't make all those dimensions meet. So, we realized the hard way that those big bang approaches were very few and far between as the services matured over a period of time.
Dionn Schaffner:
Well, and so you tried a ton of cost optimization tools before deciding to build your own, right? So what were they missing that you thought, "I can do this better. Let me just sit down and put some stuff together." What were they missing and why did you think you could do it better?
Rahul Subramaniam:
Yeah. First and foremost, I already have the big burden of being responsible for almost two and a half billion lines of code that we own across all of the companies in our portfolio. And I had absolutely no interest in building a new product or building and owning a new code base. That literally was not the intention at all. Our default is to go and look for every tool out there in the market that could potentially help us solve this problem. So, we did just that. We tried out all the tools, but very soon we realized that there were three fundamental problems. Problem number one was that most of these tools ended up being visualization tools that were closing that gap on some of the visualization problems that AWS had. By the way, all of those are gone. Right now, if you look at AWS tools, they pretty much give you all the data you want, but most of the cost optimization tools that you find in the market today are still glorified visualization tools that literally take all the data that's in AWS and present it to you in fancy graphs. Okay?
Rahul Subramaniam:
The bottom line is it became our problem to go figure out how to realize those savings because you have to slice and dice all that data and figure out what the insight is, and then go figure out how to realize the savings.
Rahul Subramaniam:
The second problem with these tools was that none of them actually fixed anything. Even if they actually provided some insights or some recommendations, they really didn't fix anything for us. More often than not, these recommendations were like, "Hey, resize your EC2 instances," or "Why don't you move to a completely different serverless platform, because the per unit cost there is completely different." All of those recommendations, while great sounding, and with recommended or potential savings of 50, 60%, were incredibly hard to realize, because you needed to perform major application surgery to achieve anything even remotely close to that kind of savings. And we just couldn't get our teams to sign off on or be successful at those major surgeries that they had to perform.
Rahul Subramaniam:
And the third issue with a lot of these tools was that they were just insanely complicated. Just navigating a lot of these tools required almost like a PhD in AWS services where you wouldn't even know… They just kept slapping on stuff over their basic visualization that they started off with, and you wouldn't even know where to go look for insights even if they provided some insights. And a lot of these tools just got so overly complicated, requiring admin permissions to do literally anything, that it just became a no-go for a large proportion. So the admin permissions, complex UI, that became a big issue, as well. And because all of those hurdles were things that made a lot of these tools no-gos for us, we had to invest in figuring out a simpler way to realize savings, not just talk about potential savings.
Dionn Schaffner:
This is where the rubber hits the road. It's great in theory, but how do we actually get these results?
Rahul Subramaniam:
By the way, there was a year that we spent trying out all these tools and putting together a SWAT team trying to get all the savings. We spent a few million trying to realize these big savings, but literally saved nothing.
Dionn Schaffner:
Well, that's part of the journey, right? And that led you to CloudFix. Okay, you get one minute to talk about CloudFix specifically. Well, you both do. So, Rahul, you go first, and then Badri, you're going to follow up with your comments on CloudFix. Ready? Go
Rahul Subramaniam:
Very simply, what we did was we looked at all the AWS recommendations that they had made around cost. We filtered them down to the ones that we believed were completely non-disruptive and that we could execute centrally. And that's literally what we did. And then, of course, Change Manager came at just about the right time, where we were able to use Change Manager as a mechanism to deploy those fixes without needing admin credentials. So, in effect, we were fixing the problem instead of just talking about potential. And we did not require admin credentials. We basically followed all of AWS's best practices and recommendations to realize those savings. And that was basically what closed all the gaps for us that we had with the other tools.
Dionn Schaffner:
Badri?
Badri Varadarajan:
Actually, I have nothing to add now. AWS [inaudible 00:19:33]. CloudFix is supposed to be the "fix it, don't talk about it" tool. So, that's it. We're done.
Rahul Subramaniam:
Yeah. I mean, it's supposed to be the simplest tool out there. With five clicks, you save 10 to 20%. It's supposed to be simple.
Dionn Schaffner:
I like it. And so let's talk about the 5 to 20%. Let's talk about the money. You mentioned that you all spent a million dollars trying to get to this product. How much are you saving now by having this tool in your arsenal? Maybe you can use percentages if you're not going to talk real money.
Rahul Subramaniam:
Yeah. Our savings are a combination of what you see in CloudFix today and stuff that will be coming up in CloudFix, because we actually run all of these finders and fixes on our setup first before we run or deploy them for other customers. And we run it for quite a while. We literally measure ourselves every week: "How much do we save on an annualized basis? What amount of spend do we claw back every week?" That's our metric, and we measure it in dollar terms, in concrete dollar terms.
Rahul Subramaniam:
And today, for our spend, which across AWS customers isn't actually very large, we still manage to claw back about a quarter of a million dollars a week, which is pretty significant, and this is, again, on an annualized basis. But I think, on average, for customers, you could find that whatever your AWS spend is, when we try to buy a company, or when we are evaluating acquisitions, it's a simple assumption that we make, which is, "Long term, we can save 50% via this incremental mechanism, but short term, 10 to 20% is a given." It is something that we can absolutely go on.
Dionn Schaffner:
That's great. And if you think about it in terms of the cost optimization spectrum, CloudFix is on the easy side. Press five steps, you're going to get this kind of return. And then you've spoken before of the really harder problems that take up more of an investment of the organization to sort of go and dig in. Maybe Badri, can you tell us, how do you balance the cost of your internal resources and the risks against your expected ROI of getting these cost optimization results in-house? Is there, like, a magic number? Where do you find the tipping point for the business to decide, "Hey, yes, we are going to make this investment with our time, with our people, with our resources, and to really dig into this particular cost optimization problem"? How do we know when we should jump or when we should just wait?
Badri Varadarajan:
Yeah. That's a good question. I'd say I'd slice it into different tiers. There are about 10 to 20% of savings. That's the realm that CloudFix operates in, where your investment should really only be on the tool, not on people managing the tool. The tool should just basically do it. Then the next 30% you get by investing people as well. Essentially, those are the sorts of savings for which you need engineering teams to get involved. You want to ensure that your cost optimization does not affect the functionality of the product. So I'd almost operate that as an engineering project itself and look at the ROI in those terms, so you do need to account for what you're spending in terms of manpower there, how many hours are people spending, and what features are they not shipping because they're doing that.
Badri Varadarajan:
So I divide the ROI into those two different tiers. And I think one of the reasons CloudFix exists is this realization that you can actually get 10 to 20% without involving people and investing brain power in it. It's just a tool that just does its thing, and you don't need to do anything beyond click a few buttons and count the cash.
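For the second tier, where engineers get involved, the ROI framing can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the function name, the parameters, and every number in the example are hypothetical, and the way "features not shipped" is priced (a single opportunity-cost figure) is an assumption made for illustration.

```python
def engineering_roi(
    annual_savings: float,
    engineer_hours: float,
    hourly_cost: float,
    opportunity_cost: float = 0.0,
) -> float:
    """Return ROI as a fraction: (savings - total cost) / total cost.

    total cost = hours spent * hourly cost + value of features not shipped.
    """
    total_cost = engineer_hours * hourly_cost + opportunity_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (annual_savings - total_cost) / total_cost


# Hypothetical project: 2,000 engineer-hours at $100/hour to save $300,000 a year.
roi = engineering_roi(annual_savings=300_000, engineer_hours=2_000, hourly_cost=100)
print(f"First-year ROI: {roi:.0%}")
```

In the first tier the same calculation degenerates to the price of the tool versus the savings it delivers, because no engineering hours are meant to be spent.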
Rahul Subramaniam:
I'll add one more dimension to this. The non-disruptiveness of what you're going after is a second dimension in this. So I absolutely agree with Badri that the tool versus people is one dimension, but also look at the non-disruptiveness of the changes that you're trying to make. And that also, again, it's like a quadrant. So there are a bunch of non-disruptive changes that do require people assets. So for example, there's a bunch of financial engineering or just process-related stuff that can save you a bunch of costs. For example, if you're migrating a bunch of workloads from on-prem to AWS, you should absolutely have someone take the lead and sign up for the Migration Acceleration Program, where AWS covers a bunch of your costs while you're migrating all your stuff so that you're not paying double during the migrations. That's a great way to save a ton of money. Most people don't even know about the Migration Acceleration Program, but it's the easiest thing to get signed up for. All you need to do is tag your resources in a certain way, and you start getting credit from AWS for all of those workloads.
Rahul Subramaniam:
Another example is, if you have certain services where your consumption of those services is just insanely expensive, you absolutely should have somebody go talk to the product team and see if you can work out a discount or a volume discount with that particular product team, because AWS does that very often. If they find a customer using a particular service far more than anyone else, they will make concessions, and you can always go negotiate that.
Rahul Subramaniam:
The third one is leveraging things like savings plans, and CRIs, and reserved instances, that category. You need someone you can dedicate to this who can understand all of it. It's just a bunch of financial engineering where it's a trade-off between commits and the discounts that you get. And you can do that and get up to 50% discount on your spend, depending on how much of a trade-off you're willing to make in terms of commitment versus the discount you get.
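The commit-versus-discount trade-off can be sketched with a simplified model. It assumes a single flat discount in exchange for paying for the full committed capacity whether or not it is used, and it ignores payment options, term lengths, and usage above the commitment. The function names are hypothetical, and the 50% discount in the example is simply the upper figure mentioned in the conversation.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the commitment you must actually use for the commitment to pay off.

    With a flat discount d you pay (1 - d) of the on-demand price for the full
    commitment, so you come out ahead only if you use more than (1 - d) of it.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def effective_savings(discount: float, utilization: float) -> float:
    """Savings relative to paying on demand for only what was used.

    `utilization` is the used fraction of the commitment (0 < utilization <= 1).
    A negative result means the commitment cost more than on demand would have.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    committed_cost = 1 - discount  # paid regardless of usage
    on_demand_cost = utilization   # pay-as-you-go for the same usage
    return (on_demand_cost - committed_cost) / on_demand_cost


print(f"Break-even utilization at a 50% discount: {break_even_utilization(0.5):.0%}")
print(f"Savings at full utilization: {effective_savings(0.5, 1.0):.0%}")
print(f"Savings at 40% utilization: {effective_savings(0.5, 0.4):.0%}")
```

Under these assumptions a deeper discount tolerates more idle commitment, which is why someone needs to understand the expected usage before signing up for the larger commitments.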
Rahul Subramaniam:
And lastly, there is the EDP. And again, I don't recommend anyone sign the EDP. I treat it almost like a handcuff around your hands. I mean, it's not something I recommend, but that is an instrument of last resort, as well, to get a certain amount of discount if you're willing to make commitments or if you're really, really, really sure of what your spend over the next three to five years is going to look like.
Rahul Subramaniam:
Now, those are all the things that you could do by investing certain people and getting those cost benefits. And they measure near zero on the disruption side of the equation.
Dionn Schaffner:
It really sounds like you need allies within your own organization sometimes to sort of come on board to participate in this cost-optimization journey. Engineers who are down in it every day, who are feeling the pain, how do they sell wanting to go do these cost-optimization projects up the organization, or sideways in the organization, to get more folks on board? Like, "Hey, we're going to need some talent resources. We're going to need some business folks to really help us understand the business challenges of this." How do you champion this throughout the organization?
Badri Varadarajan:
Actually, that is one of the more important things here. A lot of this, there's good know-how out there, particularly AWS publishes a bunch of these things. One problem is, they make it complicated, which is why tools like CloudFix are needed. But the other thing is organizational buy-in. If you look at strategies that we have seen that work, they switch between the small, furry mammal strategy and the apex predator strategy for survival. You first get some small wins and sort of earn your stripes, and then the organizational people take you seriously. It's like, "Okay, hey, this project can save money. Let's invest more here." And then you go into those… I mean, you may not save a lot of money that way, but at least you've proven that such a thing is possible. And then you can switch to the apex predator strategy, which is, "We need engineering teams to care about this."
Badri Varadarajan:
Even things like tagging that Rahul was talking about, you need folks to buy in and go tag those resources. The one thing I'd recommend is to first earn your organizational stripes and then fight those battles. That's one thing.
Badri Varadarajan:
The other thing is just the process itself. Change Manager really makes it simple because you don't have to get buy-in which involves meetings, and playbooks, and people reading off of the same playbook and following processes, and so on. That gets baked into the AWS console itself, which is a big thing. The AWS Change Manager team wrote a blog post on how we approach this. I was talking to the product manager the other day, and she told me that has been used by other organizations as well. So there are organizational playbooks here, as well, in terms of how you can structure cost optimization.
Rahul Subramaniam:
Yeah, I think, again, going back to what Badri was saying earlier, the two dimensions of tool versus people, it is, in my opinion, 100x harder once you get into that people investment domain. So, you want to maximize and get all your wins from the tools and the automation before you get into asking for investment of people, and time, and resources, because expertise is scarce, the resources are scarce. Whatever limited resources you have, you want to invest them in building features and building your applications, because they have the domain knowledge about your business. And that's where you want to invest that skill and expertise.
Rahul Subramaniam:
The second dimension, of course, also plays a role. Most people get scared when you start moving towards any sort of disruptive change, because they don't understand it. They fear what the impact is going to be. And they're more likely to figure out ways to reject ideas of change than be participants in it. So, you have to earn their trust. You have to earn your stripes, like Badri said, but try to get as many wins out of the tools first and exhaust all of those options before you start asking for people resources, building up knowledge amongst the people and allaying their fears about what these changes might mean for them. And that would be my approach.
Dionn Schaffner:
We do hear from our customers who are considering various cost optimization projects, including CloudFix, that security is really where they struggle to participate, right? They're like, "Wait, you want me to give you admin access to all of this?" How do you combat that particular hurdle?
Badri Varadarajan:
Yeah, absolutely. Right. I mean, security is a big concern and that's why you want to address it organically as part of every project itself. So, from a tool point of view, that was actually one of the key design considerations that went behind CloudFix. When you go into these people projects, again, there you want to ensure permissions, not only in terms of who's allowed to access it, but I think you also want to limit what people can do. There are two dimensions to security: (a) which tool or which person is allowed to do something, and (b) what they're allowed to do. And there you want to ensure that you put the right service control policies in place, you're putting other rules in place. AWS will give you these tools, but they do believe in just giving you tools and letting you build it yourself.
Rahul Subramaniam:
Yeah, absolutely. I mean, AWS has invested a lot in creating the Well-Architected Framework, which you can use to define your security parameters very well in the AWS framework so that you can be assured of exactly what a tool or people are allowed or not allowed to do. Unfortunately, for most organizations, security comes more as an afterthought. Like, they will first build the application, they will first do a bunch of stuff, and then they will think about, "Okay, now what should I do about security?" When I say "security," I also include permissions and schemes that may not be security in terms of somebody attacking your infrastructure or something like that. Security, for me, is also where you let somebody launch a bunch of instances without control or where you let people launch different kinds of resources which may or may not be auditable by the organization. So, best practice is setting up your Service Catalog, setting up policies, things like that.
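One form such a policy could take is a service control policy that limits what can be launched, regardless of who is asking. The sketch below only builds the policy document as JSON; it does not call AWS. The function name, the `Sid`, and the blocked instance-type patterns are hypothetical, and the statement follows the general shape of IAM-style policy documents (a `Deny` on `ec2:RunInstances` conditioned on `ec2:InstanceType`), so it should be checked against current AWS documentation before use.

```python
import json


def deny_instance_types_policy(blocked_patterns: list[str]) -> dict:
    """Build a policy document that denies launching EC2 instances of given types.

    `blocked_patterns` may contain wildcards, e.g. "p3.*" (illustrative values).
    """
    if not blocked_patterns:
        raise ValueError("at least one instance-type pattern is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBlockedInstanceTypes",  # hypothetical identifier
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringLike": {"ec2:InstanceType": sorted(blocked_patterns)}
                },
            }
        ],
    }


policy = deny_instance_types_policy(["p3.*", "p4d.*"])
print(json.dumps(policy, indent=2))
```

Because a policy like this restricts actions rather than identities, it addresses the "what they're allowed to do" dimension separately from who holds credentials.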
Dionn Schaffner:
Let's talk about the people. Whose lives are changed on the daily by implementing some of these cost-optimization initiatives? Like, how does their life change before a cost optimization to after?
Badri Varadarajan:
I think the folks who are happiest with this are in the CFO's office, right? Because it's good for them. But I think that's part of the challenge of this, is to ensure that you're making the CFO's office happy without affecting the CTO's office or the VP eng.'s job. And ideally, that's what you want to do. You want to ensure that all your projects have well-known blast radii in the beginning that affect as few people in the engineering organization as possible, and wherever it does affect them, the effect is contained, that they know exactly what they need to do, or some automation, or tool, or UI messaging will tell them exactly what they need to change in their behavior. And hopefully it's not [inaudible 00:33:15].
Badri Varadarajan:
I'll give you an example. You can put in a policy that says you cannot launch instances of a certain type because you want to protect yourself against Bitcoin mining attacks. That's fine, but you need to have a clear way of communicating that and ensuring that it does not affect operations that happen as a matter of course. And good practice there would be to, before you make a change, just do a dry run and see which operations it would have affected in the last three months, and target messaging towards the people who would be affected by those.
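A minimal sketch of the dry run described above, assuming the launch history has already been exported into simple records (for example, from audit logs). The record fields, the blocked instance-type patterns, and the function name are illustrative assumptions, not how CloudFix or AWS implements this.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatch

# Hypothetical deny list: instance-type patterns the proposed policy would block.
BLOCKED_PATTERNS = ["p3.*", "p4d.*"]


def dry_run(launch_events, blocked_patterns, now, lookback_days=90):
    """Return {user: sorted blocked instance types} for launches inside the
    lookback window that the proposed policy would have denied."""
    cutoff = now - timedelta(days=lookback_days)
    affected = {}
    for event in launch_events:
        if event["time"] < cutoff:
            continue  # older than the lookback window, ignore
        if any(fnmatch(event["instance_type"], p) for p in blocked_patterns):
            affected.setdefault(event["user"], set()).add(event["instance_type"])
    return {user: sorted(types) for user, types in affected.items()}


# Made-up sample history, for illustration only.
NOW = datetime(2022, 6, 1)
SAMPLE_EVENTS = [
    {"user": "ml-team", "instance_type": "p3.2xlarge", "time": datetime(2022, 5, 10)},
    {"user": "web-team", "instance_type": "t3.micro", "time": datetime(2022, 5, 12)},
    # Matches a blocked pattern but falls outside the 90-day window.
    {"user": "ml-team", "instance_type": "p3.8xlarge", "time": datetime(2021, 12, 1)},
]

if __name__ == "__main__":
    for user, types in dry_run(SAMPLE_EVENTS, BLOCKED_PATTERNS, NOW).items():
        print(f"notify {user}: policy would have blocked launches of {types}")
```

The output is a per-user list, which is the shape you would want for targeting messages only at the people the policy would actually touch.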
Dionn Schaffner:
Tell us about a blast radius that has gotten away from you, and sort of been the worst cost-optimization project. And what do you think was the key contributor to something like that? Rahul, you probably have some good stories.
Rahul Subramaniam:
There were times in the early days, when we were just starting with our cost-optimization journey, when we had a bunch of instances which were doing very little, to the point that the resource utilization was near zero. And in one of the very early cost-optimization exercises that we did, we ended up shutting off hundreds of such machines, which ended up being fairly critical from an operations standpoint. Thankfully, we snapshotted all of these instances before we decided to shut them off, as a safety measure. So, though there was disruption, we were able to bring all of these instances back, but had we not put that measure in place, we would probably have suffered greatly. I think one of the other things that has been an interesting insight, for me at least, is that over a period of time, as organizations evolve and change, I find it really shocking how little the organization knows about all of their infrastructure.
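The snapshot-first safety measure can be sketched as an ordering rule: never stop a machine unless its snapshot succeeded. This is only an illustration under stated assumptions. `take_snapshot` and `stop_instance` are hypothetical callables that you would wire to your own cloud tooling, and the instance IDs are made up.

```python
def snapshot_then_stop(instance_ids, take_snapshot, stop_instance):
    """Stop each instance only after its snapshot succeeds.

    Returns (stopped, skipped): stopped maps instance id -> snapshot id,
    skipped lists (instance id, reason) pairs that were left running.
    """
    stopped, skipped = {}, []
    for instance_id in instance_ids:
        try:
            snapshot_id = take_snapshot(instance_id)
        except Exception as error:
            # No restore point, so leave the instance running.
            skipped.append((instance_id, str(error)))
            continue
        stop_instance(instance_id)
        stopped[instance_id] = snapshot_id
    return stopped, skipped


def fake_snapshot(instance_id):
    """Stand-in for a real snapshot call; fails for one made-up instance."""
    if instance_id == "i-bad":
        raise RuntimeError("snapshot failed")
    return f"snap-for-{instance_id}"


if __name__ == "__main__":
    stop_log = []
    stopped, skipped = snapshot_then_stop(
        ["i-a", "i-bad", "i-b"], fake_snapshot, stop_log.append
    )
    print("stopped:", stopped)
    print("left running:", skipped)
```

Keeping the snapshot IDs alongside the stopped instances is what makes the rollback Rahul mentions possible.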
Rahul Subramaniam:
You have tens of thousands of machines and resources running all over the place. You would think that there's a perfectly auditable manifest of exactly what machine runs what, and you know when you're going to shut it off, or whatever. You'd be surprised as to how many instances are running because nobody knows what's in there and they're just petrified to shut them off, or EBS volumes that are running or that have been detached. They're just literally sitting there, massive disks with terabytes of storage capacity provisioned on them, but when you ask someone to go delete one because nobody has touched it in a year, they're like, "I don't know what that has, so I'm just petrified to shut it off. And I don't know anyone else who knows what that instance has." Right? So you'd be surprised as to how prevalent that is across large organizations. Especially over a period of time, people really don't have a sense of the inventory of infrastructure and resources that are running, and that's a big cause of a lot of waste and cost, as well.
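As a rough sketch of the inventory check implied here, assuming you already have a list of volume records: flag volumes that are not attached to anything and have not been used in a year. The field names (`state`, `last_used`, `size_gib`) and the sample data are assumptions for illustration; a real inventory would come from your own tooling.

```python
from datetime import datetime, timedelta


def stale_unattached_volumes(volumes, now, max_idle_days=365):
    """Return unattached volumes idle for longer than max_idle_days,
    largest first, so the biggest potential savings are reviewed first."""
    cutoff = now - timedelta(days=max_idle_days)
    stale = [
        v for v in volumes
        if v["state"] == "available" and v["last_used"] < cutoff
    ]
    return sorted(stale, key=lambda v: v["size_gib"], reverse=True)


# Made-up inventory, for illustration only.
NOW = datetime(2022, 6, 1)
SAMPLE_VOLUMES = [
    {"id": "vol-1", "state": "available", "size_gib": 2000, "last_used": datetime(2020, 3, 1)},
    {"id": "vol-2", "state": "in-use", "size_gib": 500, "last_used": datetime(2022, 5, 30)},
    {"id": "vol-3", "state": "available", "size_gib": 8000, "last_used": datetime(2021, 1, 15)},
    {"id": "vol-4", "state": "available", "size_gib": 100, "last_used": datetime(2022, 4, 1)},
]

if __name__ == "__main__":
    for volume in stale_unattached_volumes(SAMPLE_VOLUMES, NOW):
        print(f"review {volume['id']}: {volume['size_gib']} GiB, unattached and idle")
```

A report like this does not decide what to delete. It gives the team a concrete list to find owners for, which is the gap described above.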
Dionn Schaffner:
From the business side, it's the "Infrastructure is just there. It's running. We're not going to mess with it. We're not going to peel back the layers and see what's under the covers. All we know is it's working, don't touch it," right? But at some point, there is the ROI discussion of, "Okay, we need to tighten it up. We need to free up some resources, do some other things. Let's scarily open the hood, see what kinds of things fly out, and let's go and address that." And oftentimes you don't see organizations utilizing their IT infrastructure as a strategic advantage, right? It's just supporting the business. And so you have to have these moments of everyone around the table, we all come together and look and say, "Hey, this is really important strategically. We really need to dig in and make this work for us, not just keep us treading water evenly. How can this help us advance where we're going?"
Dionn Schaffner:
All right. So what are the key factors in making a cost-optimization project successful? Badri, you talked about identifying the blast radius. What other critical things does not just IT but the whole business need to make sure are in place as we launch a cost-optimization initiative?
Badri Varadarajan:
Who needs to be informed, like, whose lives get impacted, as you said earlier. To Rahul's point earlier, as well, a key fact that Rahul mentioned is that folks actually don't know what's running in their AWS infrastructure. And in fact, within the same team, different people don't know what others are doing. I mean, we've had both things happen. A manager signs off on something because they're not aware of exactly what their DevOps engineers are doing. And you think this is a simple change, and it turns out to be a huge problem. To give you a concrete example, you think you're just restarting an instance and it's going to come back within three minutes, but it's running some startup script that nobody knows about.
Badri Varadarajan:
And the DevOps engineer who knew about it is on vacation. And we never even asked him because, "Hey, this is just a restart. Why do you need to ask anybody?" On the flip side, projects have gotten blocked for months on end because the manager thought it was a much bigger deal than it actually was. And when the engineers got involved, they're like, "Ah, this is just a tag. I'll do it. I have an automation for this. I already have a way to address all these machines." So, figuring out who the stakeholders are for any given project is super useful. And that's also why you want to cut it up by project. You don't want to have this one big goal of saving 60%. You want to slice it into a particular application or a particular service that you want to optimize.
Rahul Subramaniam:
And I'll just add one more thing. I think having automation and tooling in place is really key. When you do manual one-off stuff, you are more likely to make mistakes and regret it. So, whatever project you take on, make sure that it is automation-driven because that's how you ensure that what you're doing is repeatable, not just for one resource but across your entire setup, and you're not going to make mistakes, because it's all enshrined in code. So, automation is, I think, key. No matter what kind of project you're delivering, try not to have manual steps in the process.
Dionn Schaffner:
Because people are always causing the problems.
Rahul Subramaniam:
I mean, in reality, if you did first principles thinking and said, "What is the root cause of all bugs?" Somebody wrote some line of code that caused a bug.
Dionn Schaffner:
Mm-hmm. Let's talk about the talent that we have at CloudFix. So you all are attracting some amazing talent there. How are you doing that? How are you bringing the best and brightest minds to come in and tackle this problem?
Rahul Subramaniam:
I think one of the advantages that we have is that we are literally working at the bleeding edge of cloud computing, where we are trying to stay ahead of the curve of this fire hose of AWS services, product updates, and announcements. And that is literally the bleeding edge of computing.
Dionn Schaffner:
Well, and rumor has it that two out of the first three FinOps certifications are sitting with folks in the ESW Capital organization right now. So you truly have the best of the best in this area. You each get your top two takeaways about cost optimization that you want our listeners to walk away with. Badri, you start.
Badri Varadarajan:
Cost optimization is possible, and you can do it incrementally.
Rahul Subramaniam:
Care about the dollars saved and realized. Don't go after the potential. And the second one would be, rely on tools and automation as much as you can, because the minute you start needing people to do a whole bunch of stuff, you just get slowed like crazy.
Dionn Schaffner:
Well, thank you both for this enlightening and rousing conversation. Badri, thank you. Rahul, thank you. It's been a great time talking to y'all.
Badri Varadarajan:
Thanks.
Rahul Subramaniam:
Likewise, an absolute pleasure.
Dionn Schaffner:
Thanks everyone for listening today. If you enjoyed our podcast, please be sure to rate, review and subscribe. See you next time on AWS Insiders.
Dionn Schaffner:
We hope you enjoyed this episode of AWS Insiders. If so, please take a moment to rate and review the show. For more information on how to implement a hundred percent safe, AWS-recommended account fixes that can save you 10 to 20% off your AWS bill, visit cloudfix.com. Join us again next time for more secrets and strategies from top Amazon insiders and experts. Thank you for listening.