[This fragment is available in an audio version.]
Alone in a room with the customer’s CTO, I said “So, I spent this morning talking to your people, and it looks like you’re using low-level generic tech for everything and managing it yourself. You could save a shit-ton of money by using high-level managed services and get load-scaling for free.”
“Yeah,” he said, “but then we’d be locked in and the next time our enterprise contract comes up for renewal, you can screw us.”
It doesn’t matter who the customer was, except that they were big, and they thought they were paying too much, and I thought so too. It’s about the starkest example I’ve run across of lock-in fear, which inevitably raises the subject of multi-cloud, hot stuff these days.
The Trade-off · It’s really pretty simple. All the Cloud vendors optimize for the case where you’re running your IT end-to-end on their cloud; that way, everything fits together better. Also, it’s in harmony with the natural software-ecosystem trend towards platforms; vendors love having them and developers are comfortable building on them. So buying into a single cloud vendor just makes life easier. ¶
But… lock-in. It’s not just a ghost story. Consider that the vast majority of the world’s large enterprises have lost control of their per-employee desktop-software spend. They pay whatever Microsoft tells them they’re gonna pay. A smaller but substantial number have also lost control of their database licensing and support spend. They pay whatever Oracle tells them they’re going to.
Microsoft and Oracle calibrate their pricing in a way that bears no relationship to their cost of production, but is carefully calculated to extract the maximum revenue without provoking the customer to consider alternatives. Also they design their platforms so that migration to any alternative is complex and painful.
So I’d be willing to bet that many CIOs have heard, not just from their CEOs but from their Board, “We don’t want you coming back to us in three years and telling us you’ve lost control of your cloud spend.”
But the Public Clouds aren’t like that! · They really aren’t. In each case, the founders are still somewhat in the picture, bringing a vision to the table that extends beyond short-term profits. I have met these people and they genuinely obsess all the time about what can be done to make their customers successful. Anyhow, the business combines high margins with high growth, so you don’t have to squeeze customers very hard to make investors happy. ¶
Here’s an example: Last year, AWS dramatically reduced data egress charges, under pressure from Cloudflare, Oracle, and of course their customers.
But that’s today. Let’s think ahead to 2030 and zero in on a hypothetical future version of AWS. First, there was that unfortunate accident involving one of Elon’s satellites and the Blue Origin rocket Jeff was on. And Andy, a multibillionaire via that Amazon-CEO package, bought out the rest of the Kraken owners, then pulled off a surprise grab and now owns the Seahawks too. He’s discovered that being a sports mogul beats the hell out of testifying on live TV to Pramila Jayapal.
In this scenario, activist investors and big-time PE players have won a majority on the Amazon board and they don’t want any of their business units missing any chances to turn screws that maximize revenues. Data egress charges ratchet up every quarter, as do the prices of high-level proprietary services like Kinesis and Lambda.
A whole lot of customers have lost control of their cloud-computing spend; they pay whatever their AWS account manager tells them they’re gonna pay.
Benefits of going all in · So the fear is real, and reasonably founded. Which is sad, because if you go all-in, there are pretty big pay-offs. If you want to build a modern high-performance global-scale application, a combination of Lambda, Fargate, DynamoDB, EventBridge, and S3 is a really attractive foundation. It’ll probably scale up smoothly for almost any imaginable load, and when the traffic falls, so do your bills. And it’ll have geographical redundancy via AWS Availability Zones and be more robust than anything you could build yourself. ¶
You can decide you’re just not up for all these proprietary APIs. But then you’re going to need to hire more people, you won’t be able to deliver solutions as fast, and you’ll (probably) end up spending more money.
The solution spaces · So, practically speaking, what can organizations do? There are three options, basically. ¶
Plan A: All-in · It’s a perfectly reasonable business strategy to say “lock-in is a future hypothetical, right now I need richer apps and I need them faster, so let’s just go all-in with GCP or Azure or AWS.” Lots of big, rich, presumably smart organizations have done this. ¶
It works great! You’re in the sweet spot for all the platform tooling, and in particular you can get closer to the ideal, as articulated by Werner Vogels in one of his re:Invent keynotes: “All the code you write should be value-adding business logic.” You will genuinely deliver more software with high business value than with any other option.
(Your cloud account manager will love you, you’ll get great discounts off list price, and you’ll probably get invited on stage at a cool annual conference, in front of thousands.)
Plan B: Bare metal · This is the posture adopted by that company whose CTO gave me a hard time up in the first paragraph. They rented computers and storage and did everything else themselves. They ran their own MySQL, their own Kafka, their own Cassandra, their own everything, using a number of virts that I frankly couldn’t believe when they first told me. ¶
They had big engineering and SRE teams to accomplish this, and had invented all sorts of cool CI/CD stuff to make it bearable.
In principle, they could pick up their whole deployment lock, stock, and barrel, move it over to GCP, and everything ought to work about the same. Which meant a lot to them.
They had a multi-year enterprise agreement which I can only assume included fabulous discounts.
Plan C: Managed OSS · Here’s what I think is an interesting middle-of-the-road path. Decide that you’re going to go ahead and use managed services from your Cloud provider. But only ones that are based on popular open-source projects. So, use Google Cloud SQL (for MySQL) or Amazon Managed Streaming for Apache Kafka or Azure Kubernetes Service. Don’t use GCP Bigtable or Amazon Kinesis or Azure Durable Functions. ¶
This creates a landscape where switching cloud providers would be nontrivial but thinkable. At least you could stay with the same data-plane APIs. And if you use Terraform or Pulumi or some such, you might be able to make a whole lot of your CI/CD somewhat portable as well.
And in the meantime, you get all the managing and monitoring and security and fault-tolerance that’s built into the Public-Cloud infrastructure, and which is getting pretty damn slick these days.
But, data gravity! · I heard this more than once, usually from someone who looked old and wise: “The real lock-in ain’t the technology, it’s the data. Once you get enough petabytes in place at one cloud provider, you’re not gonna be moving,” they’d say. ¶
I’m not so sure. I worked a bit on the Snowmobile project, so I’ve had exposure to the issues involving movement of big, big data.
First of all, the Internet is faster than it used to be and it’s getting faster every year. The amount of data that it’s practical to move is increasing all the time.
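For a rough sense of what “practical to move” means, here is a back-of-the-envelope calculation of how long a petabyte takes to cross a network link. It’s a sketch under simplifying assumptions that aren’t in the original: decimal units (1 PB = 10^15 bytes), a single link sustained at a fixed fraction of its line rate, and no other overhead. The function name `transfer_days` and the link speeds are purely illustrative.

```python
def transfer_days(num_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to push num_bytes through a link of link_gbps gigabits/second.

    Assumption (illustrative only): throughput is sustained at `efficiency`
    times the nominal line rate for the whole transfer; 1.0 means ideal.
    """
    bits = num_bytes * 8                               # bytes -> bits
    bits_per_second = link_gbps * 1e9 * efficiency     # decimal gigabits
    seconds = bits / bits_per_second
    return seconds / 86_400                            # seconds per day


PETABYTE = 1e15  # decimal petabyte, an assumption for this sketch

if __name__ == "__main__":
    for gbps in (1, 10, 100):
        print(f"{gbps:>4} Gbps: {transfer_days(PETABYTE, gbps):7.1f} days")
```

At full line rate this prints about 92.6 days at 1 Gbps, 9.3 days at 10 Gbps, and 0.9 days at 100 Gbps. Real-world throughput will be lower, which is what the `efficiency` parameter is for; the point is that every tenfold increase in bandwidth cuts the time to move the same data by a factor of ten.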
But more important, I wonder how much data you can actually use at a time. A high proportion of those petabyte-scale loads is just logfiles or other historical data, and its business use is for analytics and BI and so on. How much of it, I wonder, is locked away in Glacier or equivalent, for compliance or because you might need it someday? People don’t realize how freaking huge a petabyte is. [Here’s a fairly hilarious piece, Half a Billion Bibles, from this blog eighteen years ago, when I first ran across the notion of petabytes of data.]
So if you were going to move clouds, I really wonder how much actual live data you’d need to bring across for your apps to get on the air. Yes, I acknowledge that there are scientific and military-intelligence and suchlike apps that really do need to pound the petabytes all day every day, but my guess is the proportion is small.
So, data gravity might keep you from moving your analytics. But would it really keep you from moving your production work?
Of course, if you did that, you’d be doing…
Multi-cloud! · Boy, is that a fashionable buzzword. Type it into your nearest search engine and look at the frantic pitching on the first page. ¶
First, let’s establish a fact of life: More or less every large enterprise, public or private sector, is already multi-cloud or soon will be. Why? If for no other reason, M&A.
If you survey the application inventory of almost any big household-name enterprise, you’re gonna find a devil’s-brew of mainframes, Windows, Linux, COBOL, C, Java, Python, relational, key/value, and probably eleven different messaging systems. Did they plan it that way? What a dumb question, of course not. It’s the way businesses grow. Enterprises will no more be able to arrange that all their divisions are on the same cloud provider than they can force everybody onto Java 11 or Oracle 19, or get rid of COBOL.
So whatever you think of the lock-in issue, don’t kid yourself you can avoid multi-cloud.
War story · There’s this company I’ve been working with that’s on GCP but at one point was on AWS. They have integrations with a bunch of third-party partners, and those are still running on AWS Lambda. So they have these latency-sensitive retail-customer-facing services that routinely call out from a Kubernetes pod on GCP over to a Lambda function. ¶
When I found out about this I was kind of surprised and asked “But does that work OK?” Yep, they said, solid as a rock and really low latency. I did a little poking around and a public-cloud networking-group insider laughed at me and said “Yeah, that happens all the time, we talk to those guys and make sure the transfer points are super optimized.”
Sounds like an example from the future world of…
Multi-cloud applications · Which, I admit, even given the war story above, I’m not really crazy about. Yeah, multi-cloud is probably in your future. But, in a fairly deep way, the public clouds are remarkably different from each other; the same word can mean entirely different things from one platform to the next. ¶
Which means that people who genuinely have strong skills on more than one public-cloud platform are thin on the ground. I’m not sure I know any. So here’s maybe the most important issue that nobody talks about.
People costs · Everybody I know in the tech business is screaming for more talent, and every manager is working the hell out of their personal networks, because their success — everyone’s success — is gated on the ability to hire. ¶
For a variety of reasons, I don’t think this is going to change in the near or medium term. After that I’m dead so who cares.
So if I’m in technology leadership, one of my key priorities, maybe the most important, is something along the lines of “How do I succeed in the face of the talent crisis?”
I’ll tell you one way: Go all in on a public-cloud platform and use the highest-level serverless tools as much as possible. You’ll never get rid of all the operational workload. But every time you reduce the labor around instance counts and pod sizes and table space and file descriptors and patch levels, you’ve just increased the proportion of your hard-won recruiting wins that go into delivery of business-critical customer-visible features.
Complicating this is the issue I just mentioned: Cross-cloud technical expertise is rare. In my experience, a lot of strategic choices, particularly in start-ups, are made on the basis of what technologies your current staff already know. And I think there’s nothing wrong with that. But it’s a problem if you really want to do multi-cloud.
What would I do? · If I were a startup: I wouldn’t think twice. I’d go all-in on whatever public cloud my CTO wanted. I’m short of time and short of money and short of people and all I really care about is delivering customer-visible value starting yesterday. I just totally don’t have the bandwidth to think about lock-in; that’s a battle I’ll be happy to fight five years from now once I’ve got a million customers. ¶
If I were a mainstream non-technical enterprise: My prejudice would be to go with Plan C, Managed OSS. Because that open-source software is good (sometimes the public cloud proprietary stuff is better, but usually not). Because in my experience AWS does a really good job of managing open-source services, so I assume their competitors do too. Because these organizations care about velocity and hiring too, but they also have those Board members saying not to get locked in.
So, who does that leave that should pick Plan B? I’m not sure, to be honest. Maybe academic researchers? Intelligence agencies? Game studios? I suppose that for an organization that routinely has to code down to the metal anyhow, the higher-level services have fewer benefits, and if you’ve got grad students to keep your systems running, the “managed service” aspect is less critical.
I’d hate to be in that situation, though.
Then there’s politics · We’re heading into an era where Big Tech is politically unpopular (as is Big Business in general) and there’s a whole lot of emerging antitrust energy in the USA and Europe. Meanwhile, the Chinese autocrats are enjoying beating up on tech companies who see themselves as anything other than vehicles for building Party support. ¶
What’s going to happen? I don’t know. But if you’re a really big important organization and you feel like lock-in is hurting you, I suggest you call your nearest legislator. I suspect they’ll be happy to take your call. And make sure to tell your public-cloud vendor you’re doing it.