Veteran platform engineers share lessons learned, wish lists

Platform engineering is the new DevOps, but few companies have reached maturity; reps from early adopter orgs shared tips, caveats and a call to action at KubeCon.

CHICAGO -- Platform engineering maturity has become the aspiration of many DevOps shops, but most organizations remain stymied at a low rate of adoption for internal platforms.

Platform engineering is the practice of providing internal self-service platforms for application developers that hide the complexity of cloud-native infrastructure. It has largely replaced the notion of the full-stack DevOps engineer who is tasked with all aspects of building and running an application in production. Instead, platform engineering evolved as IT skills shortages revealed the flaws in the "you build it, you own it" concept, and had its unofficial coming-out party a year ago.

While platform engineering offers a more pragmatic path forward for companies where developers must quickly ship code to ever-more-complex infrastructure, it has its own stumbling blocks. At this year's KubeCon + CloudNativeCon, experienced platform engineers from early adopter companies such as Intuit, American Airlines, Adobe and Expedia described the benefits they've reaped from platform engineering, as well as the ways they have overcome its challenges.

The benefits discussed by platform engineers at companies such as American Airlines, which built its Runway internal platform over the last two years, included improved compliance and security, fewer duplicated efforts and more cohesive infrastructure management. The platform also led to a more consistent and efficient developer experience for the airline.

"The end goal and the benefit is to make our developers' lives easier through self-service and automation through a single platform, and not make them jump between multiple platforms to get their work done," said Karl Haworth, architect of the developer accelerator product group at the airline, in an interview during the conference this month. "Before Runway, developers told us it could take months to get an application to one of our cloud partners, and we reduced that to about 20 minutes, with security built in."

But that doesn't mean the platform, which is based on the open source Backstage framework created at Spotify and donated to the Cloud Native Computing Foundation (CNCF) two years ago, is always an easy sell, Haworth added.

"One thing that we have a lot of at our company is the whole 'not built here' mentality," he said. "So if it wasn't built within the group, they're a little fearful with that."

The art of growing developer buy-in

Before Runway at American Airlines, the status quo clearly needed improvement: Decentralized application pipelines came with deployment delays, and developers eventually "wanted no part" of managing Kubernetes environments, Haworth said. New IT management also helped push the company toward adopting a centralized platform. But simply mandating the use of Runway wouldn't have succeeded at American, he said.

Instead, as many enterprises have discovered, shifting to a platform engineering approach requires IT operations teams to become product managers and marketers, attracting developers as internal customers.

"With new management, there's a focus more on architectural standards that are implemented into [application] templates, so users will be getting similar experiences all the time versus creating all these snowflakes, which also damage adoption," Haworth said. "It's definitely a challenge, but we've picked certain strategic pieces that allow us to bring the users in to explore and check things out naturally."

The Runway team began by replacing the previous ticketing system developers used to request Azure cloud infrastructure resources with the Runway self-service portal. This began with requests for Azure service principals in the identity management realm and Azure resource groups.

"That got [developers'] feet wet and in the door," Haworth said. "The benefit of being able to tackle those smaller items is then being able to build them into larger workflows. … If [they're] standing up, say, a database that's using Azure, that service principal and resource group creation [process] is reused."
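The reuse Haworth describes can be pictured as small, independently useful steps that a larger workflow calls in turn. The following is a minimal sketch under stated assumptions, not Runway's implementation: the `Request` record, the function names and the `sp-`, `rg-` and `db-` naming scheme are all hypothetical, and each step only records what it would create instead of calling a cloud API.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical record of what a developer asked the portal for."""
    app_name: str
    steps_run: list[str] = field(default_factory=list)
    resources: dict[str, str] = field(default_factory=dict)


def create_service_principal(req: Request) -> Request:
    # Placeholder only: a real step would call the cloud provider here.
    req.resources["service_principal"] = f"sp-{req.app_name}"
    req.steps_run.append("service_principal")
    return req


def create_resource_group(req: Request) -> Request:
    # Placeholder only, same as above.
    req.resources["resource_group"] = f"rg-{req.app_name}"
    req.steps_run.append("resource_group")
    return req


def create_database(req: Request) -> Request:
    # The larger workflow reuses the two smaller steps before adding its own.
    req = create_resource_group(create_service_principal(req))
    req.resources["database"] = f"db-{req.app_name}"
    req.steps_run.append("database")
    return req


result = create_database(Request(app_name="booking"))
print(result.steps_run)
```

In this sketch, a developer who has already used the two smaller requests on their own meets the same steps again inside the database workflow, which is the kind of familiarity the quote points to.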

Other platform engineers told similar stories about starting small during a panel session at the BackstageCon colocated event at KubeCon. Making the platform a hub where developers can easily find documentation and manage compliance for applications was among the successful strategies panelists recommended. Above all, a good developer experience and speed of code deployments should be the priority, the panelists said, and developers must be shown directly the productivity gains the platform has given them.

The platform team at Expedia Group brought compliance checks and production readiness workflows into its platform early on, said Guillermo Manzo, senior manager of software development engineering at the travel tech company, during the panel. Expedia Group was an early adopter of Backstage three years ago.

"[By] helping those teams scale their platforms first, they're able to bring compliance to the rest of the software catalog," he said. "That's how we [got] the majority of our adoption for Backstage … although the software catalog took a while to take off, that was my bread and butter."

Automated compliance checks were offered to developers as an onboarding tool under a program called Ship on Day One, Manzo said.

"We're able to offer to developers, 'Hey, if you use this template, you pass all these [compliance] controls up front,'" he said. "If you're able to build out a paved road process that gives you all that up front, you're saving not only yourself time, but you've saved all the developers that you're going to go reach out to time as well. … It's a great selling point."

A platform engineering panel session at KubeCon + CloudNativeCon 2023, featuring, from left, Joe Natale, Discover Financial; Srinivas Peri, Adobe; Abby Bangser, Syntasso; Colin Griffin, Krumware; and Josh Gavant, Red Hat.

Platform engineering is a continuous process

Done right, platform engineering can be appealing to enterprises frustrated by the inefficiencies of a decentralized approach, but most organizations are still struggling to expand adoption. Backstage officials, for example, have estimated that most deployments are stuck at around 10% adoption.

Most enterprises are in the early to middle stages of a recently released CNCF platform engineering maturity model, said Josh Gavant, app platform solution architect at Red Hat, in an interview at KubeCon. Gavant is a technical lead for the CNCF Technical Advisory Group for application delivery and Platforms Working Group.

"Level two, where you're kind of operational but not necessarily at a product level, scalable -- we suspect that a lot of our customers are at that level," Gavant said. "Hopefully, this maturity model will motivate them to find a way to level three."

While true full-stack engineering skills are difficult if not impossible to come by, platform engineering is ripe for reverting to silos of developers and operations, with communication gaps and conflicts between them, said Nicholas Hughes, CEO at IT automation consulting firm EITR Technologies in Sykesville, Md.

"I've seen toxic platform teams that treat developers as the idiot users the way desktop support teams have traditionally done," Hughes said. "Like, you know, 'You're always screwing up your laptop' [becomes] 'Oh, you're always screwing up your application.' Culture is a big part of doing platforms right."

The Agile concept of continuous improvement is also indispensable for effective platform engineering, said Srinivas Peri, director of engineering for Adobe's Ethos cloud foundation platform, during a KubeCon panel session.

"[The scenario of,] 'You build the platform and then [developers] commit' will never, ever happen -- adoption means, 'Whenever you build it, they will come,'" Peri said.

For example, in 2015, the Adobe platform began with Mesosphere's DC/OS for container orchestration. In 2019, the platform team migrated to Kubernetes behind the scenes without disruption to users.

"But that's not just it -- with Kubernetes, if you keep abstracting and moving, you're actually making it slow, so then you need to get into GitOps," he said. "Now we are going into another phase where we have to continue that journey again, we have to build a new set of platform templates. So depending upon the maturity of where you are, it's an ongoing journey."

Platform engineers call for better open source interoperability

While culture and organizational friction remain the biggest hurdles to effective platform engineering, tools are also still maturing, from developer portal frameworks to the cloud-native open source projects that surround Kubernetes. The CNCF's Platforms Working Group is defining a set of tools, processes and services it considers key parts of the platform engineering maturity model, but there's still more work to do to make cloud-native tools more interoperable for platform engineers, according to another KubeCon panelist.

"Our Kubernetes clusters have a lot of CNCF projects in them that all need to be updated, a lot of times in a certain order that potentially has problems," said Mel Cone, senior software engineer at The New York Times, during a media panel session at KubeCon. "For example, we use Cilium and Istio, and usually we have to upgrade those before upgrading our Kubernetes nodes. And it's really tough to know, 'What are the potential conflicts if one version is broken? Is there a certain order that you need to do it in when it's the end of life for this? Does that conflict with the end of life for that?' Ideally, [there would be] some sort of [CNCF project] that can maybe run a test to confirm that things are working."

Even at mature platform engineering organizations such as Intuit, where the platform team upgrades 350 Kubernetes clusters every month, the management of add-ons such as service mesh and logging during that process can be a complex problem to solve, according to Mukulika Kapas, director of product management for Intuit's Modern SaaS platform. 
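The ordering problem Cone describes, in which Cilium and Istio are usually upgraded before the Kubernetes nodes, can be modeled as a dependency graph and resolved with a topological sort. The following is a minimal sketch under stated assumptions: the component labels and the `must_follow` mapping are hypothetical, and the code only orders upgrades and flags contradictory constraints. It does not test version compatibility, which is the harder gap the panelists raised.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical constraints: each key is upgraded only after everything in its set.
must_follow = {
    "kubernetes-nodes": {"cilium", "istio"},
    "cilium": set(),
    "istio": set(),
}


def upgrade_order(constraints):
    """Return one valid upgrade order, or raise ValueError if constraints conflict."""
    try:
        return list(TopologicalSorter(constraints).static_order())
    except CycleError as err:
        # err.args[1] holds the components that form the circular dependency.
        raise ValueError(f"conflicting upgrade constraints: {err.args[1]}") from err


order = upgrade_order(must_follow)
print(order)
```

With these constraints, `kubernetes-nodes` always comes last, while the relative order of `cilium` and `istio` is left open because nothing in the mapping ties one to the other. A pair of constraints that require each component to precede the other raises `ValueError` instead of producing an order.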

Intuit got a leg up on this by acquiring Applatix in 2018, which helped it start creating robust cluster management tools such as Add-on Manager to thoroughly test Kubernetes-adjacent resources including Istio, Open Policy Agent, AWS ALB ingress controller and Fluentd before upgrading, but it's still a challenge, Kapas said.

"[Amazon] EKS now takes care of our cluster control plane but doesn't manage worker nodes, where we have all these tools, a lot of software running that they don't manage," she said. "We don't want to be in the business of upgrading. We want the cloud vendors to take it over, but we can't because of all the add-ons."

Groups within CNCF, including the Kubernetes long-term support working group, are thinking about these issues. But there's still work to do, said Emily Fox, senior principal software engineer at Red Hat and chair of the CNCF Technical Oversight Committee, during a CNCF governing board town hall meeting held at KubeCon.

"This is a muscle that a lot of our projects need to gain more experience in and unfortunately, they don't have a lot of end-user contributors to help guide them down this path," Fox said.

As an end user of the Falco threat detection project, Fox said she helped initiate a testing program in the past to ensure new Linux kernel releases didn't break the tool, but that process should be done more systematically across CNCF, rather than within individual projects, she said.

"We need to figure out how to scale it up and make it better," Fox said. "Right now, a lot of our graduation criteria is more about the resiliency and stability of the [individual] release, but we've not actually gone down the path of, 'What does the upgrade look like?'"

Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.
