agsandrew - Fotolia

Ticketmaster unifies DevOps monitoring with Confluent Kafka

Ticketmaster troubleshoots and secures distributed applications using Kafka stream processing as a central hub for data collection.

Confluent Kafka stream processing gives Ticketmaster Entertainment Inc. a cohesive system for data collection from distributed applications, and simple integration with DevOps monitoring and security analytics tools.

As the ticket sales and distribution company began a move four years ago from monolithic applications to a microservices architecture as part of its DevOps transformation, it discovered that data management would become more critical and more difficult.

"To know what is going on at the individual [transaction] layer with a great deal of precision requires integration with every one of these systems for each and every project that you do," said Chris Smith, vice president of engineering and data science at Ticketmaster, based in Beverly Hills, Calif. Some IT monitoring systems were set up to aggregate data from applications spread across clusters of virtual machines, but monitoring a more granular microservices infrastructure required visibility into each specific component that wasn't achievable using traditional methods, Smith said.

"When you've got n technologies, and for us is pretty big because we've been around for 40 years, [monitoring] becomes an n-squared problem," he said. "Data wasn't a centerpiece of technology strategy 40 years ago."

Kafka stream processing sorts jumbled data

Rather than attempt to integrate all of the components of some 300 applications hosted on more than 4,500 virtual machines between the company's on-premises data centers and AWS cloud infrastructure,  Ticketmaster set up a centralized data lake for application telemetry, and fed it using a set of Apache Kafka stream processing pipelines.

To know what is going on at the individual [transaction] layer with a great deal of precision requires integration with every one of these systems for each and every project that you do.
Chris SmithVice president of engineering and data science, Ticketmaster

In the Kafka system, supported in Ticketmaster's deployment by commercial vendor Confluent, Inc., applications send streams of data to repositories known as topics through Kafka's Producer API, and read them from a central Kafka server cluster using the Consumer API. It's a distributed version of application message brokering systems that use a publication/subscription architecture to asynchronously share information. By contrast, however, the Kafka architecture involves less overhead than traditional approaches because it delegates read tracking tasks to consumers rather than using the central cluster's resources.

"Each technology [only] has to be connected to Kafka, and through that, they're effectively connected to each other," Smith said. Each application record stored in the Confluent Kafka system includes a timestamp, which helps correlate events and transactions consistently among various application components.

Confluent Kafka support eases growing pains

Confluent, which was founded by the developers who created Kafka at LinkedIn in 2014, supports open source Apache Kafka the way Red Hat supports open source Linux. Ticketmaster leans on Confluent support to manage its Kafka stream processing back end, which can be a complex undertaking.

"The thing that people often ask about, and the thing I always harp on, is the Schema Registry that Confluent provides," Smith said. Without it, changes to data streams potentially disrupt consumers attached to them, and fragility within the system can build up as a result.

"You can really get into a where you've got like tech debt, just killing you," he said. "Once you start getting to dozens [of systems], it really becomes a massive headache that can put you into operational gridlock, where you literally can't make any changes at all."

Confluent support was also key to untangling a poorly provisioned Kafka cluster as Ticketmaster's use of stream processing grew.

"At one point in production, we had well over 10,000 partitions on one of our four clusters -- there's not really a good justification for that," Smith said. "It was just a function of people not understanding how to size their topics and manage their software."

Correcting this problem, however, required massive amounts of data deletion. While Kafka federates data consumption by default, writes remain centralized in an Apache ZooKeeper data store, which was overwhelmed by attempts to delete most of those 10,000 partitions.

"[Confluent engineers] walked through with us how to stop all those deletions, so that we can redo them in a more throttled fashion, and ensure that nothing was lost and that no operational risk was created," Smith said.

LightStep, Vowpal Wabbit discern patterns in Kafka streams

With data centralized via the Confluent Kafka cluster, Ticketmaster's next step was to apply data analytics tools to troubleshoot IT incidents and secure its applications against rapidly evolving threats.

Initially, Ticketmaster used the open source Jaeger project for distributed tracing of customer transactions through its applications. Real-time data collected through Kafka meant it could use the distributed tracing system to prioritize ordinary consumers over those that would abuse the system.

"Basically, the bad actors abuse our system to make sure they have priority access to tickets," Smith said. These bad actors then resell tickets at a marked-up price for their own gain. "Having a holistic view of all the activity in the system allows us to build machine learning models that can combat that abuse, and more importantly, prioritize the fans' ability to get access ahead of those bad actors."

Eventually, custom interfaces built in-house for Jaeger proved too cumbersome to manage, and Ticketmaster enlisted distributed tracing vendor LightStep to analyze and visualize data.

"[Jaeger] was difficult and slow, not in terms of getting query results back, but slow in terms of humans figuring out how to ask the right question and get the right answer," Smith said. "[LightStep] wasn't just visualizing the data, it was navigating the data past that initial visualization."

For security analytics, Ticketmaster uses an open source machine learning engine written at Microsoft called Vowpal Wabbit to stay ahead of attacks on its systems. Here, the centralized and real-time nature of Kafka data streams is crucial, Smith said.

"[Attackers] learn from our behavior and adjust their behavior accordingly," he said. "Whatever [threat] model you have … 15 minutes later, they've shifted their behavior. So, you really need an online learning mechanism in order to react quickly enough to the changing environment."

DevOps monitoring, data analytics choices abound

Other enterprises have fed Kafka streams into AIOps software from companies such as Moogsoft, Inc. to apply machine learning for IT incident remediation, but Ticketmaster has data scientists in house who can design their own security analytics algorithms using the open source machine learning engine, Smith said.

Meanwhile, there are many data analytics tools available for companies that want to centralize DevOps monitoring data, from traditional data warehouses by IBM, Teradata and SAS Institute, to Hadoop distribution vendors like Cloudera, Hortonworks and MapR and specialists such as MemSQL, SQLstream and Striim. Confluent Kafka competes directly with data stream processing services from cloud providers such as Amazon Kinesis and Google Cloud Dataproc. Microservices users can also apply a service mesh to centrally manage IT security and telemetry alongside tools such as Kafka, though Smith said service mesh remains too complex for his taste.

"I'm trying to avoid having Envoy wrap all communications with Kafka, which I think is just going to make everything more complicated, and [therefore] create new risks to the operation of Kafka," he said. "I would rather just have direct exposure to [Kafka] brokers and leverage their content."

Next Steps

Kafka users detail real-time data benefits

Confluent goes IPO as Kafka event streaming goes mainstream

Dig Deeper on IT systems management and monitoring

Software Quality
App Architecture
Cloud Computing
Data Center