Log monitoring has become both more critical and more cumbersome as cloud-native applications go mainstream, but subtle changes to log management and distributed infrastructure tools can pay big dividends.
For many enterprises, log monitoring has gone from a borderline luxury to a must-have way to troubleshoot complex microservices applications and software-defined IT infrastructures. That's because developers must comb through system logs to troubleshoot subtle problems in application code, and as IT resources become more ephemeral, highly granular log data captures information about individual systems that coarser methods don't catch.
However, as applications move from monoliths that run on a single server or high-availability pair to microservices that rely on multiple layers of virtual computing, the volume of log data generated by IT systems can quickly become unmanageable. Trying to keep historical log data on hand for incident reviews, forensic investigations or root cause analysis compounds the problem further.
"You never log less," said Bob Zoller, founding engineer at Good Eggs, a grocery delivery service in San Francisco. "We approached a point ... where we noticed we were spending an awful lot on [log management] and had to decide how much value we were getting out of it and if the numbers added up."
The firm, which has about 30 engineering employees, saw its log volumes increase from 10 GB per day in 2015 to more than 200 GB per day by 2018. That growth resulted in a bill of about $160,000 for the year from log management and analytics provider Sumo Logic.
"At the time, that was second only to our AWS bill," Zoller said.
The company took steps in 2018 to reduce the amount of data it stored with Sumo Logic for debugging. Rather than send everything to Sumo Logic, it streamed logs to a self-managed set of AWS S3 buckets, then transferred logs into Sumo Logic only when they were needed for debugging.
"If engineers wanted to do a query in Sumo Logic, they had to tell a chatbot what application and timeframe they cared about," Zoller said. "There were some things we had to give up, such as scheduled searches that were taking advantage of the fact that we were streaming all our logs into Sumo, but it was an OK tradeoff for us."
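The on-demand lookup Zoller describes, where an engineer names an application and a timeframe before any data is loaded, can be sketched roughly as follows. This is only an illustration: the article does not describe how Good Eggs organized its S3 buckets, so the hourly key layout, the function name and the `checkout` application name are all hypothetical.

```python
from datetime import datetime, timedelta


def keys_for_request(application, start, end, prefix="logs"):
    """List the hourly archive keys that cover [start, end] for one application.

    Assumes a hypothetical key layout of
    <prefix>/<application>/<YYYY>/<MM>/<DD>/<HH>.log.gz
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Round the start down to the top of its hour so the first file is included.
    hour = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while hour <= end:
        keys.append(f"{prefix}/{application}/{hour:%Y/%m/%d/%H}.log.gz")
        hour += timedelta(hours=1)
    return keys


# Example request: one application, a window that crosses midnight.
wanted = keys_for_request(
    "checkout",
    datetime(2018, 5, 1, 22, 30),
    datetime(2018, 5, 2, 0, 15),
)
for key in wanted:
    print(key)
```

Only the objects on that list would then need to be copied into the analytics service, which is where the savings come from, and also why engineers have to decide what they are looking for before they can search.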
In late 2019, however, Good Eggs participated in the beta release of a new Sumo pricing tier for infrequently accessed data, which costs about a tenth of the $2 to $3 per gigabyte price for the full-fledged Sumo Logic service.
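A back-of-the-envelope calculation shows why a tier priced at about a tenth of the standard rate matters at this volume. The sketch below uses the figures quoted in this article (200 GB per day, $2 to $3 per gigabyte) and assumes, purely for illustration, that the volume is constant for a full year and that the per-gigabyte price applies to every gigabyte sent. The article does not describe Sumo Logic's actual billing model, so treat the output as an estimate, not a quote.

```python
def annual_cost(gb_per_day, price_per_gb, days=365):
    """Estimated yearly cost if every gigabyte is billed at a flat rate."""
    return gb_per_day * price_per_gb * days


GB_PER_DAY = 200            # daily log volume cited in the article
STANDARD_PRICES = (2.0, 3.0)  # dollars per GB, low and high end of the quoted range
DISCOUNT_FACTOR = 10        # infrequent-access tier costs "about a tenth"

estimates = []
for price in STANDARD_PRICES:
    full = annual_cost(GB_PER_DAY, price)
    infrequent = annual_cost(GB_PER_DAY, price / DISCOUNT_FACTOR)
    estimates.append((price, full, infrequent))
    print(f"${price:.2f}/GB: standard ${full:,.0f} per year, "
          f"infrequent access ${infrequent:,.0f} per year")
```

Under those assumptions the standard rate works out to between $146,000 and $219,000 a year, the same order of magnitude as the bill Zoller cites, while the discounted tier lands between $14,600 and $21,900.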
"You don't realize what you have until it's gone," Zoller said. "When we were streaming all of our logs in, you take for granted that everything is at your fingertips on demand any time you want to ask a question -- thinking about the application and timeframe you want before doing a search slows down developer productivity."
The infrequent access tier spares developers the extra step of refining their search and loading the right data into Sumo Logic from S3. It's still in beta, so Zoller can't yet compare the cost of the feature to that of the existing S3 system, but he hopes it will allow his company to resume sending all its log data to Sumo Logic at a cost more comparable to its self-managed workaround.
As the new log monitoring system moves to general availability, Zoller said he hopes Sumo Logic develops tools that let engineers assess ahead of time whether a query is worth the cost to run. Sumo Logic reps said this feature will be included when the product becomes generally available later this year.
HAProxy 2.0 simplifies log monitoring, sharpens visibility
DoubleVerify, an ad verification service provider headquartered in New York, began replacing hardware load balancers from F5 Networks with software-based HAProxy systems three years ago in an effort to reduce equipment costs. It took considerable effort to get software-defined infrastructure to replicate hardware-level functionality for the company's billions of web requests per day.
"We were replacing dedicated network devices with software-based load balancers, which had not really been done at that point, at least not at that scale," said Wally Barnes III, senior systems reliability engineer at DoubleVerify. "Most people put web servers behind HAProxy and Nginx -- replacing network devices meant I had to dive into the guts of just what HAProxy could do, what tunings we could do, all the way down to the headers."
Moving HAProxy load balancers into production also meant learning how to take advantage of an elastically scalable pool of systems instead of a fixed set of physical devices. That work often meant consulting the vast amounts of log monitoring data the HAProxy systems generated.
"This is billions of requests daily -- how do we deal with all this logging data?" Barnes said. "This was something we had to solve, and we found out real quick that we couldn't turn full logging on for every one of those systems. There was no place to put it."
An early attempt to collect log monitoring data at just one of four data centers in New York hit data transfer limits on the company's Splunk log analytics system in just 15 minutes, Barnes recalled.
The company took its own steps to send partial log data from a portion of its HAProxy instances to Splunk with a combination of syslog data collection and a RabbitMQ pub/sub transfer system. But HAProxy 2.0, released in July 2019, includes native log sampling features that eased the company's log management burden and gave it better visibility into the software-based load balancer pool than its homegrown approach did. DoubleVerify has just finished its initial proof-of-concept testing for the new version.
"Now I can turn up the logging to 'full' and get request-level information from every system, just not as much of it," Barnes said. "That gives me a more holistic view of what's going on and gives developers more insight into systems."
The company's past piecemeal approach sometimes missed events that didn't hit the servers it collected data from within the pool. With HAProxy 2.0, Barnes expects to eliminate such blind spots.
"The results we could see were pretty indicative of what was going on, but we couldn't see every request," he said. "Sometimes there's a burst of traffic to one system, and this allows us to pick up on that."
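The difference between the two approaches Barnes describes can be illustrated with a small simulation. This is a simplified sketch, not HAProxy's implementation or DoubleVerify's setup: the server names, request counts and the one-in-10 sampling rate are invented for the example. It compares full logging on a subset of load balancers with sampled logging on all of them when one system receives a burst of traffic.

```python
from collections import Counter


def subset_logging(requests, logged_servers):
    """Keep every request, but only from servers that have logging turned on."""
    return [r for r in requests if r[0] in logged_servers]


def sampled_logging(requests, every_nth):
    """Keep the first request and then every `every_nth` one, per server."""
    seen = Counter()
    kept = []
    for server, path in requests:
        if seen[server] % every_nth == 0:
            kept.append((server, path))
        seen[server] += 1
    return kept


# Hypothetical traffic: three quiet load balancers and a burst on lb4.
requests = [(f"lb{i}", "/ad") for i in (1, 2, 3) for _ in range(10)]
requests += [("lb4", "/ad")] * 100

subset_view = Counter(s for s, _ in subset_logging(requests, {"lb1"}))
sampled_view = Counter(s for s, _ in sampled_logging(requests, 10))

print("subset logging sees: ", dict(subset_view))
print("sampled logging sees:", dict(sampled_view))
```

In this toy case the subset approach stores 10 records and never sees lb4, while sampling stores 13 records and still shows lb4 carrying roughly 10 times the traffic of its peers. That is the trade Barnes describes: less detail from any one system, but some detail from all of them.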