Last updated on May 7, 2020
Originally published at: https://happyvalley.dev/cert-manager-and-datadog-logging/
While we’re not fully sold on K8S everywhere at Happy Valley, we do love having a flexible platform to deploy to. When we do use K8S, two of the services that we deploy to every cluster we spin up are cert-manager and DataDog. If you’ve not run into them before, cert-manager is a slick tool that handles TLS for your K8S ingresses. DataDog is a kind of all-in-one monitoring solution that does far too many things for me to list them all here. In this article, we’re interested in the DataDog log aggregation functionality.
Anyway, on to the story. A little while ago, our DataDog alarms triggered on a high volume of error logs for a K8S deployment. Thankfully this happened in daytime hours so on-call was able to jump on it pretty quick. The problem was with our K8S cert-manager logging and how it was being handled in DataDog.
For those that haven’t used it before, DataDog’s logging functionality is pretty cool. By indexing the log feeds coming in we can take our unstructured stream of data and turn it into structured data. It makes using log data alongside conventional application metrics much easier.
DataDog has a bunch of integrations out of the box. By configuring the source attribute for the logs that we’re generating, we can signal to DataDog which integration to use and avoid writing a bunch of parsing logic/regex ourselves.
Regrettably, there’s no cert-manager integration yet. This means that DataDog has no way to figure out the log level of the logs and defaults to error. While DataDog’s documentation is good overall, I found it hard to figure out where to begin with this issue. I’ve written up our fix below for anyone else who’s struggling with this.
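For a concrete picture of how the source attribute gets set in the first place: on K8S it’s typically done with a DataDog autodiscovery annotation on the pod template. The snippet below is only a sketch: the deployment name (my-app), container name (web), source (nginx) and service (my-app) are placeholders for your own workload.

# Hypothetical names throughout: deployment "my-app", container "web",
# source "nginx", service "my-app".
kubectl patch deployment my-app --type=merge -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "ad.datadoghq.com/web.logs": "[{\"source\": \"nginx\", \"service\": \"my-app\"}]"
  }}}}}'

You’d normally bake the annotation into the deployment manifest rather than patching it in afterwards, but either way the agent tags that container’s logs with the given source and DataDog applies the matching integration pipeline.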
Cert Manager Logging
There are a few types of logs that show up from cert-manager from time to time. Problematic logs that we want to be tagged as an error:
E0124 12:10:23.593744 1 indexers.go:93] cert-manager/secret-for-certificate-mapper "msg"="unable to fetch certificate that owns the secret" "error"="Certificate.cert-manager.io "website-tls" not found" "certificate"={"Namespace":"production-website", "Name":"website-tls"} "secret"={"Namespace":"production-website","Name":"website-tls"}
And logs that – while interesting – aren’t errors:
I0124 12:10:23.539553 1 controller.go:242] cert-manager/controller-runtime/controller "level"=1 "msg"="Successfully Reconciled" "controller"="apiservice" "request"={"Namespace":"","Name":"v1beta1.metrics.k8s.io"}
Since DataDog can’t distinguish between them yet, let’s help it along.
Building the log pipeline
In fixing this issue I set up a new log processing pipeline with a series of log processors. First, I grabbed all of our cert-manager logs for the last month and exported them from DataDog:

I noticed that each message body starts with a logging code (e.g. I0124), so I figured I’d try to get unique codes. I ran a bash one-liner:
cat export.csv | cut -d ',' -f 5 | cut -d ' ' -f 1 | sed 's/"//g' | sort | uniq
Which resulted in this output:
E0109 E0110 E0123 E0124 F0122 I0113 I0124 message W0114 W0123
Huh. That message is probably an indication that I was doing something silly by using cut on a csv without thinking about non-escaped comma chars in the logs themselves. If we ignore this anomaly, we see codes starting with a letter that indicates the log level: I = info, E = error, W = warn and F = fatal. This is sweet! That’s a pretty obvious mapping onto the log levels I’m used to.
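Incidentally, if you want to sidestep the CSV-comma problem altogether, you can count the codes straight from the cluster instead of from a DataDog export. A hedged sketch, assuming cert-manager runs as a deployment named cert-manager in the cert-manager namespace (the chart default):

# Assumes the default install layout: namespace "cert-manager",
# deployment "cert-manager". Adjust for cainjector/webhook as needed.
kubectl -n cert-manager logs deploy/cert-manager --tail=5000 \
  | awk '{ print substr($1, 1, 1) }' | sort | uniq -c

This prints a count per leading letter (E, F, I, W) rather than per full code, which is all we actually need for the mapping below.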
To map these log level codes onto DataDog’s log levels, I created a new log pipeline for cert-manager-cainjector:

The pipeline setup UI left a little to be desired when I last used it: it seems to default to only previewing what would have been matched in the last hour. I’d recommend jumping on any log issue as it happens to make this bit easier.
There were three processors that I had to create in this pipeline: first, the grok parser to get the initial character out of the log message; then the lookup processor to map that character to something DataDog understands; and finally the status remapper to set the log status attribute on the log line.
The grok parser is a combination of regex and DataDog’s parsing language. I didn’t find this super intuitive, but I don’t do a lot of string manipulation day-to-day, so ymmv:
first_char %{regex("[a-zA-Z]{1}"):log_code}.*
In the UI:

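If you want to convince yourself what that rule captures before pasting it into DataDog, the regex part boils down to “take the first ASCII letter of the message”. A quick local equivalent (sample lines abbreviated):

# Local sanity check only; DataDog applies the real grok rule server-side.
printf '%s\n' \
  'E0124 12:10:23.593744 1 indexers.go:93] cert-manager/secret-for-certificate-mapper ...' \
  'I0124 12:10:23.539553 1 controller.go:242] cert-manager/controller-runtime/controller ...' \
  | grep -oE '^[a-zA-Z]'
# prints E, then I; that letter is what lands in the log_code attribute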
The lookup processor is next in our pipeline. It manually maps an input string to an output string (e.g. “I” -> “info”), and we’re using it to map the level characters from the previous step into the log level words that DataDog can grok. Our config was:
E,error
I,info
W,warn
F,fatal
You can see the rest of our config in the image below:

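If it helps to see the mapping as plain code, it amounts to nothing more than this (an illustration only, not how DataDog implements it; the lookup processor can also take an optional default for codes that aren’t in the table):

# Illustration only: the lookup processor is just a key -> value table.
map_log_code() {
  case "$1" in
    E) echo error ;;
    I) echo info ;;
    W) echo warn ;;
    F) echo fatal ;;
  esac
}
map_log_code E   # -> error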
Now we have an attribute associated with each log line called log_status, which corresponds to the DataDog log levels. Next up, the status remapper. The status remapper is a processor that configures which parsed field to treat as the log level or status. In our case that’s log_status:

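We clicked all of this together in the UI, but for the record, here’s roughly what the same pipeline looks like if you create it through DataDog’s Logs Pipelines API instead. Treat it as a sketch: the filter query, processor names and sample line are placeholders, and it’s worth checking the payload against the current API docs before relying on it.

# Sketch only: filter query, processor names and the sample line are placeholders.
curl -X POST "https://api.datadoghq.com/api/v1/logs/config/pipelines" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-binary @- <<'EOF'
{
  "name": "cert-manager-cainjector",
  "is_enabled": true,
  "filter": { "query": "service:cert-manager-cainjector" },
  "processors": [
    {
      "type": "grok-parser",
      "name": "Extract leading log code letter",
      "is_enabled": true,
      "source": "message",
      "samples": ["E0124 12:10:23.593744 1 indexers.go:93] example message"],
      "grok": {
        "support_rules": "",
        "match_rules": "first_char %{regex(\"[a-zA-Z]{1}\"):log_code}.*"
      }
    },
    {
      "type": "lookup-processor",
      "name": "Map log code letter to status word",
      "is_enabled": true,
      "source": "log_code",
      "target": "log_status",
      "lookup_table": ["E,error", "I,info", "W,warn", "F,fatal"]
    },
    {
      "type": "status-remapper",
      "name": "Use log_status as the official status",
      "is_enabled": true,
      "sources": ["log_status"]
    }
  ]
}
EOF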
And that’s it! Our cert-manager logs are now being reported at the appropriate log level. I’m sure there’s a better way to do this that I haven’t thought of, but this is working great for us. At the very least, we’re no longer getting false alarms from cert-manager logs.
Summary
We’re a super small company, so we really can’t eat the cost of operational overhead unless absolutely required. That’s why it’s vital for us to jump on issues like this early, before they swamp us with bogus alarms. This isn’t the only time we’ve had to do this with DataDog logging, but each time has been pretty much the same process (sometimes just copying pipelines when the sources use the same log convention).
DataDog’s log parsing itself was a little tough to figure out, but their support was extremely helpful and I’ve not had to touch this since I implemented it. If you’ve solved this in another way I’d love to know how in the comments.