Cert Manager and DataDog Logging

Last updated on May 7, 2020

Originally published at: https://happyvalley.dev/cert-manager-and-datadog-logging/

While we’re not fully sold on K8S everywhere at Happy Valley, we do love having a flexible platform to deploy to. When we do use K8S, two of the services that we deploy to every cluster we spin up are cert-manager and DataDog. If you’ve not run into them before, cert-manager is a slick tool that handles TLS for your K8S ingresses. DataDog is a kind of all-in-one monitoring solution that does far too many things for me to list them all here. In this article, we’re interested in the DataDog log aggregation functionality.

Anyway, on to the story. A little while ago, our DataDog alarms triggered on a high volume of error logs from a K8S deployment. Thankfully this happened during daytime hours, so on-call was able to jump on it pretty quickly. The problem was with our K8S cert-manager logging and how it was being handled in DataDog.

For those who haven’t used it before, DataDog’s logging functionality is pretty cool. By indexing the incoming log feeds, we can turn our unstructured stream of data into structured data, which makes using log data alongside conventional application metrics much easier.

DataDog has a bunch of integrations out of the box. By configuring the source attribute for the logs that we’re generating, we can signal to DataDog which integration to use and avoid writing a bunch of parsing logic/regex ourselves.
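For our K8S workloads, the source attribute usually gets set through the DataDog agent’s Autodiscovery annotations on the pod. Here’s a rough sketch of what that looks like; the pod and container names are made up, and in practice you’d put the annotation in the Deployment’s pod template rather than on a live pod:

# Illustrative only: "website-pod" and the "website" container are made-up
# names, and "nginx" is just an example of a source that matches an
# existing DataDog integration.
kubectl annotate pod website-pod \
  'ad.datadoghq.com/website.logs=[{"source":"nginx","service":"website"}]'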

Regrettably, there’s no cert-manager integration yet. This means that DataDog has no way to figure out the level of cert-manager’s logs, so it defaults them to error. While DataDog’s documentation is good overall, I found it hard to figure out where to begin with this issue. I’ve written up our fix below for anyone else who’s struggling with this.

Cert Manager Logging

There are a few types of logs that show up from cert-manager from time to time. First, the problematic logs that we want tagged as errors:

E0124 12:10:23.593744 1 indexers.go:93] cert-manager/secret-for-certificate-mapper "msg"="unable to fetch certificate that owns the secret" "error"="Certificate.cert-manager.io "website-tls" not found" "certificate"={"Namespace":"production-website", "Name":"website-tls"} "secret"={"Namespace":"production-website","Name":"website-tls"}

And logs that – while interesting – aren’t errors:

I0124 12:10:23.539553 1 controller.go:242] cert-manager/controller-runtime/controller "level"=1 "msg"="Successfully Reconciled" "controller"="apiservice" "request"={"Namespace":"","Name":"v1beta1.metrics.k8s.io"}

Since DataDog can’t distinguish between them yet, let’s help it along.

Building the log pipeline

To fix this, I set up a new log processing pipeline with a series of log processors. First, I grabbed all of our cert-manager logs from the last month and exported them from DataDog:

Log view in DataDog.

I noticed that each message body starts with a logging code (e.g. I0124), so I figured I’d try to extract the unique codes. I ran a bash one-liner:

cat export.csv | cut -d ',' -f 5 | cut -d ' ' -f 1 | sed 's/"//g' | sort | uniq

Which resulted in this output:

E0109
E0110
E0123
E0124
F0122
I0113
I0124
message
W0114
W0123

Huh. That message is probably a sign that I was doing something silly by using cut on a CSV without thinking about unescaped comma characters in the logs themselves. If we ignore this anomaly, we can see that each code starts with a letter indicating the log level: I = info, E = error, W = warn and F = fatal. This is sweet! That’s a pretty obvious mapping onto the log levels I’m used to.
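As an aside, a slightly more robust sketch of the same extraction skips the column splitting entirely and pulls the klog-style codes out with a regex. This assumes the codes always look like a severity letter followed by four digits (e.g. E0124), and it could overmatch if similar strings turn up elsewhere in the messages:

# Rough alternative: match the codes directly, then keep only the letter.
cat export.csv | grep -oE '[EWIF][0-9]{4}' | cut -c 1 | sort | uniq -c

Either way, the severity letter is all we actually need from here.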

To map these log level codes onto DataDog’s log levels, I created a new log pipeline for cert-manager-cainjector:

The new pipeline creation screen.

The pipeline setup flow left a little to be desired when I last used it: it seems to default to only showing a preview of what would have been matched in the last hour. I’d recommend jumping on any log issue as it happens to make this bit easier.

There were three processors that I had to create in this pipeline: first, the grok parser to get the initial character out of the log message; then the lookup processor to map that character to something DataDog understands; and finally the status remapper to set the log status attribute on the log line.

The grok parser is a combination of regex and DataDog’s parsing language. I didn’t find this super intuitive, but I don’t do a lot of string manipulation day-to-day, so ymmv:

first_char %{regex("[a-zA-Z]{1}"):log_code}.*

In the UI:

The configuration of the grok parser.
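As a quick local sanity check of the idea (this is just grep, not DataDog’s grok engine, so it only proves out the regex rather than the rule itself), the first character really is all we’re grabbing:

# Sanity check only: extract the first character of a sample message locally.
echo 'E0124 12:10:23.593744 1 indexers.go:93] cert-manager/secret-for-certificate-mapper ...' | grep -oE '^[a-zA-Z]'
# prints: E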

The lookup processor is next in our pipeline. It manually maps an input string to an output string (e.g. “I” -> “info”), and we’re using it to map the level characters from the previous step into the log level words that DataDog can grok. Our config was:

E,error
I,info
W,warn
F,fatal

You can see the rest of our config in the image below:

Configuration of the lookup processor.

Now each log line has an attribute called log_status, which corresponds to DataDog’s log levels. Next up is the status remapper, the processor that tells DataDog which parsed attribute to treat as the log level or status. In our case that’s log_status:

Status remapper configuration in DataDog.
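If you’d rather drive this through the API than the UI, here’s roughly what our pipeline would look like as a single call to DataDog’s v1 Logs Pipelines endpoint. I’ve sketched the processor field names from memory, and the filter query is just a placeholder for however your cert-manager logs are tagged, so double-check everything against the current API reference before relying on it:

# Rough sketch only: processor field names are from memory of the v1 API and
# the filter query is a placeholder; verify against DataDog's API reference.
curl -X POST "https://api.datadoghq.com/api/v1/logs/config/pipelines" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "name": "cert-manager",
    "is_enabled": true,
    "filter": { "query": "service:cert-manager-cainjector" },
    "processors": [
      {
        "type": "grok-parser",
        "name": "Extract the severity letter",
        "is_enabled": true,
        "source": "message",
        "grok": {
          "support_rules": "",
          "match_rules": "first_char %{regex(\"[a-zA-Z]{1}\"):log_code}.*"
        }
      },
      {
        "type": "lookup-processor",
        "name": "Map the letter to a level word",
        "is_enabled": true,
        "source": "log_code",
        "target": "log_status",
        "lookup_table": ["E,error", "I,info", "W,warn", "F,fatal"]
      },
      {
        "type": "status-remapper",
        "name": "Use log_status as the status",
        "is_enabled": true,
        "sources": ["log_status"]
      }
    ]
  }'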

And that’s it! Our cert-manager logs are now being reported at the appropriate log level. I’m sure there’s a better way to do this that I haven’t thought of, but this is working great for us. At the very least, we’re no longer getting false alarms from cert-manager logs.

Summary

We’re a super small company, so we really can’t eat the cost of operational overhead unless it’s absolutely required. That’s why it’s vital for us to jump on issues like this early, before they swamp us with bogus alarms. This isn’t the only time we’ve had to do this with DataDog logging, but each time it has been pretty much the same process (sometimes just copying pipelines if the sources use the same log convention).

DataDog’s log parsing itself was a little tough to figure out, but their support was extremely helpful and I’ve not had to touch this since I implemented it. If you’ve solved this in another way I’d love to know how in the comments.

