Introduction to OpenTelemetry & Distributed tracing — Part III
Working with Collectors
In this article, we will modify our architecture so that trace data is no longer exported to Zipkin directly. Instead, we will use the Collector component of OpenTelemetry to collect all the traces, perform some custom processing, and then forward the trace information to the Zipkin server for storage.
Why use Collectors?
Often, we do not want all traces to end up in the trace database. In a big application the volume of trace data can be huge, and managing the space required to store it can become very costly. You could have requirements like the ones below:
- How about only tracing those requests where the total time for the request exceeds a certain limit?
- How about tracing only requests which are for a critical business application for a certain customer?
- How about dropping health endpoint traces, which create a lot of noise in your trace data?
To handle all such use cases, you need a way to collect all the trace data and then apply custom processing logic to filter out the unwanted traces. This is where the Collector component shines, and it is critical for a mature deployment of a tracing solution.
If we use a Collector, the architecture changes as shown below. As you can see, the trace data is not exported directly to the Zipkin server but is first fed to an OpenTelemetry Collector agent. The Collector is in turn configured to export the processed trace data to the Zipkin server. You can use the exporter of your choice, such as Jaeger.
Understanding Collectors
The Collector consists of three components that access telemetry data:
- Receivers — A receiver, which can be push or pull based, is how data gets into the Collector. Receivers may support one or more data sources.
- Processors — Processors are run on data between being received and being exported. Processors are optional though some are recommended.
- Exporters — An exporter, which can be push or pull based, is how you send data to one or more backends/destinations. Exporters may support one or more data sources.
Another important concept to understand is the use of pipelines in the Collector configuration.
Pipelines can be of the following types:
- traces: collects and processes trace data.
- metrics: collects and processes metric data.
- logs: collects and processes log data.
A pipeline consists of a set of receivers, processors and exporters. Each receiver/processor/exporter must be defined in the configuration outside of the service section to be included in a pipeline.
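Putting these together, here is a minimal sketch of a trace pipeline. The batch processor and logging exporter are standard Collector components, used here purely for illustration:
receivers:
  otlp:                  # components are defined at the top level...
    protocols:
      grpc:

processors:
  batch:                 # batches spans before export

exporters:
  logging:               # prints telemetry to the Collector console

service:
  pipelines:
    traces:              # ...and referenced by name inside the pipeline
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]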
Demo
To demonstrate the usage of a Collector, I have introduced dummy /health check endpoints which get triggered every 10 seconds. This creates a lot of noise in our trace database, which we want to avoid. Let's see how to do this using the Collector component.
Step 1: Introduce dummy health check endpoints
Add the below method to the SchoolServiceController and StudentServiceController.
@RequestMapping(value = "/health", method = RequestMethod.GET)
public String getHealth() {
    return "School Service is Healthy";
}
Add the below class to both services. This class will call the health endpoint of the respective service every 10 seconds.
package com.nitin.otel.oteltracing;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

// Calls the service's /health endpoint every 10 seconds to simulate health-check noise.
@Component
public class MockHealthEndpointScheduler {

    @Autowired
    RestTemplate restTemplate;

    @Scheduled(fixedRate = 10000)
    public void fixedRateSch() {
        ResponseEntity<String> response =
                restTemplate.getForEntity("http://localhost:9001/health", String.class);
        System.out.println(response.getBody());
    }
}
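Note that the scheduler above assumes a RestTemplate bean exists and that scheduling is enabled. If your application from the earlier parts of this series does not already do this, here is a minimal sketch (the class name is illustrative):
package com.nitin.otel.oteltracing;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.web.client.RestTemplate;

// Illustrative main class: @EnableScheduling activates @Scheduled tasks, and the
// RestTemplate bean satisfies the @Autowired field in MockHealthEndpointScheduler.
@SpringBootApplication
@EnableScheduling
public class OtelTracingApplication {

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }

    public static void main(String[] args) {
        SpringApplication.run(OtelTracingApplication.class, args);
    }
}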
Check the Zipkin UI for the generated traces.
Step 2: Download and install the Collector.
For Windows users, you can download the binary from the location below. Please download the collector-contrib package, i.e. otelcol-contrib.exe:
https://github.com/open-telemetry/opentelemetry-collector-releases/releases
Step 3: Remove Zipkin VM arguments from the launch script/command
Remove the below VM arguments from the launch script. This is because we no longer want to set the exporter from the client instrumentation.
-Dotel.traces.exporter=zipkin
-Dotel.exporter.zipkin.endpoint=http://localhost:9411/api/v2/spans
With this configuration removed, the Java agent will push the tracing data to a Collector agent, which is its default behavior. But to view the traces, we still need the Collector to send the trace data to Zipkin.
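For reference, leaving these properties unset should be equivalent to the agent's defaults (based on recent agent versions; verify against the agent documentation for your version), which export over OTLP to a local Collector:
-Dotel.traces.exporter=otlp
-Dotel.exporter.otlp.endpoint=http://localhost:4317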
Step 4: Create an otel-config.yaml file and add a Zipkin exporter.
This configuration file is read by the Collector. As you can see, the default receiver is configured to use the OTLP format and allows both the http and grpc protocols.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  logging:
  zipkin:
    endpoint: "http://localhost:9411/api/v2/spans"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [zipkin]
      processors: []   # no processors configured yet
  telemetry:
    logs:
      level: "debug"
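Note that the configuration above defines a logging exporter but does not use it in the pipeline. If you want to see spans printed to the Collector console while debugging, you can add it to the pipeline's exporters, as in this sketch:
    traces:
      receivers: [otlp]
      exporters: [zipkin, logging]
      processors: []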
Step 5: Validate the results to see if trace data arrives at Zipkin via the Collector.
Start otelcol-contrib.exe using the following command:
D:\Installations>otelcol-contrib.exe --config=otel-config.yaml
The Collector will start up and begin sending the trace data to Zipkin. Validate that Zipkin is receiving the /health endpoint traces after the changes.
If you can see them, then you are good !! Now we want to get rid of this noisy trace data :)
Step 6: Update otel-config.yaml to add a custom processor that filters out the health endpoint traces.
Restart the Collector with the updated configuration below. You will notice that the health endpoint traces are no longer sent to the Zipkin server.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  zipkin:
    endpoint: "http://localhost:9411/api/v2/spans"
    tls:
      insecure: true

processors:
  #groupbytrace:
  groupbytrace/custom:
    wait_duration: 2s
    num_traces: 1000
  tail_sampling:
    decision_wait: 10s
    #num_traces: 100
    #expected_new_traces_per_sec: 10
    policies:
      [
        {
          name: stop-health-checks,
          type: string_attribute,
          string_attribute: {key: http.target, values: [\/health], enabled_regex_matching: true, invert_match: true}
        }
      ]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace/custom, tail_sampling]
      exporters: [zipkin]
  telemetry:
    logs:
      level: "debug"
You have successfully used a sampling technique to manage trace data with the TailSamplingProcessor. With tail-based sampling, we delay the sampling decision until all spans of a trace are available, which enables better sampling decisions based on all the data from the trace. For example, we can sample only failed or unusually long traces.
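For instance, if you wanted to keep only failed or slow traces, the tail_sampling processor also supports status_code and latency policy types. A sketch (the policy names and the 5-second threshold are illustrative):
  tail_sampling:
    decision_wait: 10s
    policies:
      [
        {
          name: keep-errors,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: keep-slow-traces,
          type: latency,
          latency: {threshold_ms: 5000}
        }
      ]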
In the TailSamplingProcessor, I have used the string_attribute type of sampling policy:
string_attribute: {key: http.target, values: [\/health], enabled_regex_matching: true, invert_match: true}
This configuration means we sample all traces which have the attribute "http.target" with a value not matching "/health", as I do not want to sample the /health check endpoints.
There are a number of other ways to implement sampling using the OTel Collector library. You will be spoilt for choice once you take a look at all the processors available for sampling at the GitHub link.
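For example, if sampling a fixed percentage of traces is enough for your use case, the probabilistic_sampler processor is a much simpler option. A sketch (15% is an arbitrary value):
processors:
  probabilistic_sampler:
    sampling_percentage: 15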
In the next article, I will cover the customization aspects of OpenTelemetry, wherein you will be able to add custom identifiers to the traces and then query or sample those traces using the UI or the Collector.
All the code for this article can be found at the GitHub link !! Enjoy learning ..
Next article link — https://nitin-rohidas.medium.com/using-custom-span-attributes-in-opentelemetry