Geneos and OpenTelemetry Best Practice
Introduction
The examples in this document are intended to form the initial basis for best-practice advice to Geneos administrators on how to integrate with OpenTelemetry-instrumented applications.
OpenTelemetry is a widely used, vendor-neutral, open-source framework for instrumenting applications and reporting telemetry data in the form of signals, which include traces, metrics and logs.
We will use the common short form OTel to refer to OpenTelemetry.
Audience
These examples are intended for experienced Geneos administrators who have been tasked with integrating OpenTelemetry-instrumented applications into their existing Geneos estates.
If, instead, you are experienced with OTel and have been asked to integrate it with Geneos, with which you may not have extensive experience, then we suggest reading the following introductory documents first:
Assumptions
You will have new or existing OpenTelemetry-instrumented applications, and we cannot expect to change how this instrumentation is configured. The best we can hope for is that common OpenTelemetry best practices have been followed.
As part of this we are going to assume that the practices in the OTel demo environment represent good practice: https://opentelemetry.io/docs/demo/
Geneos Overview
As an experienced Geneos user you may still be new to the Collection Agent, which is distributed as part of the Netprobe and is typically managed as a separate process via the Netprobe configuration. If you need to know more, please take a look at Introduction to Collection Agent.
Open Telemetry Overview
A good place to start with OTel is to work through their documentation starting with the Observability Primer
Their demo environment exercises the various signal types and language support for OTel as well as providing a good set of UIs to see what is happening in real time.
Challenges
OTel places a lot of emphasis on distributed tracing, while Geneos is more focused on exception monitoring. At the time of writing, Geneos, through the Collection Agent, only supports Metrics and Logs fully, and extracts only timing metrics from root spans in traces.
Approach
We are going to start with the OTel demo environment and work through, step-by-step, integrating it with Geneos. If your OTel instrumented application follows the patterns used by the OTel demo environment then many of the examples will work with minimal changes.
Let’s Get Going
Geneos Setup
First, let’s make sure there is a suitable Geneos set-up to run an up-to-date Collection Agent. All of the examples we use were tested with release 6.7.1 of both Gateway and Netprobe. You can use an existing Gateway and Netprobe if you choose, or for testing you can create new ones (this assumes you use geneos from the cordial toolset):
$ geneos deploy gateway otel-test -S
certificate created for gateway "otel-test" (expires 2025-04-10 00:00:00 +0000 UTC)
gateway "otel-test" added, port 7102
gateway "otel-test" started with PID 78905
$ geneos deploy netprobe otel-test -S
certificate created for netprobe "otel-test" (expires 2025-04-10 00:00:00 +0000 UTC)
netprobe "otel-test" added, port 7103
netprobe "otel-test" started with PID 79166
Please note that the above example also creates new certificates as the TLS subsystem has been previously initialised. All the example Gateway configurations assume secure connections between Geneos components. Please adjust accordingly if you are not using TLS. The OTLP gRPC port we use is not secured using TLS, but this can be done by using the certificate and key created for the Netprobe and adjusting the configurations accordingly.
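If you do want to secure the OTLP gRPC connection into the Collection Agent, the collector side is configured through the exporter's tls block. The sketch below is illustrative only: the file path is a placeholder for your own PKI layout, and the matching certificate and key must also be configured on the Collection Agent listener.

```yaml
exporters:
  otlp/ca:
    endpoint: "CAHOST:4317"
    tls:
      # CA certificate used to verify the Collection Agent's server
      # certificate; the path below is a placeholder, not a real file
      ca_file: /etc/otelcol/certs/geneos-ca.pem
```

This replaces the insecure: true setting shown later in this document; do not set both.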
Installing the OTel Demo
Install the demo in your environment as per the documentation. We will use Docker, so it’s a case of following the start of the OTel demo documentation for Bring Your Own Backend:
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
To ensure that everything will work you can start the demo at this point and try out the UIs before moving on to connect it to the Geneos Collection Agent. Follow the appropriate instructions in the above docs.
If you have any problems with the demo at this point please see the Troubleshooting section below and also the OTel demo documentation for further help.
Direct OTel Signals to Collection Agent
Follow the instructions in Bring Your Own Backend to add the Collection Agent OTel endpoint as an OTLP gRPC URL. Remember to stop and restart the demo after making changes to the file mentioned below.
First, add an otlp exporter (for OTLP over gRPC, as opposed to otlphttp for HTTP) and, to give it a unique name, add the /ca suffix for Collection Agent. If you want to extend the set-up to use multiple Collection Agents then you can use other names, such as /ca1 or /caprimary; just make sure you also adjust the other settings further on.
Edit src/otelcollector/otelcol-config-extras.yml in the opentelemetry-demo directory:
exporters:
  otlp/ca:
    endpoint: "CAHOST:4317"
    tls:
      insecure: true
You should replace CAHOST with the name or IP address of the Collection Agent host.
Next, add a service section, but remember to include the default exporters, as the array entries (those in [...]) overwrite the defaults rather than appending to them:
service:
  pipelines:
    traces:
      exporters: [otlp, debug, spanmetrics, otlp/ca]
    metrics:
      exporters: [otlphttp/prometheus, debug, otlp/ca]
    logs:
      exporters: [opensearch, debug, otlp/ca]
As you can see above, the exporters array for each pipeline has otlp/ca appended. The other YAML settings do not have to be included and are left unchanged.
Configure Collection Agent and Gateway
You should now set-up the Netprobe that will run the Collection Agent and connect it to your Gateway.
Note that for the Collection Agent to start up correctly and report Self-Monitoring you have to set two environment variables:
- JAVA_HOME must point to the top level of the Java runtime you will use. Note: Java 17 is required for the Collection Agent. On Debian-based Linuxes you can install it like this:
sudo apt install openjdk-17-jdk
- HOSTNAME must be the hostname of your system so that the self-monitoring can identify itself as a Dynamic Entity with that name.
You can set both of these like this:
$ geneos set netprobe otel-test -e JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 -e HOSTNAME=myHostName
$ geneos restart netprobe otel-test
netprobe "otel-test" started with PID 98945
In the Gateway configuration, set up a Probe entry with the Dynamic Entities tab filled in. If you use your own names, remember to also make changes to the other examples below.
As for all the XML configuration sections in this document, copy them using the icon in the top right (if visible) or manually select the XML and press CTRL+C. Then, in the GSE, select the parent configuration section, right-click and choose “Paste XML”.
<probe name="otel-test">
  <hostname>localhost</hostname>
  <port>7036</port>
  <secure>true</secure>
  <dynamicEntities>
    <mappingType ref="otel-mapping-type"/>
    <collectionAgentParameters ref="otel-ca"/>
  </dynamicEntities>
</probe>
Update the hostname and port above to match the Netprobe you have either created or already selected.
Now, under the Dynamic Entities section of the GSE you have to create one configuration in each of the four sections:
- Mapping:
  <mapping name="ca-self-monitor">
    <builtIn>
      <name>Collection Agent V3</name>
    </builtIn>
  </mapping>
- Collectors:
  <collector name="otel-collector">
    <plugin>
      <open-telemetry>
        <OpenTelemetryCollector>
          <yaml>
            <data/>
          </yaml>
        </OpenTelemetryCollector>
      </open-telemetry>
    </plugin>
  </collector>
  Note above that the <yaml><data/></yaml> section can be empty to accept all the default values. We will start filling in this section later, but this empty set of defaults will work.
- Mapping Types:
  <mappingType name="otel-mapping-type">
    <collectors>
      <collector ref="otel-collector"/>
    </collectors>
    <mappings>
      <mapping ref="ca-self-monitor"/>
    </mappings>
  </mappingType>
- Collection Agent Parameters:
  <collectionAgentParameters name="otel-ca">
    <managed>
      <yaml>
        <managedYAML>
          <selfMonitoring>true</selfMonitoring>
        </managedYAML>
      </yaml>
    </managed>
  </collectionAgentParameters>
Start the Collection Agent
Start the Collection Agent (and Netprobe) and make sure that you see the Self-Monitoring data after about 30 seconds:
Note that in the latest versions of Geneos the Entities view only shows Dynamic Entities in a Netprobe that are in an errored state. Use the headline, as indicated above, to check that one Dynamic Entity has been accepted. You can change this behaviour to show a row per entity by overriding the default sampler used. See the Dynamic Entities Health plugin documentation for more.
You should now have a Managed Entity with the same name as the HOSTNAME environment variable set in the Netprobe environment:
Once the OTel demo is connected and streaming to the Geneos Collection Agent you will also see a large number of entries in InvalidMetrics and InvalidStreams under the Netprobe Info entity, like this:
This is not a bad thing as such; in fact it means that the OTel demo is successfully sending data to the Collection Agent. The next step is to start mapping the OTel signals to usable Geneos data.
Mappings
Geneos supports Dynamic Entities through mappings that allow you to make statements about the meaning of the signals you receive in the Collection Agent and how those should be displayed in Geneos.
OTel has the concept of semantic conventions to help ensure that signals reported by OTel can be used to add meaning to the collected data. The current specification is here: https://opentelemetry.io/docs/specs/semconv/
Let’s use these two things to build some simple Geneos configuration that can display signals from the OTel demo in a meaningful way.
You could use the InvalidMetrics and InvalidStreams Dataviews to do this, but the Collection Agent logs are a much better place to start.
If you are using the latest geneos from the cordial repo, then you can do something like this:
geneos logs netprobe otel-test -C
This will output the last 10 lines from both the main Netprobe log file and also the Collection Agent log. You have to look at both log files as they each contain information you can use to create your mappings.
The main Netprobe logfile has entries like this:
===> netprobe "otel-test" /home/peter/geneos/netprobe/netprobes/otel-test/netprobe.log <===
2024-04-09 10:29:50.727+0100 ERROR: DynamicEntities No mapper matches the metric 'grpc.oteldemo.CurrencyService/Convert_latency|host.arch=amd64|host.name=9f0167b27676|os.type=linux|os.version=6.5.0-21-generic|process.command=/app/server.js|process.command_args=/usr/local/bin/node_--require_./Instrumentation.js_/app/server.js|process.executable.name=node|process.executable.path=/usr/local/bin/node|process.owner=nextjs|process.pid=17|process.runtime.description=Node.js|process.runtime.name=nodejs|process.runtime.version=18.20.1|service.name=frontend'. It will be ignored
2024-04-09 10:29:50.727+0100 ERROR: DynamicEntities No mapper matches the metric 'grpc.oteldemo.ProductCatalogService/ListProducts_latency|host.arch=amd64|host.name=9f0167b27676|os.type=linux|os.version=6.5.0-21-generic|process.command=/app/server.js|process.command_args=/usr/local/bin/node_--require_./Instrumentation.js_/app/server.js|process.executable.name=node|process.executable.path=/usr/local/bin/node|process.owner=nextjs|process.pid=17|process.runtime.description=Node.js|process.runtime.name=nodejs|process.runtime.version=18.20.1|service.name=frontend'. It will be ignored
2024-04-09 10:30:28.849+0100 ERROR: DynamicEntities No mapper matches the metric 'process.runtime.dotnet.gc.committed_memory.size|container.id=bb51603d87178c21eeb07715c49f6ec375638972569f195341d0af2e250289cb|host.name=bb51603d8717|service.name=cartservice'. It will be ignored
2024-04-09 10:30:28.849+0100 ERROR: DynamicEntities No mapper matches the metric 'process.runtime.dotnet.gc.heap.size|container.id=bb51603d87178c21eeb07715c49f6ec375638972569f195341d0af2e250289cb|host.name=bb51603d8717|service.name=cartservice'. It will be ignored
2024-04-09 10:30:28.849+0100 ERROR: DynamicEntities No mapper matches the metric 'process.runtime.dotnet.gc.heap.fragmentation.size|container.id=bb51603d87178c21eeb07715c49f6ec375638972569f195341d0af2e250289cb|host.name=bb51603d8717|service.name=cartservice'. It will be ignored
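Each of these log entries carries a metric identity of the form name|key=value|key=value. As an illustrative sketch (not Geneos code, just a way to inspect the format), you can pull one apart to see which attributes are available for mapping:

```python
def parse_metric_identity(line: str):
    """Split a 'name|key=value|key=value' metric identity, as seen in
    the DynamicEntities log entries, into its name and attribute map."""
    name, *pairs = line.split("|")
    attributes = dict(pair.split("=", 1) for pair in pairs)
    return name, attributes

# Example using one of the shorter identities from the log above
name, attrs = parse_metric_identity(
    "process.runtime.dotnet.gc.heap.size"
    "|host.name=bb51603d8717|service.name=cartservice"
)
```

Printing attrs shows the semantic-convention attribute names, such as host.name and service.name, that the mappings below are built on.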
As you scroll across the lines you will see some common names, which follow the OTel semantic conventions. The two that jump out most often are host.name and service.name.
It may be tempting to immediately use these values as the Entity and Dataview names in Geneos, but they are not always suitable. The host.name value here is the docker container ID, which is not very descriptive and, more importantly, is dynamic and will change from one run of the demo environment to the next. service.name is a better candidate. Also, service.namespace makes a good Geneos Attribute.
Let’s try this as a new Mapping:
<mapping name="otel-basic">
  <custom>
    <geneosItems>
      <geneosItem>
        <label>service.name</label>
        <entity>
          <required>true</required>
          <useInDisplayName>true</useInDisplayName>
        </entity>
      </geneosItem>
      <geneosItem>
        <label>service.namespace</label>
        <attribute/>
      </geneosItem>
    </geneosItems>
  </custom>
  <localAttributes>
    <localAttribute>
      <name>COMPONENT</name>
      <value>
        <data>Astronomy Shop</data>
      </value>
    </localAttribute>
  </localAttributes>
</mapping>
Add this new Mapping to your Mapping Type, replacing the initial one above with:
<mappingType name="otel-mapping-type">
  <collectors>
    <collector ref="otel-collector"/>
  </collectors>
  <mappings>
    <mapping ref="ca-self-monitor"/>
    <mapping ref="otel-basic"/>
  </mappings>
</mappingType>
If you save this you should start seeing new Managed Entities pop up after about 30 seconds. But wait! There’s something wrong. The Dataview names are very long and include too much unwanted information. Similarly, in each Dataview there are one or two rows with very convoluted names instead of the columns of data we would expect.
This is because all of the semantic attributes from OTel are being converted to Geneos Dimensions (and some Properties).
What we want to do is stop all the attributes, except for the ones we are interested in, from being turned into Dimensions. We do this in the otel-collector configuration from above; remember we left the YAML part empty? Now let’s fill it in.
Paste this into the YAML part of the otel-collector configuration. Be careful with spacing, as YAML is highly sensitive to indentation:
resource-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
data-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
What this configuration does is change the default behaviour of both resource-attributes and data-attributes from including all names to passing up only the three attributes we have listed. Everything else becomes a “Property”, which means that it is available to Geneos but not used for naming.
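The effect of base: none plus an include list can be pictured with a small sketch. This is not the plugin's actual code, just a model of the behaviour it describes: listed names survive as Dimensions, everything else is demoted to a Property.

```python
def partition_attributes(attributes: dict, include: list):
    """Model 'base: none' with an include list: keep only the listed
    names as dimensions; demote every other attribute to a property."""
    dimensions = {k: v for k, v in attributes.items() if k in include}
    properties = {k: v for k, v in attributes.items() if k not in include}
    return dimensions, properties
```

Running this over the cartservice attributes seen in the earlier log lines leaves only service.name as a Dimension, with container.id and host.name still available as Properties.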
If you now save the configuration you should see useful data flowing through!
Note: Because of the flow of data during the configuration changes, you may see attributes other than those above in the Dataview and row names. To clean this out, restart the Collection Agent (and Netprobe):
geneos restart netprobe otel-test
If you now right-click on a cell and run the Show Dynamic Mappings command you will see something like this:
Here you can see that only two Dimensions have made it through and all the other OTel attributes are Properties. The Geneos Attributes include both the service.name and the service.namespace.
Logs
The final piece of the puzzle is to process the log entries being sent from OTel. To do this we need to add an FKM sampler to the otel-mapping-type.
First, we create the FKM sampler that subscribes to all streams, so that we can see what logs are being published:
<sampler name="logs">
  <var-group>
    <data>logs</data>
  </var-group>
  <plugin>
    <fkm>
      <files>
        <file>
          <source>
            <stream>
              <data>*</data>
            </stream>
          </source>
        </file>
      </files>
    </fkm>
  </plugin>
</sampler>
Then, we add a reference to this sampler in otel-mapping-type, like this:
<mappingType name="otel-mapping-type">
  <collectors>
    <collector ref="otel-collector"/>
  </collectors>
  <mappings>
    <mapping ref="ca-self-monitor"/>
    <mapping ref="otel-basic"/>
  </mappings>
  <samplers>
    <sampler ref="logs"/>
  </samplers>
</mappingType>
If you now locate one of the Entities that contains a logs Dataview, you should be able to right-click on one of the stream data cells and run the View File command. If you select Continuous monitoring you will be able to watch the file stream updating in real time. You can then begin using this to build FKM key tables to identify error conditions as normal. If you already know the likely log entries that need to be flagged as error or exception conditions then you can go ahead and implement those too.
To use separate FKM samplers for different OTel log sources, so that you can apply different FKM key tables, add multiple FKM samplers. As long as a sampler’s stream pattern matches only the OTel log names you want, only that sampler will be applied to those log entries. For example, set up a new FKM sampler for kafka.* logs:
<sampler name="kafka-logs">
  <var-group>
    <data>logs</data>
  </var-group>
  <plugin>
    <fkm>
      <files>
        <file>
          <source>
            <stream>
              <data>kafka.*</data>
            </stream>
          </source>
        </file>
      </files>
    </fkm>
  </plugin>
</sampler>
and adding this to the otel-mapping-type in addition to the catch-all logs sampler:
<mappingType name="otel-mapping-type">
  <collectors>
    <collector ref="otel-collector"/>
  </collectors>
  <mappings>
    <mapping ref="ca-self-monitor"/>
    <mapping ref="otel-basic"/>
  </mappings>
  <samplers>
    <sampler ref="logs"/>
    <sampler ref="kafka-logs"/>
  </samplers>
</mappingType>
Now a new sampler appears, but only in those entities that publish logs with names beginning kafka.:
Unless you also change the original logs sampler, the same logs will continue to appear there too.
Troubleshooting
- OTel demo in WSL with docker
If you are running the OTel demo environment in WSL under docker you will have to “fix” the DNS resolver, as the built-in local resolver will not work correctly.
Edit /etc/wsl.conf (as root, using sudo etc.) and add:
[network]
generateResolvConf = false
Then stop and restart WSL, e.g. from a PowerShell prompt use wsl --shutdown to ensure the subsystem is actually stopped. Then start it as normal and create your own /etc/resolv.conf using an external DNS resolver, e.g. nameserver 8.8.8.8
If you are hosting Geneos components on your WSL instance then you will have to be careful with the local host definitions and port forwarding, and you may need to make further changes. It may be simplest, for testing, to use the WSL private IP instead of a hostname in the OTel collector extras file before starting the demo. See the comments in the /etc/hosts file in WSL for more info.
Now start the demo and it should work correctly.
- Missing Log Datapoints
At the time of writing the Collection Agent OTel plugin is incorrectly dropping logs that have a zero EventTime. For example, in the demo the currencyservice should show a logs sampler but does not. This is expected to be fixed in a release soon.
These are shown in the collection-agent.log like this:
2024-04-11 10:02:23.330 [grpc-server-otel-collector-1] WARN com.itrsgroup.collection.ca.workflow.pipelines.WorkflowPipeline(logs) - Dropping invalid data point: TIMESTAMP_OUT_OF_RANGE {Type=LOG_EVENT, Name=currencyservice, Namespace=itrsgroup.com/c2/opentelemetry-plugin/logs, Dimensions={service.name=currencyservice, service.namespace=opentelemetry-demo}, Properties={telemetry.sdk.version=1.13.0, telemetry.sdk.name=opentelemetry, telemetry.sdk.language=cpp, trace_id=806388ed2e79024859168d0502ad1430, span_id=f03263b113141f2b, ca_collector_name=otel-collector, itrsgroup.com/agent-timestamp=1712826143330}, EventTimestamp=0, ObservedTimestamp=1712826143330000000, Message=GetSupportedCurrencies successful, Count=1, Window={0 ns <- 0 ns}, SourceEventId=null, Severity=INFO} - skipped 9 similar log messages over the past 2 seconds
- Other Errors in collection-agent.log
The current Collection Agent OTel plugin does not support all signal types; in particular, Traces are not fully supported. The timings from traces are processed and used to build histograms, but the plugin only supports delta times and not absolute times, so you will see these kinds of errors:
2024-04-11 10:02:20.942 [grpc-server-otel-collector-2] WARN com.itrsgroup.collection.plugins.opentelemetry.OpenTelemetryCollector(otel-collector) - No valid mapping for cumulative histogram - switch to delta aggregation if possible
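The zero-EventTime drop described in Missing Log Datapoints above can be modelled with a short sketch. This is illustrative only; the Collection Agent's real validation logic and limits are not published here, and the skew window is an invented parameter.

```python
def accept_log_event(event_ts_ns: int, observed_ts_ns: int,
                     max_skew_ns: int = 3600 * 10**9) -> bool:
    """Model of the timestamp check: a zero event timestamp, or one far
    outside the observed window, is treated as TIMESTAMP_OUT_OF_RANGE."""
    if event_ts_ns == 0:
        return False  # the case currently hitting the demo's currencyservice logs
    return abs(observed_ts_ns - event_ts_ns) <= max_skew_ns
```

Under this model the dropped data point above (EventTimestamp=0, ObservedTimestamp=1712826143330000000) is rejected even though the observed timestamp is perfectly valid.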
Example Configuration
Below is a complete Gateway configuration include file with all of the above examples.
Please remember to change the Probe hostname, port and secure flag to match your local instances.
<?xml version="1.0" encoding="ISO-8859-1"?>
<gateway compatibility="1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.itrsgroup.com/GA6.7.1-240223/gateway.xsd">
<probes>
<probe name="otel-test">
<hostname>localhost</hostname>
<port>7036</port>
<secure>true</secure>
<dynamicEntities>
<mappingType ref="otel-mapping-type"/>
<collectionAgentParameters ref="otel-ca"/>
</dynamicEntities>
</probe>
</probes>
<dynamicEntities>
<simpleMappings>
<mapping name="ca-self-monitor">
<builtIn>
<name>Collection Agent V3</name>
</builtIn>
</mapping>
<mapping name="otel-basic">
<custom>
<geneosItems>
<geneosItem>
<label>service.name</label>
<entity>
<required>true</required>
<useInDisplayName>true</useInDisplayName>
</entity>
</geneosItem>
<geneosItem>
<label>service.namespace</label>
<attribute/>
</geneosItem>
</geneosItems>
</custom>
<localAttributes>
<localAttribute>
<name>COMPONENT</name>
<value>
<data>Astronomy Shop</data>
</value>
</localAttribute>
</localAttributes>
</mapping>
</simpleMappings>
<collectors>
<collector name="otel-collector">
<plugin>
<open-telemetry>
<OpenTelemetryCollector>
<yaml>
<data>resource-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
data-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
</data>
</yaml>
</OpenTelemetryCollector>
</open-telemetry>
</plugin>
</collector>
</collectors>
<mappingTypes>
<mappingType name="otel-mapping-type">
<collectors>
<collector ref="otel-collector"/>
</collectors>
<mappings>
<mapping ref="ca-self-monitor"/>
<mapping ref="otel-basic"/>
</mappings>
<samplers>
<sampler ref="logs"/>
<sampler ref="kafka-logs"/>
</samplers>
</mappingType>
</mappingTypes>
<collectionAgentParameters>
<collectionAgentParameters name="otel-ca">
<managed>
<yaml>
<managedYAML>
<selfMonitoring>true</selfMonitoring>
</managedYAML>
</yaml>
</managed>
</collectionAgentParameters>
</collectionAgentParameters>
</dynamicEntities>
<samplers>
<sampler name="logs">
<var-group>
<data>logs</data>
</var-group>
<plugin>
<fkm>
<files>
<file>
<source>
<stream>
<data>*</data>
</stream>
</source>
</file>
</files>
</fkm>
</plugin>
</sampler>
<sampler name="kafka-logs">
<var-group>
<data>logs</data>
</var-group>
<plugin>
<fkm>
<files>
<file>
<source>
<stream>
<data>kafka.*</data>
</stream>
</source>
</file>
</files>
</fkm>
</plugin>
</sampler>
</samplers>
</gateway>