Geneos and Open Telemetry Best Practice

Introduction

The examples in this document are intended to form the initial basis of best-practice advice to Geneos administrators on how to integrate with Open Telemetry instrumented applications.

Open Telemetry is a widely used, vendor-neutral, open-source framework for instrumenting and reporting telemetry data in the form of signals, which include traces, metrics and logs.

We will use the common short form OTel to refer to Open Telemetry.

Audience

These examples are intended for experienced Geneos administrators who have been tasked with integrating Open Telemetry instrumented applications into their existing Geneos estates.

If, on the other hand, you are experienced with OTel and have been asked to integrate it with Geneos, with which you may not have extensive experience, then we suggest reading the following introductory documents first:

Assumptions

You will have new or existing Open Telemetry instrumented applications, and we cannot expect to change how this instrumentation is configured. The best we can hope for is that Open Telemetry common/best practices have been followed.

As part of this we are going to assume that the practices in the OTel demo environment represent good practice: https://opentelemetry.io/docs/demo/

Geneos Overview

As an experienced Geneos user you may still be new to the Collection Agent, which is distributed as part of the Netprobe, and is typically managed as a separate process via the Netprobe configuration. If you need to know more please take a look at Introduction to Collection Agent.

Open Telemetry Overview

A good place to start with OTel is to work through their documentation, starting with the Observability Primer.

Their demo environment exercises the various signal types and language support for OTel as well as providing a good set of UIs to see what is happening in real time.

Challenges

OTel places a lot of emphasis on distributed tracing, while Geneos is more focused on exception monitoring. At the time of writing, Geneos, through the Collection Agent, only fully supports Metrics and Logs, and extracts only timing metrics from root spans in traces.

Approach

We are going to start with the OTel demo environment and work through, step-by-step, integrating it with Geneos. If your OTel instrumented application follows the patterns used by the OTel demo environment then many of the examples will work with minimal changes.

Let’s Get Going

Geneos Setup

First, let’s make sure there is a suitable Geneos set-up to run an up-to-date Collection Agent. All of the examples we use were tested with release 6.7.1 of both Gateway and Netprobe. You can use an existing Gateway and Netprobe if you choose, or for testing you can create new ones (this assumes you use geneos from the cordial toolset):

$ geneos deploy gateway otel-test -S
certificate created for gateway "otel-test" (expires 2025-04-10 00:00:00 +0000 UTC)
gateway "otel-test" added, port 7102
gateway "otel-test" started with PID 78905
$ geneos deploy netprobe otel-test -S
certificate created for netprobe "otel-test" (expires 2025-04-10 00:00:00 +0000 UTC)
netprobe "otel-test" added, port 7103
netprobe "otel-test" started with PID 79166

:warning: Please note that the above example also creates new certificates as the TLS subsystem has been previously initialised. All the example Gateway configurations assume secure connections between Geneos components. Please adjust accordingly if you are not using TLS. The OTLP gRPC port we use is not secured using TLS, but this can be done by using the certificate and key created for the Netprobe and adjusting the configurations accordingly.

Installing the OTel Demo

Install the demo in your environment as per the documentation. We will use docker, so it’s a case of following the start of the OTel demo documentation for Bring Your Own Backend:

git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo

To ensure that everything will work you can start the demo at this point and try out the UIs before moving on to connect it to the Geneos Collection Agent. Follow the appropriate instructions in the above docs.
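For reference, with the docker-based deployment starting the demo typically boils down to something like the following; the exact command and port may vary between demo releases, so treat the OTel demo documentation as authoritative:

docker compose up --force-recreate --remove-orphans --detach
# once the containers are up, the demo UIs are normally reachable through the
# frontend proxy, e.g. http://localhost:8080 for the web store and
# http://localhost:8080/jaeger/ui for Jaeger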

If you have any problems with the demo at this point please see the Troubleshooting section below and also the OTel demo documentation for further help.

Direct OTel Signals to Collection Agent

Follow the instructions in Bring Your Own Backend to add the Collection Agent OTel endpoint as an OTLP over gRPC endpoint. Remember to stop and restart the demo after making changes to the file mentioned below.

First, add an otlp exporter (for OTLP over gRPC, as opposed to otlphttp for HTTP) and give it a unique name by appending /ca for Collection Agent. If you want to extend the set-up to use multiple Collection Agents then you can use other names, such as /ca1 or /caprimary; just make sure you also adjust the other settings further on.

Edit src/otelcollector/otelcol-config-extras.yml in the opentelemetry-demo directory:

exporters:
  otlp/ca:
    endpoint: "CAHOST:4317"
    tls:
      insecure: true

You should replace CAHOST with the name or IP address of the Collection Agent host.
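
If you do want to secure the OTLP gRPC connection mentioned in the earlier warning, the otlp/ca exporter can be pointed at the certificate and key created for the Netprobe. The sketch below is an illustration only; the file paths are placeholders for wherever those files are visible to the collector container, and the Collection Agent's OTLP listener must also be configured for TLS:

exporters:
  otlp/ca:
    endpoint: "CAHOST:4317"
    tls:
      insecure: false
      ca_file: /path/to/chain.pem       # placeholder: CA chain used to verify the listener
      cert_file: /path/to/netprobe.pem  # placeholder: client certificate (for mutual TLS)
      key_file: /path/to/netprobe.key   # placeholder: client private key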

Next, add a service section, but remember to include the default exporters, as the array entries (those in [...]) replace the defaults rather than appending to them:

service:
  pipelines:
    traces:
      exporters: [otlp, debug, spanmetrics, otlp/ca]
    metrics:
      exporters: [otlphttp/prometheus, debug, otlp/ca]
    logs:
      exporters: [opensearch, debug, otlp/ca]

As you can see above, the exporter arrays for each pipeline have otlp/ca appended. The other YAML settings do not have to be included and are left unchanged.

Configure Collection Agent and Gateway

You should now set up the Netprobe that will run the Collection Agent and connect it to your Gateway.

:warning: Note that for the Collection Agent to start up correctly and to report Self-Monitoring you have to set two environment variables:

  • JAVA_HOME must point to the top level of the Java runtime you will use

    • Note: Java 17 is required for the Collection Agent

      Install it like this on Debian-based Linux distributions:

      sudo apt install openjdk-17-jdk
      
  • HOSTNAME must be the hostname of your system so that the self-monitoring can identify itself as a Dynamic Entity with that name

You can set both of these like this:

$ geneos set netprobe otel-test -e JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 -e HOSTNAME=myHostName
$ geneos restart netprobe otel-test
netprobe "otel-test" started with PID 98945

In the Gateway config, set up a Probe entry with the Dynamic Entities tab filled in. If you use your own names, remember to also make changes to the other examples below.

:warning: As for all of these XML configuration sections, copy them using the icon in the top right (if visible) or manually select the XML and press CTRL+C. Then, in the GSE, select the parent configuration section, right-click, and choose “Paste XML”.

<probe name="otel-test">
    <hostname>localhost</hostname>
    <port>7036</port>
    <secure>true</secure>
    <dynamicEntities>
        <mappingType ref="otel-mapping-type"/>
        <collectionAgentParameters ref="otel-ca"/>
    </dynamicEntities>
</probe>

:warning: Update the hostname and port above to match the Netprobe you have either created or have already selected.

Now, under the Dynamic Entities section of the GSE, you have to create one configuration in each of the four sections:

  • Mapping:

    <mapping name="ca-self-monitor">
        <builtIn>
            <name>Collection Agent V3</name>
        </builtIn>
    </mapping>
    
  • Collectors:

    <collector name="otel-collector">
        <plugin>
            <open-telemetry>
                <OpenTelemetryCollector>
                    <yaml>
                        <data/>
                    </yaml>
                </OpenTelemetryCollector>
            </open-telemetry>
        </plugin>
    </collector>
    

    Note above that the <yaml><data/></yaml> section can be empty to accept all the default values. We will start filling in this section later, but this empty set of defaults will work.

  • Mapping Types:

    <mappingType name="otel-mapping-type">
        <collectors>
            <collector ref="otel-collector"/>
        </collectors>
        <mappings>
            <mapping ref="ca-self-monitor"/>
        </mappings>
    </mappingType>
    
  • Collection Agent Parameters:

    <collectionAgentParameters name="otel-ca">
        <managed>
            <yaml>
                <managedYAML>
                    <selfMonitoring>true</selfMonitoring>
                </managedYAML>
            </yaml>
        </managed>
    </collectionAgentParameters>
    

Start the Collection Agent

Start the Collection Agent (and Netprobe) and make sure that you see the Self-Monitoring data after about 30 seconds:

:warning: Note in the latest versions of Geneos the Entities view only shows Dynamic Entities in a Netprobe that are in an errored state. Use the headline, as indicated above, to check that one Dynamic Entity has been accepted. You can change this behaviour to show a row-per-entity by overriding the default sampler used. See the Dynamic Entities Health plugin documentation for more.

You should now have a Managed Entity with the same name as your HOSTNAME environment variable set in the Netprobe environment:

Once the OTel demo is connected and streaming to the Geneos Collection Agent then you will also see a large number of entries in InvalidMetrics and InvalidStreams under the Netprobe Info entity, like this:

This is not a bad thing as such; in fact it means that the OTel demo is successfully sending data to the Collection Agent. The next step is to start mapping the OTel signals to usable Geneos data.

Mappings

Geneos supports Dynamic Entities through mappings that allow you to make statements about the meaning of the signals you receive in the Collection Agent and how those should be displayed in Geneos.

OTel has the concept of semantic conventions to help ensure that signals reported by OTel can be used to add meaning to the collected data. The current specification is here: https://opentelemetry.io/docs/specs/semconv/

Let’s use these two things to build some simple Geneos configuration that can display signals from the OTel demo in a meaningful way.

You could use the InvalidMetrics and InvalidStreams Dataviews to do this, but the Collection Agent logs are a much better place to start.

If you are using the latest geneos from the cordial repo, then you can do something like this:

geneos logs netprobe otel-test -C

This will output the last 10 lines from both the main Netprobe log file and also the Collection Agent log. You have to look at both log files as they each contain information you can use to create your mappings.

The main Netprobe logfile has entries like this:

===> netprobe "otel-test" /home/peter/geneos/netprobe/netprobes/otel-test/netprobe.log <===
2024-04-09 10:29:50.727+0100 ERROR: DynamicEntities No mapper matches the metric 'grpc.oteldemo.CurrencyService/Convert_latency|host.arch=amd64|host.name=9f0167b27676|os.type=linux|os.version=6.5.0-21-generic|process.command=/app/server.js|process.command_args=/usr/local/bin/node_--require_./Instrumentation.js_/app/server.js|process.executable.name=node|process.executable.path=/usr/local/bin/node|process.owner=nextjs|process.pid=17|process.runtime.description=Node.js|process.runtime.name=nodejs|process.runtime.version=18.20.1|service.name=frontend'. It will be ignored
2024-04-09 10:29:50.727+0100 ERROR: DynamicEntities No mapper matches the metric 'grpc.oteldemo.ProductCatalogService/ListProducts_latency|host.arch=amd64|host.name=9f0167b27676|os.type=linux|os.version=6.5.0-21-generic|process.command=/app/server.js|process.command_args=/usr/local/bin/node_--require_./Instrumentation.js_/app/server.js|process.executable.name=node|process.executable.path=/usr/local/bin/node|process.owner=nextjs|process.pid=17|process.runtime.description=Node.js|process.runtime.name=nodejs|process.runtime.version=18.20.1|service.name=frontend'. It will be ignored
2024-04-09 10:30:28.849+0100 ERROR: DynamicEntities No mapper matches the metric 'process.runtime.dotnet.gc.committed_memory.size|container.id=bb51603d87178c21eeb07715c49f6ec375638972569f195341d0af2e250289cb|host.name=bb51603d8717|service.name=cartservice'. It will be ignored
2024-04-09 10:30:28.849+0100 ERROR: DynamicEntities No mapper matches the metric 'process.runtime.dotnet.gc.heap.size|container.id=bb51603d87178c21eeb07715c49f6ec375638972569f195341d0af2e250289cb|host.name=bb51603d8717|service.name=cartservice'. It will be ignored
2024-04-09 10:30:28.849+0100 ERROR: DynamicEntities No mapper matches the metric 'process.runtime.dotnet.gc.heap.fragmentation.size|container.id=bb51603d87178c21eeb07715c49f6ec375638972569f195341d0af2e250289cb|host.name=bb51603d8717|service.name=cartservice'. It will be ignored

As you scroll across the lines you will see some common names, which follow the OTel semantic conventions. The two that jump out most often are host.name and service.name.

It may be tempting to use these values immediately as an Entity and Dataview in Geneos, but they are not always suitable. The host.name value is the docker container ID, which is not very descriptive and, more importantly, is dynamic and will change from one run of the demo environment to the next. service.name is a better candidate. Also, service.namespace makes a good Geneos Attribute.
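
For context, these values come from the application’s own OTel SDK configuration rather than anything in Geneos; they are commonly set via the standard OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES environment variables. A hedged illustration (the values here are examples, not taken verbatim from the demo):

export OTEL_SERVICE_NAME="checkoutservice"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=opentelemetry-demo,service.instance.id=checkout-1"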

Let’s try this as a new Mapping:

<mapping name="otel-basic">
    <custom>
        <geneosItems>
            <geneosItem>
                <label>service.name</label>
                <entity>
                    <required>true</required>
                    <useInDisplayName>true</useInDisplayName>
                </entity>
            </geneosItem>
            <geneosItem>
                <label>service.namespace</label>
                <attribute/>
            </geneosItem>
        </geneosItems>
    </custom>
    <localAttributes>
        <localAttribute>
            <name>COMPONENT</name>
            <value>
                <data>Astronomy Shop</data>
            </value>
        </localAttribute>
    </localAttributes>
</mapping>

Add this new Mapping to your Mapping Type, replacing the initial one above with:

<mappingType name="otel-mapping-type">
    <collectors>
        <collector ref="otel-collector"/>
    </collectors>
    <mappings>
        <mapping ref="ca-self-monitor"/>
        <mapping ref="otel-basic"/>
    </mappings>
</mappingType>

If you save this you should start seeing new Managed Entities pop up after about 30 seconds. But wait! There’s something wrong. The Dataview names are very long and include too much unwanted information. Similarly, in each Dataview there are one or two rows with very convoluted names instead of the two columns of data we would expect.

This is because all of the semantic attributes from OTel are being converted to Geneos Dimensions (and some Properties).

What we want to do is stop all of the attributes, except for the ones we are interested in, from being turned into Dimensions. We do this in the otel-collector configuration from above; remember how we left the YAML part empty? Now, let’s fill it in.

Paste this into the YAML part of the otel-collector configuration. Be careful with spacing as YAML is highly sensitive to indentation:

resource-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id

data-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id

What this configuration does is change the default behaviour of both resource-attributes and data-attributes: instead of including all attribute names, only the three attributes we have listed are passed up. Everything else becomes a “Property”, which means that it is available to Geneos but not used for naming.
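
The same pattern extends naturally if you later decide that another semantic-convention attribute deserves to be a Dimension: just add it to the relevant include lists. For example, using deployment.environment purely as an illustration (it is not required by anything in this guide):

resource-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
      - deployment.environment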

If you now save the configuration you should see useful data flowing through!

:warning: Note: Because of the flow of data during the configuration changes, you may see attributes other than those above in the Dataview and row names. To clean this out, restart the Collection Agent (and Netprobe):

geneos restart netprobe otel-test

State Tree

If you now right-click on a cell and run the Show Dynamic Mappings command you will see something like this:

Here you can see that only two Dimensions have made it through and all the other OTel attributes are Properties. The Geneos Attributes include both the service.name and the service.namespace.

Logs

The final piece of the puzzle is to be able to process the log entries being sent from OTel. To do this we need to add an FKM sampler to the otel-mapping-type.

First, we create the FKM sampler that subscribes to all streams, so that we can see what logs are being published:

<sampler name="logs">
    <var-group>
        <data>logs</data>
    </var-group>
    <plugin>
        <fkm>
            <files>
                <file>
                    <source>
                        <stream>
                            <data>*</data>
                        </stream>
                    </source>
                </file>
            </files>
        </fkm>
    </plugin>
</sampler>

Then, we add a reference to this sampler in otel-mapping-type, like this:

<mappingType name="otel-mapping-type">
    <collectors>
        <collector ref="otel-collector"/>
    </collectors>
    <mappings>
        <mapping ref="ca-self-monitor"/>
        <mapping ref="otel-basic"/>
    </mappings>
    <samplers>
        <sampler ref="logs"/>
    </samplers>
</mappingType>

If you now locate one of the Entities that contains a logs Dataview, you should be able to right-click on one of the streams data cells and run the View File command. If you select Continuous monitoring you will be able to watch the file stream updating in real-time. You can then begin using this to build FKM key tables to identify error conditions as normal. If you already know the likely log entries that need to be identified as error or exception conditions then you can go ahead and implement those too.


To use separate FKM samplers for different OTel log sources, so that you can apply different FKM key tables, you can add multiple FKM samplers. As long as a sampler’s stream name matches only the OTel log name you want, only that sampler will be applied to those log entries. For example, setting up a new FKM sampler for kafka.* logs:


<sampler name="kafka-logs">
    <var-group>
        <data>logs</data>
    </var-group>
    <plugin>
        <fkm>
            <files>
                <file>
                    <source>
                        <stream>
                            <data>kafka.*</data>
                        </stream>
                    </source>
                </file>
            </files>
        </fkm>
    </plugin>
</sampler>

and adding this to the otel-mapping-type in addition to the catch-all logs sampler:


<mappingType name="otel-mapping-type">
    <collectors>
        <collector ref="otel-collector"/>
    </collectors>
    <mappings>
        <mapping ref="ca-self-monitor"/>
        <mapping ref="otel-basic"/>
    </mappings>
    <samplers>
        <sampler ref="logs"/>
        <sampler ref="kafka-logs"/>
    </samplers>
</mappingType>

Now a new sampler appears, but only in those Entities that publish logs with names beginning with kafka.:

Unless you also change the original logs sampler, the same logs will continue to appear there as well.

Troubleshooting

  • OTel demo in WSL with docker

    If you are running the OTel demo environment in WSL under docker you will have to “fix” the DNS resolver as the built-in local resolver will not work correctly.

    Edit /etc/wsl.conf (as root, using sudo etc.) and add:

    [network]
    generateResolvConf = false
    

    Then stop and restart WSL, e.g. from a PowerShell prompt use wsl --shutdown to ensure the subsystem is actually stopped. Then start it as normal and create your own /etc/resolv.conf using an external DNS resolver, e.g.

    nameserver 8.8.8.8
    

    If you are hosting Geneos components on your WSL instance then you will have to be careful with local host definitions and port forwarding, and you may need to make further changes. It may be simplest, for testing, to use the WSL private IP instead of a hostname in the OTel collector extras file before starting the demo. See the comments in the /etc/hosts file in WSL for more info.
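
    One way to find that private IP from inside the WSL instance (the address printed will of course differ on your machine):

    hostname -I    # prints the WSL instance's IP address(es); use the first one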

    Now start the demo and it should work correctly.

  • Missing Log Datapoints

    At the time of writing, the Collection Agent OTel plugin incorrectly drops logs that have a zero EventTime. For example, in the demo the currencyservice should show a logs sampler but does not. This is expected to be fixed in an upcoming release.

    These are shown in the collection-agent.log like this:

    2024-04-11 10:02:23.330 [grpc-server-otel-collector-1] WARN  com.itrsgroup.collection.ca.workflow.pipelines.WorkflowPipeline(logs) - Dropping invalid data point: TIMESTAMP_OUT_OF_RANGE {Type=LOG_EVENT, Name=currencyservice, Namespace=itrsgroup.com/c2/opentelemetry-plugin/logs, Dimensions={service.name=currencyservice, service.namespace=opentelemetry-demo}, Properties={telemetry.sdk.version=1.13.0, telemetry.sdk.name=opentelemetry, telemetry.sdk.language=cpp, trace_id=806388ed2e79024859168d0502ad1430, span_id=f03263b113141f2b, ca_collector_name=otel-collector, itrsgroup.com/agent-timestamp=1712826143330}, EventTimestamp=0, ObservedTimestamp=1712826143330000000, Message=GetSupportedCurrencies successful, Count=1, Window={0 ns <- 0 ns}, SourceEventId=null, Severity=INFO} - skipped 9 similar log messages over the past 2 seconds
    
  • Other Errors in collection-agent.log

    The current Collection Agent OTel plugin does not support all signal types; in particular, Traces are not fully supported. The timings from traces are processed and used to build histograms, but the plugin only supports delta aggregation and not cumulative aggregation, so you will see these kinds of errors:

    2024-04-11 10:02:20.942 [grpc-server-otel-collector-2] WARN  com.itrsgroup.collection.plugins.opentelemetry.OpenTelemetryCollector(otel-collector) - No valid mapping for cumulative histogram - switch to delta aggregation if possible
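
    If these messages are noisy, one possible (untested here) mitigation is to convert cumulative metrics to delta in the OTel collector before they reach the otlp/ca exporter, assuming the demo's contrib collector build includes the cumulativetodelta processor and that the version in use handles histograms. It would be declared in the same extras file and then appended to the metrics pipeline's processors list, remembering that, as with the exporters earlier, the array replaces the defaults:

    processors:
      cumulativetodelta: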
    

Example Configuration

Below is a complete Gateway configuration include file with all of the above examples.

Please remember to change the Probe hostname, port, and secure flag to match your local instances.

<?xml version="1.0" encoding="ISO-8859-1"?>
<gateway compatibility="1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.itrsgroup.com/GA6.7.1-240223/gateway.xsd">
    <probes>
        <probe name="otel-test">
            <hostname>localhost</hostname>
            <port>7036</port>
            <secure>true</secure>
            <dynamicEntities>
                <mappingType ref="otel-mapping-type"/>
                <collectionAgentParameters ref="otel-ca"/>
            </dynamicEntities>
        </probe>
    </probes>
    <dynamicEntities>
        <simpleMappings>
            <mapping name="ca-self-monitor">
                <builtIn>
                    <name>Collection Agent V3</name>
                </builtIn>
            </mapping>
            <mapping name="otel-basic">
                <custom>
                    <geneosItems>
                        <geneosItem>
                            <label>service.name</label>
                            <entity>
                                <required>true</required>
                                <useInDisplayName>true</useInDisplayName>
                            </entity>
                        </geneosItem>
                        <geneosItem>
                            <label>service.namespace</label>
                            <attribute/>
                        </geneosItem>
                    </geneosItems>
                </custom>
                <localAttributes>
                    <localAttribute>
                        <name>COMPONENT</name>
                        <value>
                            <data>Astronomy Shop</data>
                        </value>
                    </localAttribute>
                </localAttributes>
            </mapping>
        </simpleMappings>
        <collectors>
            <collector name="otel-collector">
                <plugin>
                    <open-telemetry>
                        <OpenTelemetryCollector>
                            <yaml>
                                <data>resource-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id

data-attributes:
  metrics:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  logs:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
  traces:
    base: none
    include:
      - service.namespace
      - service.name
      - service.instance.id
</data>
                            </yaml>
                        </OpenTelemetryCollector>
                    </open-telemetry>
                </plugin>
            </collector>
        </collectors>
        <mappingTypes>
            <mappingType name="otel-mapping-type">
                <collectors>
                    <collector ref="otel-collector"/>
                </collectors>
                <mappings>
                    <mapping ref="ca-self-monitor"/>
                    <mapping ref="otel-basic"/>
                </mappings>
                <samplers>
                    <sampler ref="logs"/>
                    <sampler ref="kafka-logs"/>
                </samplers>
            </mappingType>
        </mappingTypes>
        <collectionAgentParameters>
            <collectionAgentParameters name="otel-ca">
                <managed>
                    <yaml>
                        <managedYAML>
                            <selfMonitoring>true</selfMonitoring>
                        </managedYAML>
                    </yaml>
                </managed>
            </collectionAgentParameters>
        </collectionAgentParameters>
    </dynamicEntities>
    <samplers>
        <sampler name="logs">
            <var-group>
                <data>logs</data>
            </var-group>
            <plugin>
                <fkm>
                    <files>
                        <file>
                            <source>
                                <stream>
                                    <data>*</data>
                                </stream>
                            </source>
                        </file>
                    </files>
                </fkm>
            </plugin>
        </sampler>
        <sampler name="kafka-logs">
            <var-group>
                <data>logs</data>
            </var-group>
            <plugin>
                <fkm>
                    <files>
                        <file>
                            <source>
                                <stream>
                                    <data>kafka.*</data>
                                </stream>
                            </source>
                        </file>
                    </files>
                </fkm>
            </plugin>
        </sampler>
    </samplers>
</gateway>