Logging and Tracing Architecture

Overview

Logging and tracing are foundational pillars of observability in modern distributed systems. In a microservices-based architecture, where services communicate across network and process boundaries, traditional debugging approaches are insufficient.

Logging captures discrete events, state changes, errors, and operational metadata generated by services over time.
Tracing provides an end-to-end, time-ordered view of a request as it propagates across multiple services, enabling visibility into latency, dependencies, and execution paths.

In distributed environments where services interact synchronously and asynchronously, observability must provide:

Deep visibility into runtime behavior
End-to-end request tracking
Operational intelligence for proactive monitoring
Forensic capability for audits and incident analysis

This document describes the centralized logging and tracing architecture, which uses the OpenTelemetry Protocol (OTLP) as the standard telemetry format, Fluent Bit as the collection and forwarding agent, and OpenSearch for indexing, storage, and analytics.

The architecture ensures standardized telemetry ingestion, structured indexing, and cross-service correlation across the platform.

As needed, data can be routed to:

External / third-party SIEMs, e.g., Splunk
EF Log Monitor: requires a custom backend, a custom frontend, and a query layer.

Architecture

Our observability architecture captures and routes log and trace data using a structured, modular pipeline:

Audit Log Generation in the Application

Application components emit logs and traces in a standardized OTLP JSON format. These are automatically collected using Kubernetes-native logging mechanisms and forwarded to a central logging system. Audit logs are kept logically separate from standard application logs to allow independent access control, easier review by operations teams, and separate retention policies. Audit events are generated by the application in a structured JSON format, and each audit log includes mandatory attributes to ensure traceability and consistency.

Logging in Kubernetes Pods

Application containers write logs to standard output. Kubernetes handles log persistence at the node level. No audit logs are stored locally inside containers, reducing the risk of loss or tampering due to container restarts.
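As a minimal sketch of the step above, an application could assemble an audit event and write it to standard output as a single JSON line. The function names `build_audit_event` and `emit` are hypothetical and not part of the platform; the field names are taken from the Logging Format section of this document.

```python
import json
import sys
from datetime import datetime, timezone


def build_audit_event(user_id, user_name, action, resource, resource_id,
                      source_ip_address, service, tenant_id, updated_data):
    """Assemble one audit record using the field names from the Logging Format section."""
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": user_id,
        "user_name": user_name,
        "action": action,
        "resource": resource,
        "resource_id": resource_id,
        "source_ip_address": source_ip_address,
        "attributes": {
            "service": service,
            "tenantId": tenant_id,
            "updated_data": updated_data,
        },
        "type": "audit_logging",  # deterministic field used for routing
        "level": "info",
    }


def emit(event, stream=None):
    """Write the event as one JSON line; standard output is the default destination."""
    stream = stream if stream is not None else sys.stdout
    stream.write(json.dumps(event) + "\n")
    stream.flush()
```

Writing one record per line keeps each event self-contained, so a node-level collector can parse it without multi-line handling.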

Log Collection Using Fluent Bit

Fluent Bit is deployed as a DaemonSet within the Kubernetes cluster. This ensures:

Logs are collected consistently from all application pods
Centralized control over log collection and processing

Log Identification & Filtering

Audit logs are identified using a deterministic field:

"type": "audit_logging"

Tracing logs are identified using a deterministic field:

"type": "tracing"

Filtering Logic

All container logs are collected
Logs with type="audit_logging" are filtered into the audit pipeline
Logs with type="tracing" are filtered into the tracing pipeline
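The filtering rules above can be illustrated with a short sketch. It is not the Fluent Bit configuration itself, only the same decision expressed in Python; the function name `classify` and the pipeline label "other" are assumptions made for illustration.

```python
import json

AUDIT_TYPE = "audit_logging"
TRACING_TYPE = "tracing"


def classify(raw_line):
    """Return 'audit', 'tracing', or 'other' for one container log line."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return "other"  # plain-text lines are not audit or tracing records
    if not isinstance(record, dict):
        return "other"  # valid JSON, but not an object with a "type" field
    record_type = record.get("type")
    if record_type == AUDIT_TYPE:
        return "audit"
    if record_type == TRACING_TYPE:
        return "tracing"
    return "other"
```

Because the decision depends on a single exact-match field, the same record always lands in the same pipeline.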

Routing Logs to Destinations

Log Type	Destination
Audit Logs	Dedicated audit index / stream: audit_log_index
Tracing Logs	Dedicated tracing index / stream: tracing_index
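To make the routing concrete, the sketch below maps each record's `type` to its destination index and renders the records in the newline-delimited format accepted by the OpenSearch Bulk API. In the actual pipeline this work is done by Fluent Bit; the helper name `to_bulk_body` is hypothetical, and skipping records of any other type is an assumption of this sketch.

```python
import json

# Destination index per log type, as listed in the routing table.
INDEX_BY_TYPE = {
    "audit_logging": "audit_log_index",
    "tracing": "tracing_index",
}


def to_bulk_body(records):
    """Render records as a Bulk API body: one action line, then one document line."""
    lines = []
    for record in records:
        index = INDEX_BY_TYPE.get(record.get("type"))
        if index is None:
            continue  # assumption: records of other types are not routed here
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(record))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"  # the Bulk API expects a trailing newline
```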

Storage & Indexing - OpenSearch

OpenSearch acts as the centralized operational data store for logs and traces.

It provides:

Scalable storage for high-volume log and trace data.
Efficient indexing for fast searches during incidents.
Support for high-cardinality fields such as trace IDs and request IDs.
Reliable retention and access to historical data for post-incident analysis.

Visualization & Analysis - OpenSearch Dashboards

OpenSearch Dashboards serves as the primary interface to monitor and analyze system behavior. Operational use cases include:

Searching and filtering logs in real time.
Correlating logs, traces, and audit records using traceId.

Dashboards can be customized to align with operational workflows and on-call requirements.
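As an illustration of trace-based correlation, the sketch below builds an OpenSearch query that fetches every record sharing one trace ID across both indices, ordered by time. It assumes that audit and tracing records both carry a `trace_id` field mapped as a keyword and a `timestamp` field mapped as a date; the audit format in this document does not list `trace_id`, so that part is an assumption. The helper names are hypothetical.

```python
def trace_query(trace_id, size=100):
    """Build a query body that returns all records for one trace, oldest first."""
    return {
        "size": size,
        "sort": [{"timestamp": {"order": "asc"}}],
        "query": {"term": {"trace_id": trace_id}},
    }


def search_path(indices=("audit_log_index", "tracing_index")):
    """Return the search endpoint path covering several indices at once."""
    return "/" + ",".join(indices) + "/_search"
```

The same query body can be pasted into the Dev Tools console of OpenSearch Dashboards to check the results before building a saved search or visualization.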

Data Retention and Governance

The platform does not enforce data retention policies. All retention, archival, and deletion rules are governed by the destination system (such as OpenSearch or the organization’s SIEM), allowing organizations to align data governance with regulatory, legal, and internal requirements.

Extensibility

The architecture supports:

Exporting audit logs to external SIEM platforms
Supporting multiple destinations in parallel

Testing and visualization are done using OpenSearch Dashboards.

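Fluent Bit can temporarily buffer events on disk if OpenSearch is unavailable.
Once OpenSearch is reachable again, Fluent Bit automatically forwards the buffered data.
This makes the pipeline well suited to log-based event ingestion, where events must not be lost during a destination outage.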
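The buffering behavior described above follows a store-and-forward pattern. The sketch below illustrates that pattern only; it is not how Fluent Bit is implemented. The class name `DiskBufferedForwarder`, the `send` callable, and the use of `ConnectionError` to signal an unavailable destination are all assumptions.

```python
import json
import os


class DiskBufferedForwarder:
    """Forward events to a destination, spooling them to disk while it is down."""

    def __init__(self, send, buffer_path):
        self._send = send  # callable(event); raises ConnectionError when the destination is down
        self._buffer_path = buffer_path

    def forward(self, event):
        """Deliver the event, or buffer it on disk. Returns True if delivered."""
        if not self.flush():
            self._append(event)  # older events stay ahead of this one
            return False
        try:
            self._send(event)
            return True
        except ConnectionError:
            self._append(event)
            return False

    def flush(self):
        """Try to deliver buffered events in order. Returns True if none remain."""
        if not os.path.exists(self._buffer_path):
            return True
        with open(self._buffer_path, encoding="utf-8") as fh:
            pending = [json.loads(line) for line in fh if line.strip()]
        for position, event in enumerate(pending):
            try:
                self._send(event)
            except ConnectionError:
                self._rewrite(pending[position:])  # keep only undelivered events
                return False
        os.remove(self._buffer_path)
        return True

    def _append(self, event):
        with open(self._buffer_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")

    def _rewrite(self, events):
        with open(self._buffer_path, "w", encoding="utf-8") as fh:
            for event in events:
                fh.write(json.dumps(event) + "\n")
```

Flushing the buffer before sending a new event preserves the original event order once the destination recovers.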

Logging Format

Data to be Logged

For each CRUD operation on a configuration item, the following information will be logged:

Field	Type	Description
`timestamp`	ISO 8601 date	When the event occurred (UTC timestamps in ISO 8601 format recommended). Example: `2024-11-28T09:00:00Z`
`user_id`	UUID	Unique identifier of the user who performed the action
`user_name`	string	Display name of the user
`action`	keyword	Operation performed: `CREATE`, `UPDATE`, or `DELETE`
`resource`	keyword	Entity acted upon, e.g. `teams`, `reasonCode`
`resource_id`	UUID	Unique identifier of the affected resource
`source_ip_address`	IP	IP address of the user at the time of the action
`attributes.service`	keyword	Name of the service that emitted the log, e.g. `UNIFIED_ADMIN`
`attributes.tenantId`	keyword	Tenant identifier
`attributes.updated_data`	object	Snapshot of the changed data (not searchable by design)
`type`	keyword	Log category: `audit_logging` or `tracing`
`level`	keyword	Severity level, e.g. `info`

Audit logs are stored in JSON format. This format is chosen for its flexibility, ease of parsing, and compatibility with OpenSearch.

Example log entry:

JSON

{
  "timestamp": "2024-11-28T09:00:00Z",
  "user_id": "c7a904cc-262f-41f3-988a-351f6326e004",
  "user_name": "john doe",
  "action": "UPDATE",
  "resource": "teams",
  "resource_id": "3e0b50a2-64fa-4051-8d16-3db6408fddec",
  "source_ip_address": "192.168.1.100",
  "attributes": {
    "service": "unified_admin",
    "tenantId": "expertflow",
    "updated_data": {
      "team_name": "Test team",
      "description": "team for testing teams feature"
    }
  },
  "type": "audit_logging",
  "level": "info"
}

All of the fields above are mandatory for audit logs.
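Since every audit field is mandatory, a producer or a test suite could check records before they are emitted. The following is a minimal sketch under that assumption; the function `audit_problems` and its message wording are hypothetical, while the field list and allowed actions come from the table above.

```python
MANDATORY_AUDIT_FIELDS = (
    "timestamp", "user_id", "user_name", "action", "resource", "resource_id",
    "source_ip_address", "attributes.service", "attributes.tenantId",
    "attributes.updated_data", "type", "level",
)
ALLOWED_ACTIONS = {"CREATE", "UPDATE", "DELETE"}

_MISSING = object()  # sentinel so that falsy values still count as present


def _lookup(record, dotted_path):
    """Follow a dotted path such as 'attributes.service' through nested dicts."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def audit_problems(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [
        f"missing field: {path}"
        for path in MANDATORY_AUDIT_FIELDS
        if _lookup(record, path) is _MISSING
    ]
    action = record.get("action")
    if action is not None and action not in ALLOWED_ACTIONS:
        problems.append(f"invalid action: {action}")
    if record.get("type") not in (None, "audit_logging"):
        problems.append("type must be audit_logging")
    return problems
```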

Tracing Format

Data to be Logged

For each CRUD operation on a configuration item, the following information will be logged:

Field	Type	Description
`timestamp`	ISO 8601 date	When the event occurred (UTC timestamps in ISO 8601 format recommended). Example: `2024-11-28T09:00:00Z`
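`trace_id`	hex string	Unique identifier for a trace (in our case, the correlation ID). The example below uses 32 hexadecimal characters.
`parent_span_id`	hex string	The `span_id` of the operation that created the current span. The example below uses 16 hexadecimal characters.
`span_id`	hex string	Unique identifier for a single operation (a "span") within a trace. The example below uses 16 hexadecimal characters.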
`status`	keyword	Status of the log: `OK`, `ERROR`
`tenantId`	keyword	Tenant identifier
`resource`	keyword	Entity acted upon, e.g. `teams`, `reasonCode`
`service`	keyword	Name of the service that emitted the log, e.g. `UNIFIED_ADMIN`
`attributes.user_id`	UUID	Unique identifier of the user who performed the action
`attributes.user_name`	string	Display name of the user
`attributes.resource_id`	UUID	Unique identifier of the affected resource
`attributes.source_ip_address`	IP	IP address of the user at the time of the action
`attributes.method_name`	string	Function name where the action occurs
`type`	keyword	Log category: `audit_logging` or `tracing`
`log_level`	keyword	Severity level, e.g. `INFO`, `DEBUG`, `ERROR`
`message`	string	Description of the event.

Example log entry:

JSON

{
  "timestamp": "2024-11-28T09:00:00Z",
  "trace_id": "abcdef1234567890abcdef1234567890",
  "span_id": "1234567890abcdef",
  "parent_span_id": "abcdef1234567890",
  "service": "unified-admin",
  "tenantId": "expertflow",
  "operation": "create-team",
  "log_level": "INFO",
  "message": "Team is saved successfully",
  "status": "OK",
  "attributes": {
    "source_ip_address": "192.168.1.100",
    "resource_id": "3e0b50a2-64fa-4051-8d16-3db6408fddec",
    "user_id": "user-123",
    "user_name": "test-agent",
    "method_name": "addMemberInTeam"
  },
  "error": {
    "error_code": "MONGO_CONNECT_TIMEOUT",
    "message": "Error occurred while saving the team's agent",
    "stack_trace": "Error at saveAgent (teamService.js:45)",
    "cause": "Timeout while communicating with MongoDB"
  },
  "type": "tracing"
}

timestamp, trace_id, service, message, log_level, status, and type are mandatory for tracing. span_id and parent_span_id are optional. log_level is one of INFO, DEBUG, or ERROR, and status is either OK or ERROR.
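The relationship between `trace_id`, `span_id`, and `parent_span_id` can be sketched as follows. The helper `trace_record` is hypothetical; the ID lengths mirror the example above (32 and 16 hexadecimal characters), and generating them with `secrets.token_hex` is an assumption rather than the platform's actual mechanism.

```python
import secrets
from datetime import datetime, timezone


def new_trace_id():
    """Return a random 32-character hexadecimal trace identifier."""
    return secrets.token_hex(16)


def new_span_id():
    """Return a random 16-character hexadecimal span identifier."""
    return secrets.token_hex(8)


def trace_record(service, message, trace_id=None, parent_span_id=None,
                 log_level="INFO", status="OK"):
    """Build a tracing record containing the mandatory fields plus span identifiers."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "trace_id": trace_id or new_trace_id(),  # reuse the caller's trace when given
        "span_id": new_span_id(),
        "service": service,
        "message": message,
        "log_level": log_level,
        "status": status,
        "type": "tracing",
    }
    if parent_span_id is not None:
        record["parent_span_id"] = parent_span_id  # omitted for the root span
    return record
```

A downstream operation passes the root record's `trace_id` and `span_id` (as `parent_span_id`) when building its own record, which is what allows the full request path to be reassembled later.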

Metrics Format

Data to be Logged

For each metric, the following information will be recorded:

name: The metric being recorded (e.g., cpu_usage, http_request).
unit: Measurement unit (e.g., %, ms, bytes).
data_points: List of recorded values (each metric can have multiple data points, each with its own timestamp).
- timestamp: When the data point was captured (UTC timestamps in ISO 8601 format recommended). Example: 2024-11-28T09:00:00Z
- value: The metric’s actual value (e.g., 75.5 for 75.5% CPU usage).
- attributes: Extra labels such as host.name or container_id.
- resource: Identifies the system or service that generated the metric (e.g., system-monitor, web-app).

Format

JSON

{
  "name": "cpu_usage",
  "unit": "%",
  "data_points": [
    {
      "timestamp": "2024-10-27T10:00:00Z",
      "value": 75.5,
      "attributes": {
        "host.name": "expertflow-test"
      },
      "resource": {
        "service.name": "system-monitor"
      }
    }
  ],
  "type": "metrics"
}

Tools such as OpenTelemetry Collector, Zipkin, Jaeger, New Relic, and Datadog collect logging, tracing, and metrics information in JSON structures similar to the three shown above.

name, unit, and at least one data point are mandatory for metrics.
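A metric in this format could be assembled incrementally, as in the sketch below. The helper names `new_metric`, `add_data_point`, and `is_complete` are hypothetical; the structure follows the JSON example above, and the completeness check encodes only the rule that name, unit, and at least one data point are required.

```python
from datetime import datetime, timezone


def new_metric(name, unit):
    """Create an empty metric record in the format shown above."""
    return {"name": name, "unit": unit, "data_points": [], "type": "metrics"}


def add_data_point(metric, value, attributes=None, resource=None, when=None):
    """Append one data point; 'when' should be a timezone-aware datetime if given."""
    moment = when or datetime.now(timezone.utc)
    metric["data_points"].append({
        "timestamp": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "value": value,
        "attributes": attributes or {},
        "resource": resource or {},
    })
    return metric


def is_complete(metric):
    """Check the mandatory parts: name, unit, and at least one data point."""
    return (
        bool(metric.get("name"))
        and bool(metric.get("unit"))
        and len(metric.get("data_points", [])) >= 1
    )
```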