Skip to content

Commit

Permalink
trace-load-limits: restrict trace load for applications (#671)
Browse files Browse the repository at this point in the history
* trace-load-limits: restrict trace load for applications

In a shared system many developers log as much as they
want into DLT. This can lead into overloading the logging
system resulting in bad performance and dropped logs.
This commit introduces trace load limits to restrict
applications to a certain log volume measured in bytes/s.
It is based on #134 but extends this heavily.

Trace load limits are configured via a space separted
configuraiton file.
The format of the file follows:

APPID [CTXID] SOFT_LIMIT HARD_LIMIT
The most matching entry will be selected for each log, meaning that
either app and context must match or at least the app id, for which a
default is created when not configured.
This allows to configure trace load for single contexts which can be
used for example to limit different applications in the qnx slog to a
given budget without affecting others or to give a file transfer
unlimited bandwidth.
It is recommended to always specify a budget for the application id
without the contexts to ensure new contexts and internal logs like DLTL
can be logged.

Applications are starting up with a default limit defined
via CMake variables TRACE_LOAD_USER_HARD_LIMIT, TRACE_LOAD_USER_SOFT_LIMIT.
As soon as the connection to the daemon was succesull each configuration
entry matching the app id will be sent to the client via an
application message.
If no configuration is found TRACE_LOAD_DAEMON_HARD_LIMIT and
TRACE_LOAD_USER_SOFT_LIMIT will be used instead.
The two staged configuration process makes sure that the daemon default
can be set to 0, forcing developers to define a limit for their
application while making sure that applications are able to log before
they received the log levels.

Measuring the trace load is done in the daemon and the client.
Dropping messages on the client side is the primary mechanism and
prevents sending logs to the daemon only to be dropped there, which
reduces the load of the IPC. Measuring again on daemon side makes
sure that rouge clients are still limited to the trace load defined.

Exceeding the limit soft will produce the following logs:
ECU1- DEMO DLTL log warn V 1 [Trace load exceeded trace soft limit on apid: DEMO. (soft limit: 2000 bytes/sec, current: 2414 bytes/sec)]
ECU1- DEMO DLTL log warn V 1 [Trace load exceeded trace soft limit on apid: DEMO, ctid TEST.(soft limit: 150 bytes/sec, current: 191 bytes/sec)]

Exceeding the hard limit will produce the same message but the text
'42 messages discarded.' is appended and messages will be dropped.
Dropped messages are lost and cannot be recovered, which forces
developers to get their logging volume under control.

As debug and trace load are usually disabled for production and thus do
not impact the performance of actual systems these logs are not
accounted for in trace load limits. In practice this turned out to be
very usefull to improve developer experience while maintaining good
performance, as devs know that debug and trace logs are only available
during development and important information has to be logged on a
higher level.

To simplify creating a trace limit base line the script
'utils/calculate-load.py' is provided which makes suggestions
for the limits based on actual log volume.

Signed-off-by: Alexander Mohr <[email protected]>

* trace-looad: fix review findings

* Add documentation
* remove broken statistics
* set defaults for unit tests differently.

* trace-load: protect settings with rwlock

Signed-off-by: Alexander Mohr <[email protected]>

* review: remove redundant ifdef

Signed-off-by: Alexander Mohr <[email protected]>

* review: fix malloc findings

Signed-off-by: Alexander Mohr <[email protected]>

* fix potential dead lock when dlt_free is not used correct

Signed-off-by: Alexander Mohr <[email protected]>

---------

Signed-off-by: Alexander Mohr <[email protected]>
  • Loading branch information
alexmohr authored Sep 27, 2024
1 parent 358ab08 commit 1533d0f
Show file tree
Hide file tree
Showing 23 changed files with 2,708 additions and 37 deletions.
49 changes: 49 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,8 @@ option(WITH_DLT_QNX_SYSTEM "Set to ON to build QNX system binary dlt-qnx-system"
option(WITH_DLT_FILE_LOGGING_SYSLOG_FALLBACK "Set to ON to enable fallback to syslog if dlt logging to file fails" OFF)
option(WITH_DLT_NETWORK_TRACE "Set to ON to enable network trace (if message queue is supported)" ON)
option(WITH_DLT_LOG_LEVEL_APP_CONFIG "Set to ON to enable default log levels based on application ids" OFF)
option(WITH_DLT_TRACE_LOAD_CTRL "Set to ON to enable trace load control in libdlt and dlt-daemon" OFF)


set(DLT_IPC "FIFO"
CACHE STRING "UNIX_SOCKET,FIFO")
Expand Down Expand Up @@ -370,6 +372,52 @@ if(WITH_DLT_LOG_LEVEL_APP_CONFIG)
add_definitions(-DDLT_LOG_LEVEL_APP_CONFIG)
endif()


if(WITH_DLT_TRACE_LOAD_CTRL)
add_definitions(-DDLT_TRACE_LOAD_CTRL_ENABLE)

# Configure limits for client
if(NOT TRACE_LOAD_USER_SOFT_LIMIT)
set(TRACE_LOAD_USER_SOFT_LIMIT 83333)
endif()

if(NOT TRACE_LOAD_USER_HARD_LIMIT)
set(TRACE_LOAD_USER_HARD_LIMIT 100000)
endif()

if (TRACE_LOAD_USER_SOFT_LIMIT GREATER TRACE_LOAD_USER_HARD_LIMIT)
message(FATAL_ERROR "TRACE_LOAD_USER_SOFT_LIMIT must be less or equal than TRACE_LOAD_USER_HARD_LIMIT")
endif()

add_definitions(-DDLT_TRACE_LOAD_CLIENT_HARD_LIMIT_DEFAULT=${TRACE_LOAD_USER_HARD_LIMIT})
add_definitions(-DDLT_TRACE_LOAD_CLIENT_SOFT_LIMIT_DEFAULT=${TRACE_LOAD_USER_SOFT_LIMIT})

# Configure limits for daemon
if(NOT TRACE_LOAD_DAEMON_SOFT_LIMIT)
if (WITH_DLT_UNIT_TESTS)
set(TRACE_LOAD_DAEMON_SOFT_LIMIT 83333)
else()
set(TRACE_LOAD_DAEMON_SOFT_LIMIT 0)
endif()
endif()

if(NOT TRACE_LOAD_DAEMON_HARD_LIMIT)
if (WITH_DLT_UNIT_TESTS)
set(TRACE_LOAD_DAEMON_HARD_LIMIT 100000)
else()
set(TRACE_LOAD_DAEMON_HARD_LIMIT 0)
endif()
endif()

if (TRACE_LOAD_DAEMON_SOFT_LIMIT GREATER TRACE_LOAD_DAEMON_HARD_LIMIT)
message(FATAL_ERROR "TRACE_LOAD_DAEMON_SOFT_LIMIT must be less or equal than TRACE_LOAD_DAEMON_HARD_LIMIT")
endif()

add_definitions(-DDLT_TRACE_LOAD_DAEMON_HARD_LIMIT_DEFAULT=${TRACE_LOAD_DAEMON_HARD_LIMIT})
add_definitions(-DDLT_TRACE_LOAD_DAEMON_SOFT_LIMIT_DEFAULT=${TRACE_LOAD_DAEMON_SOFT_LIMIT})

endif(WITH_DLT_TRACE_LOAD_CTRL)

add_subdirectory(doc)
add_subdirectory(src)
add_subdirectory(include)
Expand Down Expand Up @@ -489,6 +537,7 @@ message(STATUS "WITH_EXTENDED_FILTERING = ${WITH_EXTENDED_FILTERING}")
message(STATUS "WITH_DLT_DISABLE_MACRO = ${WITH_DLT_DISABLE_MACRO}")
message(STATUS "WITH_DLT_FILE_LOGGING_SYSLOG_FALLBACK = ${WITH_DLT_FILE_LOGGING_SYSLOG_FALLBACK}")
message(STATUS "WITH_DLT_LOG_LEVEL_APP_CONFIG = ${WITH_DLT_LOG_LEVEL_APP_CONFIG}")
message(STATUS "WITH_DLT_TRACE_LOAD_CTRL = ${WITH_DLT_TRACE_LOAD_CTRL}" )
message(STATUS "Change a value with: cmake -D<Variable>=<Value>")
message(STATUS "-------------------------------------------------------------------------------")
message(STATUS)
4 changes: 4 additions & 0 deletions doc/dlt-daemon.1.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,10 @@ COVESA system or more likely on a external tester device.
: Load an alternative configuration for app id log level defaults.
By default, the configuration file /etc/dlt-log-levels.conf is loaded.

-l

: Load an alternative trace load configuration file. By default the configuration file /etc/dlt-trace-load.conf is loaded.


# EXAMPLES

Expand Down
66 changes: 66 additions & 0 deletions doc/dlt_trace_limit.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
- [Trace load limits](#trace-load-limits)
- [Overview](#overview)
- [Technical concept](#technical-concept)

# Trace load limits

## Overview

In a shared system, many developers log as much as they want into DLT. It can overload the logging system, resulting in poor performance and dropped logs. This feature introduces trace load limits to restrict applications to a specified log volume measured in bytes/s.

Trace load limits are configured via a space-separated configuration file.
The format of the file follows:

> APPID [CTXID] SOFT_LIMIT HARD_LIMIT
The entry with the most matching criteria will be selected for each log, meaning the app and context or the app ID must match. For non-configured applications, default limits are applied.
It allows configuring trace load for single contexts, which can be used, for example, to limit different applications in the QNX slog to a given budget without affecting others or to provide a file transfer unlimited bandwidth.
You should always specify a budget for the application ID without the contexts to ensure that new contexts and internal logs like DLTL can be logged.

Applications start with a default limit defined via CMake variables TRACE_LOAD_USER_HARD_LIMIT, TRACE_LOAD_USER_SOFT_LIMIT.
Once the connection to the daemon is established, each configuration entry matching the app ID will be sent to the client via application message. If no configuration is found, TRACE_LOAD_DAEMON_HARD_LIMIT and TRACE_LOAD_USER_SOFT_LIMIT will be used instead.
The two-stage configuration process ensures that the daemon default can be set to 0, forcing developers to define a limit for their application while ensuring that applications can log before they receive the log levels.

Measuring the trace load is done in the daemon and the client.
Dropping messages on the client side is the primary mechanism and prevents sending logs to the daemon only to be dropped there,
reducing the IPC's load. Measuring again on the daemon side ensures that rouge clients are still limited to the defined trace load.

Exceeding the soft limit will produce the following logs:

```
ECU1- DEMO DLTL log warn V 1 [Trace load exceeded trace soft limit on apid: DEMO. (soft limit: 2000 bytes/sec, current: 2414 bytes/sec)]
ECU1- DEMO DLTL log warn V 1 [Trace load exceeded trace soft limit on apid: DEMO, ctid TEST.(soft limit: 150 bytes/sec, current: 191 bytes/sec)]
```

Exceeding the hard limit will produce the same message but with the appended text '42 messages discarded.' And (obviously) messages will be dropped. Dropped messages are lost and cannot be recovered, which forces developers to get their logging volume under control.

As debug and trace load are usually disabled for production and thus do not impact the performance of actual systems, these logs are not accounted for in trace load limits. In practice, this was very useful for improving the developer experience while maintaining good performance, as developers know that debugging and trace logs are only available during development and that essential information has to be logged at a higher level.

To simplify creating a trace limit baseline, the script 'utils/calculate-load.py' is provided, which suggests limits based on actual log volume.

## Technical concept

The bandwidth is calculated based on the payload of each message of every application or context. Each limit gets assigned several slots (default 60s) with a width of 1s, called a window. The sum of all slots represents the current bandwidth.

The `dlt_check_trace_load` function is the main entry point for handling trace load. It receives the payload size and the trace load configuration (among other parameters, but these are the most important) and returns whether the message should be kept. A return value of `true` means the message can be logged, while `false` means it should be dropped. This method is called for each message received by the daemon and for each message sent by the client. Doing this on the client side ensures that the daemon is not overloaded by clients sending too much data. Checking it on the daemon side ensures that rouge clients are still limited to the defined trace load.

Within the `dlt_check_trace_load` function, the first step is to check if the current slot is still valid by checking its time frame. It is done in the `dlt_switch_slot_if_needed` function, which also does sanity checks and cleans the window if necessary.

Cleaning does the following things:

1. Check if a timestamp overflow occurred, which will happen after ~119 hours. In this case, the time calculation will be reset.
2. Calculate the number of elapsed slots since receiving the last message.
3. The window is reset if the calculated amount of elapsed slots is larger than the window size (`total_bytes_of_window`).
4. Otherwise, remove data from slots that are not in the current window anymore.
5. Output errors when the limits are exceeded.

After the preparation, the actual recording of the trace load is calculated. The calculation is done in the following way:

1. Add the payload size to the current slot.
2. Increment the total size of the window.
3. Save the current slot to the window so that we can use this in the next iteration.
4. Calculate the average trace load of the window. It is an average in byte/s over the window size. The average value is used to ensure that short spikes do not immediately lead to dropped messages.

The trace load configuration contains two flags, `is_over_soft_limit` and `is_over_hard_limit`. When the limits are exceeded, they will be set.
If a new message exceeds the hard limits, it will be dropped, and its size will be removed from the window as the message is dropped.
It will also increment the `hard_limit_over_counter`, which counts the number of dropped messages. `hard_limit_over_counter` is a part of the message logged in case of exceeding the hard limit.
122 changes: 122 additions & 0 deletions include/dlt/dlt_common.h
Original file line number Diff line number Diff line change
Expand Up @@ -828,6 +828,128 @@ extern "C"
{
# endif


#ifdef DLT_TRACE_LOAD_CTRL_ENABLE
/* For trace load control feature */

#include <pthread.h>
/* For trace load control */

/* Number of slots in window for recording trace load (Default: 60)
* Average trace load in this window will be used as trace load
* Older time data than this size will be removed from trace load
*/
#define DLT_TRACE_LOAD_WINDOW_SIZE (60)

/* Window resolution in unit of timestamp (Default: 10000 x 0.1 msec = 1 sec)
* This value is same as size of 1 slot of window.
* Actual window size in sec can be calculated by
* DLT_TRACE_LOAD_WINDOW_SIZE x DLT_TRACE_LOAD_WINDOW_RESOLUTION / DLT_TIMESTAMP_RESOLUTION.
* (Default: 60 x 10000 / 10000 = 60 sec)
* FIXME: When timestamp resolution of dlt is changed from 0.1 msec,
* then DLT_TRACE_LOAD_WINDOW_RESOLUTION value also has to be updated accordingly.
*/
#define DLT_TRACE_LOAD_WINDOW_RESOLUTION (10000)

/* Special Context ID for output soft_limit/hard_limit over warning message (DLT LIMITS) */
#define DLT_TRACE_LOAD_CONTEXT_ID ("DLTL")

/* Frequency in which warning messages are logged in seconds when an application is over the soft limit
* Unit of this value is Number of slot of window.
* NOTE: Size of the slot depends on value of DLT_TRACE_LOAD_WINDOW_RESOLUTION
* (Default: 10 slots = 10000 x 0.1 msec = 10 sec)
*/
#define DLT_SOFT_LIMIT_WARN_FREQUENCY (10)

/* Frequency in which warning messages are logged in seconds when an application is over the hard limit
* Unit of this value is Number of slot of window.
* NOTE: Size of the slot depends on value of DLT_TRACE_LOAD_WINDOW_RESOLUTION
* (Default: 10 slots = 10000 x 0.1 msec = 10 sec)
*/
#define DLT_HARD_LIMIT_WARN_FREQUENCY (10)

/* Timestamp resolution of 1 second (Default: 10000 -> 1/10000 = 0.0001sec = 0.1msec)
* This value is defined as reciprocal of the resolution (1 / DLT_TIMESTAMP_RESOLUTION)
* FIXME: When timestamp resolution of dlt is changed from 0.1 msec,
* then DLT_TIMESTAMP_RESOLUTION value also has to be updated accordingly.
*/
#define DLT_TIMESTAMP_RESOLUTION (10000)

typedef struct
{
// Window for recording total bytes for each slots [bytes]
uint64_t window[DLT_TRACE_LOAD_WINDOW_SIZE];
uint64_t total_bytes_of_window; // Grand total bytes of whole window [bytes]
uint32_t curr_slot; // Current slot No. of window [slot No.]
uint32_t last_slot; // Last slot No. of window [slot No.]
uint32_t curr_abs_slot; // Current absolute slot No. of window [slot No.]
uint32_t last_abs_slot; // Last absolute slot No. of window [slot No.]
uint64_t avg_trace_load; // Average trace load of whole window [bytes/sec]
uint32_t hard_limit_over_counter; // Discarded message counter due to hard limit over [msg]
uint32_t hard_limit_over_bytes; // Discarded message bytes due to hard limit over [msg]
uint32_t slot_left_soft_limit_warn; // Slot left to output next warning of soft limit over [slot No.]
uint32_t slot_left_hard_limit_warn; // Slot left to output next warning of hard limit over [slot No.]
bool is_over_soft_limit; // Flag if trace load has been over soft limit
bool is_over_hard_limit; // Flag if trace load has been over hard limit
} DltTraceLoadStat;

/*
* The parameter of trace load settings
*/
typedef struct
{
char apid[DLT_ID_SIZE]; /**< Application id for which the settings are valid */
char ctid[DLT_ID_SIZE]; /**< Context id for which the settings are valid, this is optional */

uint32_t soft_limit; /**< Warning threshold, if load is above soft limit a warning will be logged but message won't be discarded */
uint32_t hard_limit; /**< limit threshold, if load is above hard limit a warning will be logged and message will be discarded */

DltTraceLoadStat tl_stat;
} DltTraceLoadSettings;

extern pthread_rwlock_t trace_load_rw_lock;

#ifndef UINT32_MAX
#define UINT32_MAX 0xFFFFFFFF
#endif

/* Precomputation */
static const uint64_t TIMESTAMP_BASED_WINDOW_SIZE = DLT_TRACE_LOAD_WINDOW_SIZE * DLT_TRACE_LOAD_WINDOW_RESOLUTION;
typedef DltReturnValue (DltLogInternal)(DltLogLevelType loglevel, const char *text, void* params);

/**
* Check if the trace load is within the limits.
* This adds the current message size to the trace load and checks if it is within the limits.
* It's the main entry point for trace load control.
* The function also handles output of warning messages when the trace load is over the limits.
* @param tl_settings The trace load settings for the message source.
* @param log_level Which log level should be used to output the trace load exceeded messages.
* @param timestamp The timestamp of the message, used to calculate the current slot.
* @param size Size of the payload.
* @param internal_dlt_log Function to output the trace load exceeded messages.
* @param internal_dlt_log_params Additional parameters for the internal_dlt_log function.
* @return True if the trace load is within the limits, false otherwise.
* False means that the message should be discarded.
*/
bool dlt_check_trace_load(
DltTraceLoadSettings* tl_settings,
int32_t log_level,
uint32_t timestamp,
int32_t size,
DltLogInternal internal_dlt_log,
void *internal_dlt_log_params);

/**
* Find the runtime trace load settings for the given application id and context id.
* @param settings Array with all settings
* @param settings_count Size of settings
* @param apid The apid to search for
* @param ctid The context id to search for, can be NULL
* @return A sorted array with all settings that match the given apid and ctid
*/
DltTraceLoadSettings* dlt_find_runtime_trace_load_settings(DltTraceLoadSettings *settings, uint32_t settings_count, const char* apid, const char* ctid);
#endif

/**
* Helper function to print a byte array in hex.
* @param ptr pointer to the byte array.
Expand Down
3 changes: 3 additions & 0 deletions include/dlt/dlt_types.h
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,9 @@ typedef unsigned int speed_t;
*/
typedef enum
{
#ifdef DLT_TRACE_LOAD_CTRL_ENABLE
DLT_RETURN_LOAD_EXCEEDED = -9,
#endif
DLT_RETURN_FILESZERR = -8,
DLT_RETURN_LOGGING_DISABLED = -7,
DLT_RETURN_USER_BUFFER_FULL = -6,
Expand Down
3 changes: 3 additions & 0 deletions include/dlt/dlt_user.h.in
Original file line number Diff line number Diff line change
Expand Up @@ -268,6 +268,9 @@ typedef struct
DltUserConnectionState connection_state;
# endif
uint16_t log_buf_len; /**< length of message buffer, by default: DLT_USER_BUF_MAX_SIZE */
#ifdef DLT_TRACE_LOAD_CTRL_ENABLE
pthread_rwlock_t trace_load_limit_lock;
#endif
} DltUser;

typedef int (*dlt_injection_callback_id)(uint32_t, void *, uint32_t, void *);
Expand Down
6 changes: 6 additions & 0 deletions src/daemon/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -104,3 +104,9 @@ if (WITH_DLT_LOG_LEVEL_APP_CONFIG)
DESTINATION ${CONFIGURATION_FILES_DIR}
COMPONENT base)
endif()

if (WITH_DLT_TRACE_LOAD_CTRL)
INSTALL(FILES dlt-trace-load.conf
DESTINATION ${CONFIGURATION_FILES_DIR}
COMPONENT base)
endif()
Loading

0 comments on commit 1533d0f

Please sign in to comment.