
Service tag wrong randomly after restarts #22341

Open · bsteve99 opened this issue Jan 31, 2025 · 5 comments
Labels: sink: datadog_logs (Anything `datadog_logs` sink related), type: bug (A code related bug)

Comments


bsteve99 commented Jan 31, 2025

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Seemingly at random, after a restart my service tag comes out as `_RESERVED_service: blah` instead of `service: blah` when sending logging requests to Datadog. Restart again and it might be correct. Maybe a race condition?
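
For illustration only (a hypothetical fragment, not an actual capture), the tag reaches Datadog in one of these two shapes depending on the restart:

  # after a "good" restart
  service: blah

  # after a "bad" restart
  _RESERVED_service: blah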

The condition does not fix itself; I have to keep restarting Vector until it comes up working correctly.

I've tried versions 0.39.0 through 0.44.0 and the issue occurs in all of them.

I've done taps and do not see any difference in the JSON log output (prior to sending to DD) between when it works correctly and when it does not.

Configuration


Version

0.44.0

Debug Output


Example Data

No response

Additional Context

This is running on Linux.

References

No response

bsteve99 added the `type: bug` label on Jan 31, 2025
pront (Member) commented Jan 31, 2025

Hi @bsteve99, can you share your config (redact sensitive data first) and log samples so I can better understand the differences you attempted to describe above?

If I understand correctly, you are using a datadog_logs sink, and this sounds related to #22157 (comment).

bsteve99 (Author) commented Jan 31, 2025

I actually may have stumbled on a fix; after many restarts, no issues.

I was using the datadog_logs sink; I rewrote it as an http sink and it's looking good.

Old sink:
  _srv_dd_log_out:
    type: datadog_logs
    inputs:
      - _strip_verbose_log_lines
    compression: gzip
    default_api_key: "${DD_API_KEY:?DD_API_KEY must be specified}"
    tls:
      enabled: true
    batch:
      max_events: ${DATADOG_LOG_BATCHSIZE:-200}
      timeout_secs: 3
    buffer:
      type: memory
      max_events: 20
      when_full: block
    request:
      concurrency: adaptive

New sink:
  _srv_dd_log_out:
    type: http
    inputs:
      - _strip_verbose_log_lines
    compression: gzip
    uri: https://http-intake.logs.datadoghq.com/api/v2/logs
    method: post
    tls:
      verify_certificate: false
      verify_hostname: true
    batch:
      max_events: ${DATADOG_LOG_BATCHSIZE:-200}
      timeout_secs: 3
    encoding:
      codec: json
    buffer:
      type: memory
      max_events: 20
      when_full: block
    request:
      headers:
        DD-API-KEY: "${DD_API_KEY:?DD_API_KEY must be specified}"
        Content-Type: application/json
        Accept: application/json
      concurrency: adaptive

pront (Member) commented Jan 31, 2025

Sounds great. But I am also curious why datadog_logs didn't work out of the box.

pront added the `sink: datadog_logs` label on Jan 31, 2025
bsteve99 (Author) commented Jan 31, 2025

Yes, agreed. But the random `_RESERVED_service` should be enough info to see why this might happen; as I said, it seems like a race condition that might be worth looking into.

The other detail I have is that once the `_RESERVED_service` tag starts coming out, it never stops until we restart Vector. So some state is getting set and never re-evaluated; restarting Vector is the only solution. And once it comes up correct, it stays correct until the next restart, which may or may not leave it in a good or bad state.

pront (Member) commented Feb 6, 2025

> I've done taps and do not see any difference in the JSON log output (prior to sending to DD) between when it works correctly and when it does not.

I also like adding a console sink with the same inputs as the sink I want to debug.
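
A minimal sketch of that console sink, reusing the `_strip_verbose_log_lines` transform from your config (the sink name is just illustrative):

  _debug_console_out:
    type: console
    inputs:
      - _strip_verbose_log_lines
    encoding:
      codec: json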

It would help debug this further if we had some sample events. With vector tap you can easily get some; guide here.
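
Note that vector tap talks to Vector's API, which needs to be enabled in your Vector config; a minimal sketch of that (the address shown is Vector's default):

api:
  enabled: true
  address: "127.0.0.1:8686"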

I suspect that your source is populating the service field and the datadog_logs sink wants to use that field. In an effort to avoid data loss we move the existing value under the _RESERVED_service event path and populate service with a new value.

If you are interested, the code is here:

// if an existing attribute exists here already, move it so to not overwrite it.
// yes, technically the rename path could exist, but technically that could always be the case.
if log.contains(desired_path) {
    let rename_attr = format!("_RESERVED_{}", meaning);
    let rename_path = event_path!(rename_attr.as_str());
    warn!(
        message = "Semantic meaning is defined, but the event path already exists. Renaming to not overwrite.",
        meaning = meaning,
        renamed = &rename_attr,
        internal_log_rate_limit = true,
    );
    log.rename_key(desired_path, rename_path);
}

More debugging tools:

  • vector top to see component metrics - docs here
    • For example, you can inspect how many events are produced by sources vs how many events reach the sinks. In your case, you can compare http vs datadog_logs sink metrics.
    • You can also add an internal_metrics source and inspect the metrics (see the sketch after this list)
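
A minimal sketch of that internal_metrics setup (component names are illustrative):

sources:
  _internal_metrics_in:
    type: internal_metrics

sinks:
  _metrics_console_out:
    type: console
    inputs:
      - _internal_metrics_in
    encoding:
      codec: json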
