Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query Sanitization #769

Closed
jclusso opened this issue Nov 18, 2021 · 16 comments · Fixed by #819
Closed

Query Sanitization #769

jclusso opened this issue Nov 18, 2021 · 16 comments · Fixed by #819

Comments

@jclusso
Copy link

jclusso commented Nov 18, 2021

Is there a reason all database queries for MongoDB and Redis are wiped out with a ?. This makes it hard to trace the query if you don't know anything about it. When comparing to Elastic APM I can see the entire query for a traced transaction.

@shairyar
Copy link
Member

@jclusso can you drop us more details at [email protected] and share the query you see and how you wish it should appear instead?

@jclusso
Copy link
Author

jclusso commented Nov 19, 2021

@shairyar I feel the sanitization is not ideal. It's very hard to tell what your queries in a trace are for when everything is wiped out with question marks. We should sanitize the values of hashes so we can still see what is going on while maintaining the data privacy. Additionally, I think there should be options for this.

I propse we introduce a new configuration option called filter_query_parameters which would be an array of keys to filter. To maintain the current level of privacy, the default option would be ["*"] which would filter all keys.

Here is an example of the new way things could be sanitized.

{ 
  name: "Jarrett", 
  location: { state: "NY", country: "US" }, 
  logins: [
    { date: "1/1/2021", location: { state: "NY", country: "US" },
    { date: "1/7/2021", location: { state: "NJ", country: "US" }  
  ] 
}

should sanitize to

{ 
  name: "?", 
  location: { state: "?", country: "?" }, 
  logins: [
    { date: "?", location: { state: "?", country: "?" },
    { date: "?", location: { state: "?", country: "?" }  
  ] 
}

@shairyar
Copy link
Member

@shairyar
Copy link
Member

Thanks for reporting it, it's something we will be improving in the future. We will update this issue when progress has been made however there is no timeline available yet.

@jclusso
Copy link
Author

jclusso commented Nov 23, 2021

@shairyar we may work on improving this ourselves. If we do I'll make a PR. If you have any details you think are important, let me know.

@shairyar
Copy link
Member

Sure, feel free to ping us if you have any questions while you are working on it.

@jclusso
Copy link
Author

jclusso commented Nov 23, 2021

@shairyar I've gotten started on some changes here. There will be two new configuration options, a boolean for filter_redis_arguments and an array for filter_query_parameters.

Redis

This was easy to sort out. Since Redis doesn't really have query parameters, I figured it made the most sense to either sanitize all or nothing. There is a 1,000 character limit to prevent gigantic strings of data and arguments that would just clog things up.

Mongo Ruby Driver

I've reworked how the strategies work. The strategies now take into account the filter_query_parameters option and adjust the output accordingly. A few key changes from the original sanitization:

  • filiter_query_parameters set to ["*"] is the default value and provides almost the same sanitization as before, minus my few improvements mentioned below.
  • Any key at any nesting level that is inside filter_query_parameters will have it's value sanitized with a ?.
  • Removed the key sanitizer because it overcomplicates things and I can't think of any example where data would be in keys that needs to be sanitized. Additionally, it makes it much easier to identify what is going on by leaving the keys in.
  • Array sanitization now returns the first 10 items followed by "[...]" if there are more than 10.
  • Added support for aggregations. Right now I treat the entire aggregation with the sanitze_document strategy. I'm not sure how I want to handle this, but it seems very time consuming to manage and upkeep all the potential aggregation pipeline options.
  • All of this new stuff needs new tests to cover them.
  • All existing tests are updated and passing for original ways.

Elasticsearch and Moped

Both of these use the updated QueryParamsSanitizer. Any improvements to that will be impacted here. Unlike the Mongo Ruby Driver I left Moped alone. I can't imagine Moped is heavily used anymore and didn't seem worth the effort to further improve.

All of this can be seen so far here.

Let me know what you think so far.

@shairyar
Copy link
Member

I had a discussion about this with my colleague. We don't think we want to start adding filter_query_parameters options at the moment. We can improve how it works probably, but let's not add a config option to filter out select keys only for mongodb

@matsimitsu
Copy link
Member

matsimitsu commented Nov 24, 2021

As MongoDB/Elastic users ourselves I'd love to get more insight into the queries!

We do have the issue that we use this sanitized body to group events together (in the slow query list and for event metrics).

Ideally, the sanitized value would be the same no matter the query details for that specific query.
(e.g. a query with an array should return the same value for 1 item in the array and 100).

Array sanitization now returns the first 10 items followed by "[...]" if there are more than 10.

This would then probably have to be written such that it emits only the first item, to prevent us from detecting 11 more variants of this query. (1-11 items).

Removed the key sanitizer because it overcomplicates things and I can't think of any example where data would be in keys that needs to be sanitized. Additionally, it makes it much easier to identify what is going on by leaving the keys in.

We added the key sanitizer because it's possible to dig into embedded documents with the . notation, e.g. tags.0.name and tags.1.name or metrics.load_avg.minute.1.value and metrics.disk_usage.minute.59.value, ideally this would also return the same sanitized value.

We could also look into perhaps using different formatters for the display value vs the value we use to determine the uniqueness of queries.

@jclusso
Copy link
Author

jclusso commented Nov 24, 2021

Honestly, for us this (combined with some other things) is pretty much a deal breaker at the moment. I'll leave this code as is for you guys to figure out since it seems it will require changes internally for anything like this to ever work. We'll be happy to try again in the future.

@luismiramirez luismiramirez self-assigned this Jan 31, 2022
@luismiramirez
Copy link
Member

I will remove bulk sanitization and do a few tests to see how it behaves in the app to make a decision.

@luismiramirez
Copy link
Member

Hi @jclusso! The sanitization is now way less restrictive in the last version (3.0.21).

@jclusso
Copy link
Author

jclusso commented Feb 15, 2022

@luismiramirez that's great to hear! Any plans on the Redis queries?

@luismiramirez
Copy link
Member

@jclusso Sure! you can keep track of it here: #822

@jclusso
Copy link
Author

jclusso commented Oct 24, 2023

@luismiramirez was testing you guys out again and noticed you still don't output anything at all for aggregations. Is there any plans to add support for that?

@luismiramirez
Copy link
Member

@jclusso Hi there!

We've been focused on the Python integration and other stuff during this time. I'll bring this to the team to see if we can tackle it soon.

I will let you know as soon as we start working on it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants