-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only process unique certificates #2
Comments
@vanbroup CT compliance can be achieved without precertificates (i.e., SCTs can be sent via the |
Perhaps you feel it's accurate enough? |
I saw you are using a sub query for this but suppose this comes at a much higher load. This with the simplicity of this method was the main reason to use this method for now. Eventually we should be scanning all active certificates, and probably use your sub query, currently we are still optimizing the data sets. How would you estimate the impact of the sub query? |
I tried a few different approaches yesterday, but this is the best I could come up with: A quick comparison suggests that this is at least 5x slower than your current query (hence why that commit also reduces SELECT_LIMIT). |
Assuming we want to cover all certs, it would make sense to adopt that query, but if you suggest that it would only process about 100 certificates per minute (as the query timeout is 60 seconds) this is never going to finish. I was successfully running in batches of 20.000 with fifty workers (without filtering for pre-certificates) and could complete a 110.000.000 certificate scan in a few days. But if you don't mind the load on crt.sh I'm happy to try this new query as it's better for the results. |
I don't think there's any problem with the load on crt.sh. It's just the increase in total runtime that's undesirable. Here's a partial implementation of an alternative approach, which would avoid that ~5x performance penalty for the SELECT query but which would require the result-set to be post-processed(*): (*) You would need to also feed the SHA-256(TBSCertificate with CT extensions removed) field into the CSV output file, then post-process the CSV data so that rows with the same SHA-256(TBSCertificate with CT extensions removed) are deduplicated. |
Ooh, another, much simpler alternative would be to stick with the original query (85593b7), but change the first column of the CSV output to be This is a less strict approach to deduplicating, but it should produce the same result (except for any rare cases where an Issuing CA incorrectly reuses a serial number). |
Post processing is probably a viable alternative, but it would initially process about 30 million certificates more than needed, in the worst-case scenario we would process 50% more than needed. Would it be possible to query all certificates that have no pre-certificate logged? |
But you'd still be processing all of that at least 5x more quickly (not including the
There's no index to help with that, so it would be even less efficient than each of the other options we've discussed so far. |
The crt.sh query is including pre- and final- certificates, count might be higher than actual certificates with issues, should cover pre-certificates only.
The text was updated successfully, but these errors were encountered: