Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only process unique certificates #2

Open
vanbroup opened this issue Apr 13, 2021 · 9 comments
Open

Only process unique certificates #2

vanbroup opened this issue Apr 13, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@vanbroup
Copy link
Contributor

The crt.sh query is including pre- and final- certificates, count might be higher than actual certificates with issues, should cover pre-certificates only.

@robstradling
Copy link
Contributor

@vanbroup CT compliance can be achieved without precertificates (i.e., SCTs can be sent via the signed_certificate_timestamp TLS extension or via OCSP stapling - see https://tools.ietf.org/html/rfc6962#section-3.3), so 44f22d2 will result in some certs (that don't have corresponding precerts) being missed.

@robstradling
Copy link
Contributor

Perhaps you feel it's accurate enough?

@vanbroup
Copy link
Contributor Author

vanbroup commented Apr 15, 2021

I saw you are using a sub query for this but suppose this comes at a much higher load. This with the simplicity of this method was the main reason to use this method for now.

Eventually we should be scanning all active certificates, and probably use your sub query, currently we are still optimizing the data sets.

How would you estimate the impact of the sub query?

@robstradling
Copy link
Contributor

I tried a few different approaches yesterday, but this is the best I could come up with:
robstradling@8866d3e

A quick comparison suggests that this is at least 5x slower than your current query (hence why that commit also reduces SELECT_LIMIT).

@vanbroup
Copy link
Contributor Author

Assuming we want to cover all certs, it would make sense to adopt that query, but if you suggest that it would only process about 100 certificates per minute (as the query timeout is 60 seconds) this is never going to finish.

I was successfully running in batches of 20.000 with fifty workers (without filtering for pre-certificates) and could complete a 110.000.000 certificate scan in a few days. But if you don't mind the load on crt.sh I'm happy to try this new query as it's better for the results.

@robstradling
Copy link
Contributor

I don't think there's any problem with the load on crt.sh. It's just the increase in total runtime that's undesirable.

Here's a partial implementation of an alternative approach, which would avoid that ~5x performance penalty for the SELECT query but which would require the result-set to be post-processed(*):
robstradling@fd1211f

(*) You would need to also feed the SHA-256(TBSCertificate with CT extensions removed) field into the CSV output file, then post-process the CSV data so that rows with the same SHA-256(TBSCertificate with CT extensions removed) are deduplicated.

@robstradling
Copy link
Contributor

Ooh, another, much simpler alternative would be to stick with the original query (85593b7), but change the first column of the CSV output to be https://crt.sh/?serial=<serialnumber> instead of https://crt.sh/?sha256=<SHA-256(Certificate)>.
Then simply pipe the CSV output through sort and uniq to deduplicate.

This is a less strict approach to deduplicating, but it should produce the same result (except for any rare cases where an Issuing CA incorrectly reuses a serial number).

@vanbroup
Copy link
Contributor Author

Post processing is probably a viable alternative, but it would initially process about 30 million certificates more than needed, in the worst-case scenario we would process 50% more than needed.

Would it be possible to query all certificates that have no pre-certificate logged?

@vanbroup vanbroup changed the title Don't include final certificates Only process unique certificates Apr 15, 2021
@vanbroup vanbroup reopened this Apr 15, 2021
@vanbroup vanbroup added the enhancement New feature or request label Apr 15, 2021
@robstradling
Copy link
Contributor

in the worst-case scenario we would process 50% more than needed.

But you'd still be processing all of that at least 5x more quickly (not including the sort | uniq step) than the alternative (robstradling@8866d3e), which I think is a net win.

Would it be possible to query all certificates that have no pre-certificate logged?

There's no index to help with that, so it would be even less efficient than each of the other options we've discussed so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants