Only process unique certificates #2

vanbroup · 2021-04-13T11:23:54Z

The crt.sh query includes both pre-certificates and final certificates, so the count may be higher than the actual number of certificates with issues. It should cover pre-certificates only.

robstradling · 2021-04-14T21:00:48Z

@vanbroup CT compliance can be achieved without precertificates (i.e., SCTs can be sent via the signed_certificate_timestamp TLS extension or via OCSP stapling - see https://tools.ietf.org/html/rfc6962#section-3.3), so 44f22d2 will result in some certs (that don't have corresponding precerts) being missed.

robstradling · 2021-04-14T21:02:58Z

Perhaps you feel it's accurate enough?

vanbroup · 2021-04-15T04:40:52Z

I saw you are using a subquery for this, but I suppose it comes at a much higher load. That, together with the simplicity of the current method, was the main reason to use it for now.

Eventually we should be scanning all active certificates, and we will probably use your subquery then; currently we are still optimizing the data sets.

How would you estimate the impact of the subquery?

robstradling · 2021-04-15T09:48:47Z

I tried a few different approaches yesterday, but this is the best I could come up with:
robstradling@8866d3e

A quick comparison suggests that this is at least 5x slower than your current query (hence why that commit also reduces SELECT_LIMIT).

vanbroup · 2021-04-15T10:05:10Z

Assuming we want to cover all certs, it would make sense to adopt that query. But if you are suggesting that it would only process about 100 certificates per minute (as the query timeout is 60 seconds), this is never going to finish.

I was successfully running in batches of 20,000 with fifty workers (without filtering for pre-certificates) and could complete a scan of 110,000,000 certificates in a few days. But if you don't mind the load on crt.sh, I'm happy to try this new query, as it gives better results.
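As a rough back-of-the-envelope check on the runtime concern above, the sketch below estimates total scan time from the figures mentioned in this comment (about 100 certificates per 60-second query, 110,000,000 certificates, fifty workers). The function name `scan_days` is hypothetical, and the assumption that throughput scales linearly with the number of workers is mine, not something measured against crt.sh.

```python
SECONDS_PER_DAY = 86_400


def scan_days(total_certs: int, certs_per_query: int,
              query_seconds: float, workers: int = 1) -> float:
    """Estimate wall-clock days needed to scan total_certs.

    Assumes every query returns certs_per_query rows in query_seconds
    and that workers scale linearly (an idealisation).
    """
    certs_per_second = certs_per_query / query_seconds * workers
    return total_certs / certs_per_second / SECONDS_PER_DAY


TOTAL_CERTS = 110_000_000  # scan size quoted in the comment above

# 100 certificates per 60-second query, as feared in the comment above.
single_worker = scan_days(TOTAL_CERTS, certs_per_query=100, query_seconds=60)
fifty_workers = scan_days(TOTAL_CERTS, certs_per_query=100, query_seconds=60,
                          workers=50)

print(f"one worker:    {single_worker:.0f} days")
print(f"fifty workers: {fifty_workers:.1f} days")
```

Under these assumptions a single worker would need roughly two years, which is the sense in which the scan would "never finish"; the real figure depends on the actual rows returned per query.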

robstradling · 2021-04-15T10:12:25Z

I don't think there's any problem with the load on crt.sh. It's just the increase in total runtime that's undesirable.

Here's a partial implementation of an alternative approach, which would avoid that ~5x performance penalty for the SELECT query but which would require the result-set to be post-processed(*):
robstradling@fd1211f

(*) You would need to also feed the SHA-256(TBSCertificate with CT extensions removed) field into the CSV output file, then post-process the CSV data so that rows with the same SHA-256(TBSCertificate with CT extensions removed) are deduplicated.
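The post-processing step described in (*) could look roughly like the sketch below, which keeps the first row seen for each value of the SHA-256(TBSCertificate with CT extensions removed) column. The column layout (URL, TBS hash, lint result), the sample values, and the name `dedupe_rows` are all hypothetical; the real CSV produced by the tool may differ.

```python
import csv
import io
from typing import Iterable, Iterator, List


def dedupe_rows(rows: Iterable[List[str]], key_index: int) -> Iterator[List[str]]:
    """Yield only the first row seen for each value in column key_index."""
    seen = set()
    for row in rows:
        key = row[key_index]
        if key in seen:
            continue  # same TBS hash already emitted (e.g. precert + final cert)
        seen.add(key)
        yield row


# Hypothetical CSV: URL, SHA-256(TBSCertificate without CT extensions), lint result.
# The first two rows stand for a precertificate and its final certificate.
sample_csv = """\
https://crt.sh/?sha256=aaaa,tbs-hash-1,e_example_lint
https://crt.sh/?sha256=bbbb,tbs-hash-1,e_example_lint
https://crt.sh/?sha256=cccc,tbs-hash-2,e_example_lint
"""

reader = csv.reader(io.StringIO(sample_csv))
unique_rows = list(dedupe_rows(reader, key_index=1))

for row in unique_rows:
    print(",".join(row))
```

Holding every hash in an in-memory set is fine for a sketch; for a result set in the hundreds of millions, sorting the file on the hash column and dropping adjacent duplicates would likely be the more practical choice.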

robstradling · 2021-04-15T10:31:11Z

Ooh, another, much simpler alternative would be to stick with the original query (85593b7), but change the first column of the CSV output to be https://crt.sh/?serial=<serialnumber> instead of https://crt.sh/?sha256=<SHA-256(Certificate)>.
Then simply pipe the CSV output through sort and uniq to deduplicate.

This is a less strict approach to deduplicating, but it should produce the same result (except for any rare cases where an Issuing CA incorrectly reuses a serial number).
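To illustrate the serial-number idea, here is a minimal Python sketch of the same effect as piping the CSV through `sort` and `uniq`. The `https://crt.sh/?serial=<serialnumber>` URL form is taken from the comment above; the serial values, the second CSV column, and the helper names are hypothetical. Note that `uniq` only collapses lines that are identical in full, so this assumes the remaining columns match for a precertificate and its final certificate.

```python
from typing import Iterable, List


def crtsh_serial_url(serial_hex: str) -> str:
    """Build the serial-based crt.sh URL used as the first CSV column."""
    return f"https://crt.sh/?serial={serial_hex}"


def sort_uniq(lines: Iterable[str]) -> List[str]:
    """Equivalent of `sort | uniq`: sorted lines with exact duplicates removed."""
    return sorted(set(lines))


# Hypothetical scan output: (serial number, lint result).
# The first two entries stand for a precertificate and its final certificate,
# which share a serial number and therefore produce identical lines.
scan_results = [
    ("0a1b2c", "e_example_lint"),
    ("0a1b2c", "e_example_lint"),
    ("0d4e5f", "e_example_lint"),
]

csv_lines = [f"{crtsh_serial_url(serial)},{lint}" for serial, lint in scan_results]
deduplicated = sort_uniq(csv_lines)

for line in deduplicated:
    print(line)
```

As the comment notes, this is looser than hashing the TBSCertificate: two distinct certificates from CAs that happen to use the same serial number would also collapse into one line if the rest of the row matched.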

vanbroup · 2021-04-15T11:00:40Z

Post-processing is probably a viable alternative, but it would initially process about 30 million more certificates than needed; in the worst-case scenario we would process 50% more than needed.

Would it be possible to query all certificates that have no pre-certificate logged?

robstradling · 2021-04-15T11:42:45Z

> in the worst-case scenario we would process 50% more than needed.

But you'd still be processing all of that at least 5x more quickly (not including the sort | uniq step) than the alternative (robstradling@8866d3e), which I think is a net win.

> Would it be possible to query all certificates that have no pre-certificate logged?

There's no index to help with that, so it would be even less efficient than each of the other options we've discussed so far.

vanbroup closed this as completed in 44f22d2 Apr 14, 2021

vanbroup changed the title ~~Don't include final certificates~~ Only process unique certificates Apr 15, 2021

vanbroup reopened this Apr 15, 2021

vanbroup added the enhancement New feature or request label Apr 15, 2021
