Add C3ID implementation #21

nguerrera · 2025-01-08T16:01:25Z

This implements C3ID computation that we will need soon.

The implementation differs from the one in https://github.com/microsoft/security-utilities and this seeks to become the new standard.

Changes:

There is no use of "Microsoft"
There is no byte array to hex conversion in the intermediate computation
Optimized to avoid allocations
Raw value is truncated to 12 bytes instead of 15
"C3ID" is prefixed to the canonical representation in base64

Tests include a naive, unoptimized implementation that is easier to understand, and we check that both implementations match. We also check that we produce the same deterministic value for a sampling of hard-coded test input.

This change also fixes edge cases with empty span input in encoding polyfills.

This ports C3ID computation (that we will need soon) from https://github.com/microsoft/security-utilities Changes: - Company name is a parameter and not hard-coded to Microsoft - Optimized like the rest of Cask to avoid allocations and write the result to a span Tests added here were also run against the implementation in security-utilities to check equivalence. Also: fix edge cases with empty span input in encoding polyfills.

nguerrera · 2025-01-08T16:41:12Z

#22 should fix the build issue. Let's get that reviewed and merged first.

src/Cask/CrossCompanyCorrelatingId.cs

michaelcfanning · 2025-01-09T17:03:54Z

src/Cask/CrossCompanyCorrelatingId.cs

+
+namespace CommonAnnotatedSecurityKeys;
+
+internal static class CrossCompanyCorrelatingId


CrossCompanyCorrelatingId

Just a thought, but I tend to think that a nice summary element on every class type is a nice baseline requirement, even if projects don't take on the costs of authoring nicely rendered per-method/property comments as you've done here.

This class declaration would be a nice place to describe a c3id in general, to link to a reference doc, etc.

No action for you, I can take an action to fill this out later.

I've put a summary, but it will need more work. I will work on a proper specification and update the doc comments. I'll do that separately so that I can consume this in another branch, get out that PR, then work on the spec and docs in parallel with the review for that larger PR.

src/Cask/CrossCompanyCorrelatingId.cs

michaelcfanning · 2025-01-09T17:22:43Z

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

+    [InlineData("😁", "f/BTV0j6A8km4KDw7aJz")]
+    public void Test_Basic(string text, string expected)
+    {
+        string actual = ComputeC3IDBase64(company: "Microsoft", text);


Hey is the proper rendering C3ID or C3id?

Haha. It is a true fact that I treat this as a valid question despite the obvious triviality of the concern.

I think I prefer C3id but need I mention I have no desire to die on this hill? :)

Tehe. I think I prefer C3Id or C3ID over C3id. Id is a word so if the convention in C# is CreateUrl, C3Id would be the most consistent.

It does not appear in any identifier other than in tests. As such, my care factor is vanishingly small. I'm just as picky about these things in public API, so no judgment. 😁 I'll happily change it, but I still don't which to use...

src/Cask/CrossCompanyCorrelatingId.cs

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

michaelcfanning · 2025-01-09T23:33:10Z

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

+public class CrossCompanyCorrelatingIdTests
+{
+    [Theory]
+    [InlineData("", "C3IDKZnmJO6my5hknQ5W")]


C3IDKZnmJO6my5hknQ5W

this value could become very familiar to folks troubleshooting data issues. :) i think your idea about possibly putting some exceptions/data guards in place (for the text that we hash) could make. certainly failing a string.IsNullOrWhitespace check would qualify.

the problem is that it can be unsatisfying to add an incomplete set of guards in circumstances. but this one may be worth adding!

michaelcfanning · 2025-01-09T23:33:32Z

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

+    }
+
+    [Fact]
+    public void C3ID_NullInput_Throws()


C3ID_NullInput_Throws

how about strengthening this a bit, null, empty or all whitespace throws?

michaelcfanning · 2025-01-09T23:34:02Z

src/Cask/CrossCompanyCorrelatingId.cs

+public static class CrossCompanyCorrelatingId
+{
+    /// <summary>
+    /// The size of a Cross-Company Correlating ID (aka C3ID) in raw bytes.


aka

nit: remove aka

michaelcfanning · 2025-01-09T23:34:51Z

src/Cask/CrossCompanyCorrelatingId.cs

+    public const int RawSizeInBytes = 12;
+
+    private static ReadOnlySpan<byte> Prefix => "C3ID"u8;
+    private static ReadOnlySpan<byte> PrefixBase64Decoded => [0x0B, 0x72, 0x03];


PrefixBase64Decoded

add comment communicating that this is the bytewise encoding of the base64 encoded literal c3id?

michaelcfanning · 2025-01-09T23:35:16Z

src/Cask/CrossCompanyCorrelatingId.cs

+    /// </summary>
+    public static string Compute(string text)
+    {
+        ThrowIfNull(text);


ThrowIfNull

ThrowIfNullEmptyOrWhitespace and make it so?

michaelcfanning · 2025-01-09T23:41:19Z

src/Cask/CrossCompanyCorrelatingId.cs

+    /// Computes the raw C3ID bytes for the given text and writes them to the
+    /// destination span.
+    /// </summary>
+    public static void Compute(string text, Span<byte> destination)


Compute

You know, I've wondered, why do some API go ahead and return a span in some API and in others we provide an argument.

And you've answered it! If you would otherwise have an overload that would differ only by return type, you are stuck!

honestly, this is an (unavoidable as far as I can see) idiosyncrasy that's a bit regrettable.

but what can you do? the fact is that string is now effectively a replaceable entity and the only way to provide functionality that diverges is by unique parameterization.

another bit of weirdness is that in places where you can control the return value and use Span the ReadOnly version might be appropriate (and easy to forget to use??)

I'm going on about this too much, forgive me, it's fun to talk to you about this stuff, but I'm wondering about alternate API shapes, e.g.:

void Compute(string text, out string destination);
void Compute(string text, Span destination);

also, potentially zapping exception raising by simply providing these:

bool TryCompute(string text, out string destination);
bool TryCompute(string text, Span destination);

michaelcfanning · 2025-01-09T23:48:45Z

src/Cask/CrossCompanyCorrelatingId.cs

+    /// </summary>
+    public static void Compute(ReadOnlySpan<char> text, Span<byte> destination)
+    {
+        ThrowIfDestinationTooSmall(destination, RawSizeInBytes);


ThrowIfDestinationTooSmall

Ah, if we accept my idea around preventing empty or whitespace we'd want to add it here. So that the upstream doesn't duplicate the work.

I'm on the fence. On the one hand, it's perfectly legal to generate a hash for one or more tabs/spaces, etc. on the other hand, this data will never comprise something we should generate a c3id for. and certainly hashes of the empty string are a notorious source of data noise.

Your thoughts?

michaelcfanning · 2025-01-09T23:52:17Z

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

+            Span<byte> hash = SHA256.HashData(Encoding.UTF8.GetBytes(text));
+
+            // Prefix the result with "C3ID" UTF-8 bytes and hash again
+            hash = SHA256.HashData([.. "C3ID"u8, .. hash]);


HashData

Wait, aren't we supposed to be using the PrefixBase64Decoded const bytes here?

Ah, I see all tests are passing because your optimized version does the same thing. I thought we were going to use the decoded bytes directly not the UTF8 bytes of the encoded form.

nguerrera requested review from michaelcfanning, rwoll and suvamM as code owners January 8, 2025 16:01

nguerrera force-pushed the c3id branch from de47666 to a5c7249 Compare January 8, 2025 16:06

nguerrera closed this Jan 9, 2025

nguerrera reopened this Jan 9, 2025

michaelcfanning reviewed Jan 9, 2025

View reviewed changes

src/Cask/CrossCompanyCorrelatingId.cs Outdated Show resolved Hide resolved