Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add C3ID implementation #21

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Add C3ID implementation #21

wants to merge 3 commits into from

Conversation

nguerrera
Copy link
Contributor

@nguerrera nguerrera commented Jan 8, 2025

This implements C3ID computation that we will need soon.

The implementation differs from the one in https://github.com/microsoft/security-utilities and this seeks to become the new standard.

Changes:

  • There is no use of "Microsoft"
  • There is no byte array to hex conversion in the intermediate computation
  • Optimized to avoid allocations
  • Raw value is truncated to 12 bytes instead of 15
  • "C3ID" is prefixed to the canonical representation in base64

Tests include a naive, unoptimized implementation that is easier to understand, and we check that both implementations match. We also check that we produce the same deterministic value for a sampling of hard-coded test input.

This change also fixes edge cases with empty span input in encoding polyfills.

This ports C3ID computation (that we will need soon) from https://github.com/microsoft/security-utilities

Changes:
- Company name is a parameter and not hard-coded to Microsoft
- Optimized like the rest of Cask to avoid allocations and write the result to a span

Tests added here were also run against the implementation in security-utilities to check equivalence.

Also: fix edge cases with empty span input in encoding polyfills.
@nguerrera
Copy link
Contributor Author

#22 should fix the build issue. Let's get that reviewed and merged first.

@nguerrera nguerrera closed this Jan 9, 2025
@nguerrera nguerrera reopened this Jan 9, 2025

namespace CommonAnnotatedSecurityKeys;

internal static class CrossCompanyCorrelatingId
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CrossCompanyCorrelatingId

Just a thought, but I tend to think that a nice summary element on every class type is a nice baseline requirement, even if projects don't take on the costs of authoring nicely rendered per-method/property comments as you've done here.

This class declaration would be a nice place to describe a c3id in general, to link to a reference doc, etc.

No action for you, I can take an action to fill this out later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put a summary, but it will need more work. I will work on a proper specification and update the doc comments. I'll do that separately so that I can consume this in another branch, get out that PR, then work on the spec and docs in parallel with the review for that larger PR.

[InlineData("😁", "f/BTV0j6A8km4KDw7aJz")]
public void Test_Basic(string text, string expected)
{
string actual = ComputeC3IDBase64(company: "Microsoft", text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey is the proper rendering C3ID or C3id?

Haha. It is a true fact that I treat this as a valid question despite the obvious triviality of the concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer C3id but need I mention I have no desire to die on this hill? :)

Copy link
Member

@rwoll rwoll Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tehe. I think I prefer C3Id or C3ID over C3id. Id is a word so if the convention in C# is CreateUrl, C3Id would be the most consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not appear in any identifier other than in tests. As such, my care factor is vanishingly small. I'm just as picky about these things in public API, so no judgment. 😁 I'll happily change it, but I still don't which to use...

public class CrossCompanyCorrelatingIdTests
{
[Theory]
[InlineData("", "C3IDKZnmJO6my5hknQ5W")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C3IDKZnmJO6my5hknQ5W

this value could become very familiar to folks troubleshooting data issues. :) i think your idea about possibly putting some exceptions/data guards in place (for the text that we hash) could make. certainly failing a string.IsNullOrWhitespace check would qualify.

the problem is that it can be unsatisfying to add an incomplete set of guards in circumstances. but this one may be worth adding!

}

[Fact]
public void C3ID_NullInput_Throws()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C3ID_NullInput_Throws

how about strengthening this a bit, null, empty or all whitespace throws?

public static class CrossCompanyCorrelatingId
{
/// <summary>
/// The size of a Cross-Company Correlating ID (aka C3ID) in raw bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aka

nit: remove aka

public const int RawSizeInBytes = 12;

private static ReadOnlySpan<byte> Prefix => "C3ID"u8;
private static ReadOnlySpan<byte> PrefixBase64Decoded => [0x0B, 0x72, 0x03];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrefixBase64Decoded

add comment communicating that this is the bytewise encoding of the base64 encoded literal c3id?

/// </summary>
public static string Compute(string text)
{
ThrowIfNull(text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThrowIfNull

ThrowIfNullEmptyOrWhitespace and make it so?

/// Computes the raw C3ID bytes for the given text and writes them to the
/// destination span.
/// </summary>
public static void Compute(string text, Span<byte> destination)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compute

You know, I've wondered, why do some API go ahead and return a span in some API and in others we provide an argument.

And you've answered it! If you would otherwise have an overload that would differ only by return type, you are stuck!

honestly, this is an (unavoidable as far as I can see) idiosyncrasy that's a bit regrettable.

but what can you do? the fact is that string is now effectively a replaceable entity and the only way to provide functionality that diverges is by unique parameterization.

another bit of weirdness is that in places where you can control the return value and use Span the ReadOnly version might be appropriate (and easy to forget to use??)

I'm going on about this too much, forgive me, it's fun to talk to you about this stuff, but I'm wondering about alternate API shapes, e.g.:

void Compute(string text, out string destination);
void Compute(string text, Span destination);

also, potentially zapping exception raising by simply providing these:

bool TryCompute(string text, out string destination);
bool TryCompute(string text, Span destination);

/// </summary>
public static void Compute(ReadOnlySpan<char> text, Span<byte> destination)
{
ThrowIfDestinationTooSmall(destination, RawSizeInBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThrowIfDestinationTooSmall

Ah, if we accept my idea around preventing empty or whitespace we'd want to add it here. So that the upstream doesn't duplicate the work.

I'm on the fence. On the one hand, it's perfectly legal to generate a hash for one or more tabs/spaces, etc. on the other hand, this data will never comprise something we should generate a c3id for. and certainly hashes of the empty string are a notorious source of data noise.

Your thoughts?

Span<byte> hash = SHA256.HashData(Encoding.UTF8.GetBytes(text));

// Prefix the result with "C3ID" UTF-8 bytes and hash again
hash = SHA256.HashData([.. "C3ID"u8, .. hash]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashData

Wait, aren't we supposed to be using the PrefixBase64Decoded const bytes here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see all tests are passing because your optimized version does the same thing. I thought we were going to use the decoded bytes directly not the UTF8 bytes of the encoded form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants