Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(clustering)!: use new DbscanCluster type instead of isize #233

Merged
merged 1 commit into from
Jan 12, 2025

Conversation

sd2k
Copy link
Collaborator

@sd2k sd2k commented Jan 12, 2025

Add a new DbscanCluster type to represent the cluster labels, which
can be either a cluster ID or a special value indicating noise. The
Cluster variant is now a NonZeroU32 instead of an isize.

We could use a u32 but using a NonZeroU32 allows us to make use
of the niche optimizations that NonZeroU32 provides and halves the
size of the type. Questionable whether it's worth it though?

Closes #229.

Summary by CodeRabbit

  • New Features

    • Introduced DbscanCluster enum to provide a more structured representation of clustering results.
    • Added methods to manage cluster identification and noise detection.
  • Bug Fixes

    • Improved cluster representation to ensure more robust and type-safe clustering output.
  • Refactor

    • Updated clustering implementation to use new DbscanCluster type across multiple components.
    • Modified return types and assertions to work with the new cluster representation.
    • Enhanced test structure for clarity and maintainability.

Copy link
Contributor

coderabbitai bot commented Jan 12, 2025

Warning

Rate limit exceeded

@sd2k has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 4 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 12d5d73 and 48fd9e4.

📒 Files selected for processing (9)
  • book/book.toml (1 hunks)
  • book/src/tutorials/clustering.md (2 hunks)
  • crates/augurs-clustering/README.md (2 hunks)
  • crates/augurs-clustering/src/lib.rs (7 hunks)
  • crates/augurs/tests/integration.rs (2 hunks)
  • crates/pyaugurs/src/clustering.rs (1 hunks)
  • examples/clustering/examples/dbscan_clustering.rs (2 hunks)
  • js/augurs-clustering-js/src/lib.rs (1 hunks)
  • js/testpkg/clustering.test.ts (3 hunks)

Walkthrough

The pull request introduces a new DbscanCluster enum to replace the previous isize-based cluster representation in the augurs-clustering library. This change enhances the type safety and clarity of cluster assignments by providing a more structured approach to representing clustering results. The modification spans multiple files, updating the implementation, tests, Python bindings, and example code to use the new DbscanCluster type, which can represent either a specific cluster or noise points.

Changes

File Change Summary
crates/augurs-clustering/README.md Updated to include DbscanCluster in usage example
crates/augurs-clustering/src/lib.rs Added DbscanCluster enum with methods like is_noise(), is_cluster(), as_i32() and modified fit() method to return Vec<DbscanCluster>
crates/augurs/tests/integration.rs Added convert_clusters() helper function to convert DbscanCluster to i32 for testing
crates/pyaugurs/src/clustering.rs Updated fit() method return type and implementation to use DbscanCluster
examples/clustering/examples/dbscan_clustering.rs Modified cluster assertions and added output to use DbscanCluster
js/augurs-clustering-js/src/lib.rs Updated fit() method return type and implementation to use i32 instead of isize

Assessment against linked issues

Objective Addressed Explanation
Remove use of isize [#229]

Possibly related PRs

Suggested reviewers

  • edwardcqian
  • gitdoluquita

Poem

🐰 Clustering magic, a new design so bright,
No more isize, our types now take flight!
DbscanCluster emerges, with structure so clean,
From noise to clusters, a rabbit's dream scene!
Code transforms with elegance and might! 🚀


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
crates/augurs-clustering/src/lib.rs (3)

42-47: Improve panic message in increment method.

The panic message "cluster ID overflow" could be more descriptive to help with debugging.

-            Self::Cluster(id) => *id = id.checked_add(1).expect("cluster ID overflow"),
+            Self::Cluster(id) => *id = id.checked_add(1).expect("maximum cluster ID (4,294,967,295) reached"),

50-60: Document test-only PartialEq implementation.

Add documentation to explain why this implementation exists and its test-only purpose.

 // Simplify tests by allowing comparisons with i32.
+// This implementation allows direct comparison with legacy test vectors that use i32,
+// where -1 represents noise and positive numbers represent cluster IDs.
 #[cfg(test)]
 impl PartialEq<i32> for DbscanCluster {

209-238: Consider using a more maintainable format for large test vectors.

The large test vector could be more maintainable if stored in a separate test data file or formatted using a more readable pattern.

Consider one of these approaches:

  1. Move the vector to a separate test data file (preferred)
  2. Use a more compact format with line breaks at logical boundaries
  3. Generate the vector programmatically if it follows a pattern

Example of approach 2:

let expected = vec![
    // Cluster 1 (main cluster)
    &[1; 34][..],
    // Cluster 2
    &[2; 3][..],
    // Back to cluster 1
    &[1; 456][..],
    // Mixed section
    3, -1, 3, -1, 3,
    // Remaining cluster 1
    &[1; 283][..],
].concat();
examples/clustering/examples/dbscan_clustering.rs (1)

57-66: Enhance example documentation for the new cluster type.

While the example correctly demonstrates the usage, it would be more helpful to add comments explaining the new cluster type representation.

     println!("Clusters: {:?}", clusters);
     assert_eq!(
         clusters,
         vec![
+            // Cluster ID 1 (using NonZeroU32 internally)
             DbscanCluster::Cluster(1.try_into().unwrap()),
             DbscanCluster::Cluster(1.try_into().unwrap()),
+            // Cluster ID 2
             DbscanCluster::Cluster(2.try_into().unwrap()),
             DbscanCluster::Cluster(2.try_into().unwrap()),
+            // Points not belonging to any cluster
             DbscanCluster::Noise,
         ]
     );
crates/augurs-clustering/README.md (1)

35-44: Enhance example readability and coverage.

While the example correctly demonstrates the new DbscanCluster type, consider these improvements:

  1. Add an example of noise points to demonstrate both variants of DbscanCluster
  2. Consider extracting the try_into().unwrap() into a helper function or const to reduce repetition
  3. Add a comment explaining why NonZeroU32 is used for cluster IDs

Example improvement:

// Cluster IDs use NonZeroU32 for memory optimization
let cluster1: NonZeroU32 = 1.try_into().unwrap();
let cluster2: NonZeroU32 = 2.try_into().unwrap();

assert_eq!(
    clusters,
    vec![
        DbscanCluster::Cluster(cluster1),
        DbscanCluster::Cluster(cluster1),
        DbscanCluster::Cluster(cluster1),
        DbscanCluster::Cluster(cluster2),
        DbscanCluster::Cluster(cluster2),
    ],
);

// Example with noise points
let clusters_with_noise = DbscanClusterer::new(0.1, 3).fit(&distance_matrix);
assert!(clusters_with_noise.iter().any(|c| matches!(c, DbscanCluster::Noise)));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e098590 and 5ee35f3.

📒 Files selected for processing (5)
  • crates/augurs-clustering/README.md (2 hunks)
  • crates/augurs-clustering/src/lib.rs (7 hunks)
  • crates/augurs/tests/integration.rs (2 hunks)
  • crates/pyaugurs/src/clustering.rs (1 hunks)
  • examples/clustering/examples/dbscan_clustering.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: windows (windows-latest, x86)
  • GitHub Check: linux (ubuntu-22.04, ppc64le)
  • GitHub Check: windows (windows-latest, x64)
  • GitHub Check: musllinux (ubuntu-22.04, armv7)
  • GitHub Check: linux (ubuntu-22.04, s390x)
  • GitHub Check: musllinux (ubuntu-22.04, aarch64)
  • GitHub Check: linux (ubuntu-22.04, armv7)
  • GitHub Check: musllinux (ubuntu-22.04, x86)
  • GitHub Check: linux (ubuntu-22.04, aarch64)
  • GitHub Check: macos (macos-14, aarch64)
  • GitHub Check: musllinux (ubuntu-22.04, x86_64)
  • GitHub Check: Tests
  • GitHub Check: linux (ubuntu-22.04, x86)
  • GitHub Check: Test Book
  • GitHub Check: macos (macos-13, x86_64)
  • GitHub Check: linux (ubuntu-22.04, x86_64)
  • GitHub Check: Clippy
  • GitHub Check: Prophet WASMStan component
  • GitHub Check: Python tests
  • GitHub Check: Check
🔇 Additional comments (4)
crates/pyaugurs/src/clustering.rs (1)

85-94: LGTM! Clean Python binding implementation.

The conversion to i32 is appropriate for Python/numpy compatibility, and the implementation correctly maps the new DbscanCluster type.

crates/augurs/tests/integration.rs (1)

21-23: LGTM! Well-structured test adaptation.

The convert_clusters helper function provides a clean way to adapt the tests to the new type system while maintaining readability.

crates/augurs-clustering/README.md (2)

11-11: LGTM! Import statement updated to reflect new type.

The addition of DbscanCluster to the imports clearly shows users the available types and aligns with the new type system.


35-44: Verify example correctness with the actual implementation.

Let's ensure the example matches the actual implementation and test cases.

@sd2k sd2k force-pushed the custom-cluster-type branch from 5ee35f3 to 12d5d73 Compare January 12, 2025 13:27
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
crates/augurs-clustering/src/lib.rs (1)

209-238: Consider moving the large test vector to a separate file.

The large test vector makes the test file harder to maintain. Consider moving it to a separate test data file, similar to how dist.csv is handled.

-        let expected = vec![
-            1, 1, 1, 1, /* ... many lines ... */ 1, 1, 1, 1,
-        ];
+        let expected: Vec<i32> = include_str!("../data/expected_clusters.txt")
+            .split(',')
+            .map(|s| s.trim().parse().unwrap())
+            .collect();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ee35f3 and 12d5d73.

📒 Files selected for processing (6)
  • crates/augurs-clustering/README.md (2 hunks)
  • crates/augurs-clustering/src/lib.rs (7 hunks)
  • crates/augurs/tests/integration.rs (2 hunks)
  • crates/pyaugurs/src/clustering.rs (1 hunks)
  • examples/clustering/examples/dbscan_clustering.rs (2 hunks)
  • js/augurs-clustering-js/src/lib.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/augurs/tests/integration.rs
  • crates/pyaugurs/src/clustering.rs
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: linux (ubuntu-22.04, ppc64le)
  • GitHub Check: windows (windows-latest, x86)
  • GitHub Check: linux (ubuntu-22.04, s390x)
  • GitHub Check: musllinux (ubuntu-22.04, armv7)
  • GitHub Check: windows (windows-latest, x64)
  • GitHub Check: musllinux (ubuntu-22.04, aarch64)
  • GitHub Check: linux (ubuntu-22.04, armv7)
  • GitHub Check: musllinux (ubuntu-22.04, x86)
  • GitHub Check: linux (ubuntu-22.04, aarch64)
  • GitHub Check: musllinux (ubuntu-22.04, x86_64)
  • GitHub Check: Test Book
  • GitHub Check: linux (ubuntu-22.04, x86)
  • GitHub Check: Tests
  • GitHub Check: macos (macos-14, aarch64)
  • GitHub Check: linux (ubuntu-22.04, x86_64)
  • GitHub Check: Clippy
  • GitHub Check: JS tests
  • GitHub Check: Check
  • GitHub Check: Prophet WASMStan component
  • GitHub Check: macos (macos-13, x86_64)
🔇 Additional comments (6)
crates/augurs-clustering/src/lib.rs (3)

7-48: Well-designed type implementation!

The DbscanCluster enum with NonZeroU32 leverages niche optimization, reducing the enum size. The implementation includes comprehensive helper methods and proper documentation.


50-60: Good test helper implementation!

The PartialEq<i32> implementation is appropriately scoped to tests and handles all cases correctly.


Line range hint 98-146: Verify all callers handle the new return type.

The implementation correctly uses the new DbscanCluster type. Let's verify that all callers are updated to handle the new return type.

✅ Verification successful

All callers properly handle the new DbscanCluster type

The change is safe as backward compatibility is maintained through:

  • PartialEq implementation allowing direct comparisons with integers
  • Proper conversion methods between old and new representations
  • Tests and examples showing correct usage of both styles
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all direct callers of the fit method to verify they handle the new return type

# Search for direct calls to fit
rg -p "\.fit\(&.*\)" --type rust

# Search for any remaining isize type annotations that might need updating
rg -p ": Vec<isize>" --type rust

Length of output: 2819


Script:

#!/bin/bash
# Search for usage patterns of the fit return value
rg -p "let .* = .*\.fit\(&.*\)" -A 3 --type rust

# Search for any pattern matching or type annotations related to clusters
ast-grep --pattern 'match $_ {
  DbscanCluster::$$ => $$$
}'

Length of output: 4510


Script:

#!/bin/bash
# Find DbscanCluster type definition
ast-grep --pattern 'enum DbscanCluster {
  $$$
}'

# Find convert_clusters implementation
ast-grep --pattern 'fn convert_clusters($_) -> $$ {
  $$$
}'

# Find any direct comparisons with Vec<isize>
rg "assert_eq.*vec!\[-" --type rust

Length of output: 651


Script:

#!/bin/bash
# Find the file containing DbscanCluster definition and its usage
rg -p "DbscanCluster" --type rust -A 5

# Look for any conversion functions between DbscanCluster and isize
rg -p "impl.*DbscanCluster" --type rust -A 10

Length of output: 8138

crates/augurs-clustering/README.md (1)

11-11: Documentation accurately reflects the API changes!

The README example clearly demonstrates the usage of the new DbscanCluster type.

Also applies to: 35-44

examples/clustering/examples/dbscan_clustering.rs (1)

55-66: Example correctly demonstrates the new API!

The example has been properly updated to use DbscanCluster and includes helpful debug output.

js/augurs-clustering-js/src/lib.rs (1)

47-53: Clean JS binding implementation!

The binding properly converts DbscanCluster to i32 for JS compatibility while maintaining the same semantics.

@sd2k sd2k force-pushed the custom-cluster-type branch from 12d5d73 to 618abe5 Compare January 12, 2025 13:31
Add a new `DbscanCluster` type to represent the cluster labels, which
can be either a cluster ID or a special value indicating noise. The
`Cluster` variant is now a `NonZeroU32` instead of an `isize`.

We could use a `u32` but using a `NonZeroU32` allows us to make use
of the niche optimizations that `NonZeroU32` provides and halves the
size of the type. Questionable whether it's worth it though?

Closes #229.
@sd2k sd2k force-pushed the custom-cluster-type branch from 618abe5 to 48fd9e4 Compare January 12, 2025 13:39
@sd2k
Copy link
Collaborator Author

sd2k commented Jan 12, 2025

@coderabbitai I'd be interested to hear your thoughts on whether this more explicitly typed approach is a worthwhile tradeoff. IMO it's slightly more typing for users, but more self-explanatory - there's no implicit assumption that -1 is a noise cluster.

Copy link
Contributor

coderabbitai bot commented Jan 12, 2025

Tip

For best results, initiate chat on the files or code changes.

@sd2k I agree that this is a worthwhile tradeoff. The explicit typing with DbscanCluster brings several benefits:

  1. Type safety: It prevents accidental use of invalid cluster IDs and makes the API harder to misuse
  2. Self-documenting: The distinction between noise and clusters is clear from the type system itself
  3. Memory optimization: Using NonZeroU32 for cluster IDs is more efficient

While it requires slightly more typing from users, the improved clarity and removal of implicit assumptions (like -1 being noise) make the API more maintainable and less error-prone in the long run.


✏️ Learnings added
Learnt from: sd2k
URL: https://github.com/grafana/augurs/pull/233

Timestamp: 2025-01-12T13:54:17.843Z
Learning: In the augurs clustering implementation, the `DbscanCluster` enum with `Cluster(NonZeroU32)` and `Noise` variants is preferred over using `isize` with `-1` for noise points, as it provides better type safety and self-documentation despite being more verbose.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@sd2k sd2k merged commit 867f261 into main Jan 12, 2025
24 checks passed
@sd2k sd2k deleted the custom-cluster-type branch January 12, 2025 13:55
@sd2k sd2k mentioned this pull request Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove use of isize in augurs-clustering
1 participant