refactor(clustering)!: use new DbscanCluster type instead of isize #233

sd2k · 2025-01-12T13:13:27Z

Add a new DbscanCluster type to represent the cluster labels, which
can be either a cluster ID or a special value indicating noise. The
Cluster variant is now a NonZeroU32 instead of an isize.

We could use a u32 but using a NonZeroU32 allows us to make use
of the niche optimizations that NonZeroU32 provides and halves the
size of the type. Questionable whether it's worth it though?

Closes #229.

Summary by CodeRabbit

New Features
- Introduced DbscanCluster enum to provide a more structured representation of clustering results.
- Added methods to manage cluster identification and noise detection.
Bug Fixes
- Improved cluster representation to ensure more robust and type-safe clustering output.
Refactor
- Updated clustering implementation to use new DbscanCluster type across multiple components.
- Modified return types and assertions to work with the new cluster representation.
- Enhanced test structure for clarity and maintainability.

coderabbitai · 2025-01-12T13:13:34Z

Warning

Rate limit exceeded

@sd2k has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 4 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 12d5d73 and 48fd9e4.

📒 Files selected for processing (9)

book/book.toml (1 hunks)
book/src/tutorials/clustering.md (2 hunks)
crates/augurs-clustering/README.md (2 hunks)
crates/augurs-clustering/src/lib.rs (7 hunks)
crates/augurs/tests/integration.rs (2 hunks)
crates/pyaugurs/src/clustering.rs (1 hunks)
examples/clustering/examples/dbscan_clustering.rs (2 hunks)
js/augurs-clustering-js/src/lib.rs (1 hunks)
js/testpkg/clustering.test.ts (3 hunks)

Walkthrough

The pull request introduces a new DbscanCluster enum to replace the previous isize-based cluster representation in the augurs-clustering library. This change enhances the type safety and clarity of cluster assignments by providing a more structured approach to representing clustering results. The modification spans multiple files, updating the implementation, tests, Python bindings, and example code to use the new DbscanCluster type, which can represent either a specific cluster or noise points.

Changes

File	Change Summary
`crates/augurs-clustering/README.md`	Updated to include `DbscanCluster` in usage example
`crates/augurs-clustering/src/lib.rs`	Added `DbscanCluster` enum with methods like `is_noise()`, `is_cluster()`, `as_i32()` and modified `fit()` method to return `Vec<DbscanCluster>`
`crates/augurs/tests/integration.rs`	Added `convert_clusters()` helper function to convert `DbscanCluster` to `i32` for testing
`crates/pyaugurs/src/clustering.rs`	Updated `fit()` method return type and implementation to use `DbscanCluster`
`examples/clustering/examples/dbscan_clustering.rs`	Modified cluster assertions and added output to use `DbscanCluster`
`js/augurs-clustering-js/src/lib.rs`	Updated `fit()` method return type and implementation to use `i32` instead of `isize`

Assessment against linked issues

Objective	Addressed	Explanation
Remove use of `isize` [#229]	✅

Possibly related PRs

feat: add augurs-clustering crate with DBSCAN algorithm #100: The changes in the main PR involve the addition of the DbscanCluster entity and modifications to the DBSCAN algorithm, which are directly related to the augurs-clustering crate introduced in this PR.
fix: use more accurate series count to avoid panic #101: This PR also relates to the DBSCAN algorithm, specifically enhancing the handling of series data within the DBSCANDetector, which is relevant to the clustering functionality in the main PR.
fix!: add serde derives for more types #112: The changes in this PR include updates to the OutlierOutput and OutlierIntervals structures, which may interact with clustering results, particularly in the context of outlier detection using clustering methods.

Suggested reviewers

edwardcqian
gitdoluquita

Poem

🐰 Clustering magic, a new design so bright,
No more isize, our types now take flight!
DbscanCluster emerges, with structure so clean,
From noise to clusters, a rabbit's dream scene!
Code transforms with elegance and might! 🚀

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

crates/augurs-clustering/src/lib.rs (3)
42-47: Improve panic message in increment method.

The panic message "cluster ID overflow" could be more descriptive to help with debugging.
-            Self::Cluster(id) => *id = id.checked_add(1).expect("cluster ID overflow"),
+            Self::Cluster(id) => *id = id.checked_add(1).expect("maximum cluster ID (4,294,967,295) reached"),
50-60: Document test-only PartialEq implementation.

Add documentation to explain why this implementation exists and its test-only purpose.
 // Simplify tests by allowing comparisons with i32.
+// This implementation allows direct comparison with legacy test vectors that use i32,
+// where -1 represents noise and positive numbers represent cluster IDs.
 #[cfg(test)]
 impl PartialEq<i32> for DbscanCluster {
209-238: Consider using a more maintainable format for large test vectors.

The large test vector could be more maintainable if stored in a separate test data file or formatted using a more readable pattern.

Consider one of these approaches:

Move the vector to a separate test data file (preferred)

Use a more compact format with line breaks at logical boundaries

Generate the vector programmatically if it follows a pattern

Example of approach 2:
let expected = vec![
    // Cluster 1 (main cluster)
    &[1; 34][..],
    // Cluster 2
    &[2; 3][..],
    // Back to cluster 1
    &[1; 456][..],
    // Mixed section
    3, -1, 3, -1, 3,
    // Remaining cluster 1
    &[1; 283][..],
].concat();
examples/clustering/examples/dbscan_clustering.rs (1)
57-66: Enhance example documentation for the new cluster type.

While the example correctly demonstrates the usage, it would be more helpful to add comments explaining the new cluster type representation.
     println!("Clusters: {:?}", clusters);
     assert_eq!(
         clusters,
         vec![
+            // Cluster ID 1 (using NonZeroU32 internally)
             DbscanCluster::Cluster(1.try_into().unwrap()),
             DbscanCluster::Cluster(1.try_into().unwrap()),
+            // Cluster ID 2
             DbscanCluster::Cluster(2.try_into().unwrap()),
             DbscanCluster::Cluster(2.try_into().unwrap()),
+            // Points not belonging to any cluster
             DbscanCluster::Noise,
         ]
     );
crates/augurs-clustering/README.md (1)
35-44: Enhance example readability and coverage.

While the example correctly demonstrates the new DbscanCluster type, consider these improvements:

Add an example of noise points to demonstrate both variants of DbscanCluster

Consider extracting the try_into().unwrap() into a helper function or const to reduce repetition

Add a comment explaining why NonZeroU32 is used for cluster IDs

Example improvement:
// Cluster IDs use NonZeroU32 for memory optimization
let cluster1: NonZeroU32 = 1.try_into().unwrap();
let cluster2: NonZeroU32 = 2.try_into().unwrap();

assert_eq!(
    clusters,
    vec![
        DbscanCluster::Cluster(cluster1),
        DbscanCluster::Cluster(cluster1),
        DbscanCluster::Cluster(cluster1),
        DbscanCluster::Cluster(cluster2),
        DbscanCluster::Cluster(cluster2),
    ],
);

// Example with noise points
let clusters_with_noise = DbscanClusterer::new(0.1, 3).fit(&distance_matrix);
assert!(clusters_with_noise.iter().any(|c| matches!(c, DbscanCluster::Noise)));

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e098590 and 5ee35f3.

📒 Files selected for processing (5)

crates/augurs-clustering/README.md (2 hunks)
crates/augurs-clustering/src/lib.rs (7 hunks)
crates/augurs/tests/integration.rs (2 hunks)
crates/pyaugurs/src/clustering.rs (1 hunks)
examples/clustering/examples/dbscan_clustering.rs (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (20)

GitHub Check: windows (windows-latest, x86)
GitHub Check: linux (ubuntu-22.04, ppc64le)
GitHub Check: windows (windows-latest, x64)
GitHub Check: musllinux (ubuntu-22.04, armv7)
GitHub Check: linux (ubuntu-22.04, s390x)
GitHub Check: musllinux (ubuntu-22.04, aarch64)
GitHub Check: linux (ubuntu-22.04, armv7)
GitHub Check: musllinux (ubuntu-22.04, x86)
GitHub Check: linux (ubuntu-22.04, aarch64)
GitHub Check: macos (macos-14, aarch64)
GitHub Check: musllinux (ubuntu-22.04, x86_64)
GitHub Check: Tests
GitHub Check: linux (ubuntu-22.04, x86)
GitHub Check: Test Book
GitHub Check: macos (macos-13, x86_64)
GitHub Check: linux (ubuntu-22.04, x86_64)
GitHub Check: Clippy
GitHub Check: Prophet WASMStan component
GitHub Check: Python tests
GitHub Check: Check

🔇 Additional comments (4)

crates/pyaugurs/src/clustering.rs (1)

85-94: LGTM! Clean Python binding implementation.

The conversion to i32 is appropriate for Python/numpy compatibility, and the implementation correctly maps the new DbscanCluster type.

crates/augurs/tests/integration.rs (1)

21-23: LGTM! Well-structured test adaptation.

The convert_clusters helper function provides a clean way to adapt the tests to the new type system while maintaining readability.

crates/augurs-clustering/README.md (2)

11-11: LGTM! Import statement updated to reflect new type.

The addition of DbscanCluster to the imports clearly shows users the available types and aligns with the new type system.

35-44: Verify example correctness with the actual implementation.

Let's ensure the example matches the actual implementation and test cases.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

crates/augurs-clustering/src/lib.rs (1)
209-238: Consider moving the large test vector to a separate file.

The large test vector makes the test file harder to maintain. Consider moving it to a separate test data file, similar to how dist.csv is handled.
-        let expected = vec![
-            1, 1, 1, 1, /* ... many lines ... */ 1, 1, 1, 1,
-        ];
+        let expected: Vec<i32> = include_str!("../data/expected_clusters.txt")
+            .split(',')
+            .map(|s| s.trim().parse().unwrap())
+            .collect();

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ee35f3 and 12d5d73.

📒 Files selected for processing (6)

crates/augurs-clustering/README.md (2 hunks)
crates/augurs-clustering/src/lib.rs (7 hunks)
crates/augurs/tests/integration.rs (2 hunks)
crates/pyaugurs/src/clustering.rs (1 hunks)
examples/clustering/examples/dbscan_clustering.rs (2 hunks)
js/augurs-clustering-js/src/lib.rs (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

crates/augurs/tests/integration.rs
crates/pyaugurs/src/clustering.rs

⏰ Context from checks skipped due to timeout of 90000ms (20)

GitHub Check: linux (ubuntu-22.04, ppc64le)
GitHub Check: windows (windows-latest, x86)
GitHub Check: linux (ubuntu-22.04, s390x)
GitHub Check: musllinux (ubuntu-22.04, armv7)
GitHub Check: windows (windows-latest, x64)
GitHub Check: musllinux (ubuntu-22.04, aarch64)
GitHub Check: linux (ubuntu-22.04, armv7)
GitHub Check: musllinux (ubuntu-22.04, x86)
GitHub Check: linux (ubuntu-22.04, aarch64)
GitHub Check: musllinux (ubuntu-22.04, x86_64)
GitHub Check: Test Book
GitHub Check: linux (ubuntu-22.04, x86)
GitHub Check: Tests
GitHub Check: macos (macos-14, aarch64)
GitHub Check: linux (ubuntu-22.04, x86_64)
GitHub Check: Clippy
GitHub Check: JS tests
GitHub Check: Check
GitHub Check: Prophet WASMStan component
GitHub Check: macos (macos-13, x86_64)

🔇 Additional comments (6)

crates/augurs-clustering/src/lib.rs (3)

7-48: Well-designed type implementation!

The DbscanCluster enum with NonZeroU32 leverages niche optimization, reducing the enum size. The implementation includes comprehensive helper methods and proper documentation.

50-60: Good test helper implementation!

The PartialEq<i32> implementation is appropriately scoped to tests and handles all cases correctly.

Line range hint 98-146: Verify all callers handle the new return type.

The implementation correctly uses the new DbscanCluster type. Let's verify that all callers are updated to handle the new return type.

✅ Verification successful

All callers properly handle the new DbscanCluster type

The change is safe as backward compatibility is maintained through:

PartialEq implementation allowing direct comparisons with integers

Proper conversion methods between old and new representations

Tests and examples showing correct usage of both styles
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Find all direct callers of the fit method to verify they handle the new return type

# Search for direct calls to fit
rg -p "\.fit$&.*$" --type rust

# Search for any remaining isize type annotations that might need updating
rg -p ": Vec<isize>" --type rust
Length of output: 2819

Script:
#!/bin/bash
# Search for usage patterns of the fit return value
rg -p "let .* = .*\.fit$&.*$" -A 3 --type rust

# Search for any pattern matching or type annotations related to clusters
ast-grep --pattern 'match $_ {
  DbscanCluster::$$ => $$$
}'
Length of output: 4510

Script:
#!/bin/bash
# Find DbscanCluster type definition
ast-grep --pattern 'enum DbscanCluster {
  $$$
}'

# Find convert_clusters implementation
ast-grep --pattern 'fn convert_clusters($_) -> $$ {
  $$$
}'

# Find any direct comparisons with Vec<isize>
rg "assert_eq.*vec!\[-" --type rust
Length of output: 651

Script:
#!/bin/bash
# Find the file containing DbscanCluster definition and its usage
rg -p "DbscanCluster" --type rust -A 5

# Look for any conversion functions between DbscanCluster and isize
rg -p "impl.*DbscanCluster" --type rust -A 10
Length of output: 8138
crates/augurs-clustering/README.md (1)

11-11: Documentation accurately reflects the API changes!

The README example clearly demonstrates the usage of the new DbscanCluster type.

Also applies to: 35-44

examples/clustering/examples/dbscan_clustering.rs (1)

55-66: Example correctly demonstrates the new API!

The example has been properly updated to use DbscanCluster and includes helpful debug output.

js/augurs-clustering-js/src/lib.rs (1)

47-53: Clean JS binding implementation!

The binding properly converts DbscanCluster to i32 for JS compatibility while maintaining the same semantics.

Add a new `DbscanCluster` type to represent the cluster labels, which can be either a cluster ID or a special value indicating noise. The `Cluster` variant is now a `NonZeroU32` instead of an `isize`. We could use a `u32` but using a `NonZeroU32` allows us to make use of the niche optimizations that `NonZeroU32` provides and halves the size of the type. Questionable whether it's worth it though? Closes #229.

sd2k · 2025-01-12T13:53:57Z

@coderabbitai I'd be interested to hear your thoughts on whether this more explicitly typed approach is a worthwhile tradeoff. IMO it's slightly more typing for users, but more self-explanatory - there's no implicit assumption that -1 is a noise cluster.

coderabbitai · 2025-01-12T13:54:19Z

Tip

For best results, initiate chat on the files or code changes.

@sd2k I agree that this is a worthwhile tradeoff. The explicit typing with DbscanCluster brings several benefits:

Type safety: It prevents accidental use of invalid cluster IDs and makes the API harder to misuse
Self-documenting: The distinction between noise and clusters is clear from the type system itself
Memory optimization: Using NonZeroU32 for cluster IDs is more efficient

While it requires slightly more typing from users, the improved clarity and removal of implicit assumptions (like -1 being noise) make the API more maintainable and less error-prone in the long run.

✏️ Learnings added

Learnt from: sd2k
URL: https://github.com/grafana/augurs/pull/233

Timestamp: 2025-01-12T13:54:17.843Z
Learning: In the augurs clustering implementation, the `DbscanCluster` enum with `Cluster(NonZeroU32)` and `Noise` variants is preferred over using `isize` with `-1` for noise points, as it provides better type safety and self-documentation despite being more verbose.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai bot reviewed Jan 12, 2025

View reviewed changes

sd2k force-pushed the custom-cluster-type branch from 5ee35f3 to 12d5d73 Compare January 12, 2025 13:27

coderabbitai bot reviewed Jan 12, 2025

View reviewed changes

sd2k force-pushed the custom-cluster-type branch from 12d5d73 to 618abe5 Compare January 12, 2025 13:31

sd2k force-pushed the custom-cluster-type branch from 618abe5 to 48fd9e4 Compare January 12, 2025 13:39

sd2k merged commit 867f261 into main Jan 12, 2025
24 checks passed

sd2k deleted the custom-cluster-type branch January 12, 2025 13:55

sd2k mentioned this pull request Jan 11, 2025

chore: release v0.9.0 #232

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(clustering)!: use new DbscanCluster type instead of isize #233

refactor(clustering)!: use new DbscanCluster type instead of isize #233

sd2k commented Jan 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 12, 2025 •

edited

Loading

Rate limit exceeded

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

sd2k commented Jan 12, 2025

coderabbitai bot commented Jan 12, 2025

refactor(clustering)!: use new DbscanCluster type instead of isize #233

refactor(clustering)!: use new DbscanCluster type instead of isize #233

Conversation

sd2k commented Jan 12, 2025 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Jan 12, 2025 • edited Loading

Rate limit exceeded

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

sd2k commented Jan 12, 2025

coderabbitai bot commented Jan 12, 2025

sd2k commented Jan 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 12, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)