-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(clustering)!: use new DbscanCluster type instead of isize #233
Conversation
Warning Rate limit exceeded@sd2k has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 4 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (9)
WalkthroughThe pull request introduces a new Changes
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
crates/augurs-clustering/src/lib.rs (3)
42-47
: Improve panic message in increment method.The panic message "cluster ID overflow" could be more descriptive to help with debugging.
- Self::Cluster(id) => *id = id.checked_add(1).expect("cluster ID overflow"), + Self::Cluster(id) => *id = id.checked_add(1).expect("maximum cluster ID (4,294,967,295) reached"),
50-60
: Document test-only PartialEq implementation.Add documentation to explain why this implementation exists and its test-only purpose.
// Simplify tests by allowing comparisons with i32. +// This implementation allows direct comparison with legacy test vectors that use i32, +// where -1 represents noise and positive numbers represent cluster IDs. #[cfg(test)] impl PartialEq<i32> for DbscanCluster {
209-238
: Consider using a more maintainable format for large test vectors.The large test vector could be more maintainable if stored in a separate test data file or formatted using a more readable pattern.
Consider one of these approaches:
- Move the vector to a separate test data file (preferred)
- Use a more compact format with line breaks at logical boundaries
- Generate the vector programmatically if it follows a pattern
Example of approach 2:
let expected = vec![ // Cluster 1 (main cluster) &[1; 34][..], // Cluster 2 &[2; 3][..], // Back to cluster 1 &[1; 456][..], // Mixed section 3, -1, 3, -1, 3, // Remaining cluster 1 &[1; 283][..], ].concat();examples/clustering/examples/dbscan_clustering.rs (1)
57-66
: Enhance example documentation for the new cluster type.While the example correctly demonstrates the usage, it would be more helpful to add comments explaining the new cluster type representation.
println!("Clusters: {:?}", clusters); assert_eq!( clusters, vec![ + // Cluster ID 1 (using NonZeroU32 internally) DbscanCluster::Cluster(1.try_into().unwrap()), DbscanCluster::Cluster(1.try_into().unwrap()), + // Cluster ID 2 DbscanCluster::Cluster(2.try_into().unwrap()), DbscanCluster::Cluster(2.try_into().unwrap()), + // Points not belonging to any cluster DbscanCluster::Noise, ] );crates/augurs-clustering/README.md (1)
35-44
: Enhance example readability and coverage.While the example correctly demonstrates the new
DbscanCluster
type, consider these improvements:
- Add an example of noise points to demonstrate both variants of
DbscanCluster
- Consider extracting the
try_into().unwrap()
into a helper function or const to reduce repetition- Add a comment explaining why
NonZeroU32
is used for cluster IDsExample improvement:
// Cluster IDs use NonZeroU32 for memory optimization let cluster1: NonZeroU32 = 1.try_into().unwrap(); let cluster2: NonZeroU32 = 2.try_into().unwrap(); assert_eq!( clusters, vec![ DbscanCluster::Cluster(cluster1), DbscanCluster::Cluster(cluster1), DbscanCluster::Cluster(cluster1), DbscanCluster::Cluster(cluster2), DbscanCluster::Cluster(cluster2), ], ); // Example with noise points let clusters_with_noise = DbscanClusterer::new(0.1, 3).fit(&distance_matrix); assert!(clusters_with_noise.iter().any(|c| matches!(c, DbscanCluster::Noise)));
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
crates/augurs-clustering/README.md
(2 hunks)crates/augurs-clustering/src/lib.rs
(7 hunks)crates/augurs/tests/integration.rs
(2 hunks)crates/pyaugurs/src/clustering.rs
(1 hunks)examples/clustering/examples/dbscan_clustering.rs
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: windows (windows-latest, x86)
- GitHub Check: linux (ubuntu-22.04, ppc64le)
- GitHub Check: windows (windows-latest, x64)
- GitHub Check: musllinux (ubuntu-22.04, armv7)
- GitHub Check: linux (ubuntu-22.04, s390x)
- GitHub Check: musllinux (ubuntu-22.04, aarch64)
- GitHub Check: linux (ubuntu-22.04, armv7)
- GitHub Check: musllinux (ubuntu-22.04, x86)
- GitHub Check: linux (ubuntu-22.04, aarch64)
- GitHub Check: macos (macos-14, aarch64)
- GitHub Check: musllinux (ubuntu-22.04, x86_64)
- GitHub Check: Tests
- GitHub Check: linux (ubuntu-22.04, x86)
- GitHub Check: Test Book
- GitHub Check: macos (macos-13, x86_64)
- GitHub Check: linux (ubuntu-22.04, x86_64)
- GitHub Check: Clippy
- GitHub Check: Prophet WASMStan component
- GitHub Check: Python tests
- GitHub Check: Check
🔇 Additional comments (4)
crates/pyaugurs/src/clustering.rs (1)
85-94
: LGTM! Clean Python binding implementation.The conversion to
i32
is appropriate for Python/numpy compatibility, and the implementation correctly maps the newDbscanCluster
type.crates/augurs/tests/integration.rs (1)
21-23
: LGTM! Well-structured test adaptation.The
convert_clusters
helper function provides a clean way to adapt the tests to the new type system while maintaining readability.crates/augurs-clustering/README.md (2)
11-11
: LGTM! Import statement updated to reflect new type.The addition of
DbscanCluster
to the imports clearly shows users the available types and aligns with the new type system.
35-44
: Verify example correctness with the actual implementation.Let's ensure the example matches the actual implementation and test cases.
5ee35f3
to
12d5d73
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
crates/augurs-clustering/src/lib.rs (1)
209-238
: Consider moving the large test vector to a separate file.The large test vector makes the test file harder to maintain. Consider moving it to a separate test data file, similar to how
dist.csv
is handled.- let expected = vec![ - 1, 1, 1, 1, /* ... many lines ... */ 1, 1, 1, 1, - ]; + let expected: Vec<i32> = include_str!("../data/expected_clusters.txt") + .split(',') + .map(|s| s.trim().parse().unwrap()) + .collect();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
crates/augurs-clustering/README.md
(2 hunks)crates/augurs-clustering/src/lib.rs
(7 hunks)crates/augurs/tests/integration.rs
(2 hunks)crates/pyaugurs/src/clustering.rs
(1 hunks)examples/clustering/examples/dbscan_clustering.rs
(2 hunks)js/augurs-clustering-js/src/lib.rs
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/augurs/tests/integration.rs
- crates/pyaugurs/src/clustering.rs
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: linux (ubuntu-22.04, ppc64le)
- GitHub Check: windows (windows-latest, x86)
- GitHub Check: linux (ubuntu-22.04, s390x)
- GitHub Check: musllinux (ubuntu-22.04, armv7)
- GitHub Check: windows (windows-latest, x64)
- GitHub Check: musllinux (ubuntu-22.04, aarch64)
- GitHub Check: linux (ubuntu-22.04, armv7)
- GitHub Check: musllinux (ubuntu-22.04, x86)
- GitHub Check: linux (ubuntu-22.04, aarch64)
- GitHub Check: musllinux (ubuntu-22.04, x86_64)
- GitHub Check: Test Book
- GitHub Check: linux (ubuntu-22.04, x86)
- GitHub Check: Tests
- GitHub Check: macos (macos-14, aarch64)
- GitHub Check: linux (ubuntu-22.04, x86_64)
- GitHub Check: Clippy
- GitHub Check: JS tests
- GitHub Check: Check
- GitHub Check: Prophet WASMStan component
- GitHub Check: macos (macos-13, x86_64)
🔇 Additional comments (6)
crates/augurs-clustering/src/lib.rs (3)
7-48
: Well-designed type implementation!The
DbscanCluster
enum withNonZeroU32
leverages niche optimization, reducing the enum size. The implementation includes comprehensive helper methods and proper documentation.
50-60
: Good test helper implementation!The
PartialEq<i32>
implementation is appropriately scoped to tests and handles all cases correctly.
Line range hint
98-146
: Verify all callers handle the new return type.The implementation correctly uses the new
DbscanCluster
type. Let's verify that all callers are updated to handle the new return type.✅ Verification successful
All callers properly handle the new DbscanCluster type
The change is safe as backward compatibility is maintained through:
- PartialEq implementation allowing direct comparisons with integers
- Proper conversion methods between old and new representations
- Tests and examples showing correct usage of both styles
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Find all direct callers of the fit method to verify they handle the new return type # Search for direct calls to fit rg -p "\.fit\(&.*\)" --type rust # Search for any remaining isize type annotations that might need updating rg -p ": Vec<isize>" --type rustLength of output: 2819
Script:
#!/bin/bash # Search for usage patterns of the fit return value rg -p "let .* = .*\.fit\(&.*\)" -A 3 --type rust # Search for any pattern matching or type annotations related to clusters ast-grep --pattern 'match $_ { DbscanCluster::$$ => $$$ }'Length of output: 4510
Script:
#!/bin/bash # Find DbscanCluster type definition ast-grep --pattern 'enum DbscanCluster { $$$ }' # Find convert_clusters implementation ast-grep --pattern 'fn convert_clusters($_) -> $$ { $$$ }' # Find any direct comparisons with Vec<isize> rg "assert_eq.*vec!\[-" --type rustLength of output: 651
Script:
#!/bin/bash # Find the file containing DbscanCluster definition and its usage rg -p "DbscanCluster" --type rust -A 5 # Look for any conversion functions between DbscanCluster and isize rg -p "impl.*DbscanCluster" --type rust -A 10Length of output: 8138
crates/augurs-clustering/README.md (1)
11-11
: Documentation accurately reflects the API changes!The README example clearly demonstrates the usage of the new
DbscanCluster
type.Also applies to: 35-44
examples/clustering/examples/dbscan_clustering.rs (1)
55-66
: Example correctly demonstrates the new API!The example has been properly updated to use
DbscanCluster
and includes helpful debug output.js/augurs-clustering-js/src/lib.rs (1)
47-53
: Clean JS binding implementation!The binding properly converts
DbscanCluster
toi32
for JS compatibility while maintaining the same semantics.
12d5d73
to
618abe5
Compare
Add a new `DbscanCluster` type to represent the cluster labels, which can be either a cluster ID or a special value indicating noise. The `Cluster` variant is now a `NonZeroU32` instead of an `isize`. We could use a `u32` but using a `NonZeroU32` allows us to make use of the niche optimizations that `NonZeroU32` provides and halves the size of the type. Questionable whether it's worth it though? Closes #229.
618abe5
to
48fd9e4
Compare
@coderabbitai I'd be interested to hear your thoughts on whether this more explicitly typed approach is a worthwhile tradeoff. IMO it's slightly more typing for users, but more self-explanatory - there's no implicit assumption that |
Tip For best results, initiate chat on the files or code changes. @sd2k I agree that this is a worthwhile tradeoff. The explicit typing with
While it requires slightly more typing from users, the improved clarity and removal of implicit assumptions (like ✏️ Learnings added
|
Add a new
DbscanCluster
type to represent the cluster labels, whichcan be either a cluster ID or a special value indicating noise. The
Cluster
variant is now aNonZeroU32
instead of anisize
.We could use a
u32
but using aNonZeroU32
allows us to make useof the niche optimizations that
NonZeroU32
provides and halves thesize of the type. Questionable whether it's worth it though?
Closes #229.
Summary by CodeRabbit
New Features
DbscanCluster
enum to provide a more structured representation of clustering results.Bug Fixes
Refactor
DbscanCluster
type across multiple components.