Commit

Documentation improvements
zkxs committed Aug 11, 2024
1 parent 200eb69 commit 521f584
Showing 11 changed files with 73 additions and 32 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -30,7 +30,7 @@ jobs:
with:
toolchain: stable
- name: Publish line_cardinality
run: cargo publish --package line_cardinality
run: cargo publish --package line_cardinality --all-features
env:
CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN_LINE_CARDINALITY }}
- name: Publish cuniq
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion README.md
@@ -136,5 +136,5 @@ cuniq is distributed in the hope that it will be useful, but WITHOUT ANY WARRANT
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU General Public License](LICENSE) for more
details.

A full list of dependencies is available in [Cargo.toml](Cargo.toml), or a breakdown of dependencies by license can be
A full list of dependencies is available in [Cargo.toml](cuniq/Cargo.toml), or a breakdown of dependencies by license can be
generated with `cargo deny list`.
2 changes: 1 addition & 1 deletion cuniq/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "cuniq"
version = "1.0.1"
version = "1.0.2"
description = "Count unique lines"
authors.workspace = true
edition.workspace = true
4 changes: 2 additions & 2 deletions line_cardinality/Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
[package]
name = "line_cardinality"
version = "1.0.1"
version = "1.0.2"
description = "High performance line cardinality counts and estimates"
authors.workspace = true
edition.workspace = true
license.workspace = true
readme.workspace = true
readme = "README.md"
repository.workspace = true
keywords.workspace = true
categories = []
26 changes: 26 additions & 0 deletions line_cardinality/README.md
@@ -0,0 +1,26 @@
# line_cardinality

A library that provides high performance line cardinality counts and estimates, including:
- Hashing with collision detection
- Hashing **without** collision detection. Note that collisions are nearly impossible for 64-bit hashes, and this has higher performance due to not having to store lines.
- [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)

Full API documentation at [docs.rs](https://docs.rs/line_cardinality/latest/line_cardinality/).
See [PERFORMANCE.md](../PERFORMANCE.md) for performance data and technical details on the benchmarking and
profile-guided optimization that went into creating line_cardinality.

## License

line_cardinality was built primarily for use with the [cuniq](../README.md) CLI tool, and is therefore released under the
same GPL-3.0-or-later license.

line_cardinality is free software: you can redistribute it and/or modify it under the terms of the
[GNU General Public License](../LICENSE) as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

line_cardinality is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU General Public License](../LICENSE) for more
details.

A full list of dependencies is available in [Cargo.toml](Cargo.toml), or a breakdown of dependencies by license can be
generated with `cargo deny list`.
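
The difference between the first two modes listed in this README can be sketched with the standard library alone. This is not the crate's implementation: `count_exact` and `count_hash_only` are hypothetical names, and the input is assumed to be newline-delimited with no trailing newline.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashSet;
use std::hash::BuildHasher;

/// Exact count: every distinct line is kept in the set, so two lines whose
/// hashes collide are still told apart by comparing their bytes.
fn count_exact(input: &[u8]) -> usize {
    let mut seen: HashSet<&[u8]> = HashSet::new();
    for line in input.split(|&b| b == b'\n') {
        seen.insert(line);
    }
    seen.len()
}

/// Hash-only count: only a 64-bit hash per distinct line is kept. Lines are
/// never stored, which saves memory, but two different lines with the same
/// hash would be counted once.
fn count_hash_only(input: &[u8]) -> usize {
    let state = RandomState::new();
    let mut seen: HashSet<u64> = HashSet::new();
    for line in input.split(|&b| b == b'\n') {
        seen.insert(state.hash_one(line));
    }
    seen.len()
}
```

Both functions agree on small inputs; the hash-only variant trades a very small chance of undercounting for not having to keep the line data around.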
11 changes: 6 additions & 5 deletions line_cardinality/src/count_unique_impl/hashing.rs
@@ -7,7 +7,8 @@ use crate::{CountUnique, EmitLines, Increment, ReportUnique};

use super::{init_hasher_state, RandomState};

/// Runs the unique count and holds necessary state. This may be expensive to drop if it contains a large
/// Calculates the unique count and holds necessary state. Internally, a [`HashMap`] is created that
/// contains an entry for each distinct line in the input. This may be expensive to drop if it contains a large
/// amount of processed data, so using [`std::mem::forget`] may be worth considering if your application
/// will terminate immediately after finishing the unique-counting work.
///
@@ -30,12 +31,12 @@ impl<T> Default for HashingLineCounter<T, ()> {

/// Constructors that do not take a custom line mapper
impl<T> HashingLineCounter<T, ()> {
/// Creates a new count_unique_impl.
/// Creates a new [`HashingLineCounter`].
pub fn new() -> Self {
Self::with_capacity(0)
}

/// Creates a new count_unique_impl with a cardinality hint of `capacity`.
/// Creates a new [`HashingLineCounter`] with a cardinality hint of `capacity`.
///
/// Note that it is best to leave `capacity` unset unless you have a near-perfect idea of your
/// data's cardinality lower bound, as it is extremely difficult to gain performance by setting
@@ -55,13 +56,13 @@ impl<T, M> HashingLineCounter<T, M>
where
M: for<'a> FnMut(&'a [u8], &'a mut Vec<u8>) -> &'a [u8],
{
/// Creates a new count_unique_impl with a custom `line_mapper` function which will be applied to
/// Creates a new [`HashingLineCounter`] with a custom `line_mapper` function which will be applied to
/// each read line before counting.
pub fn with_line_mapper(line_mapper: M) -> Self {
Self::with_line_mapper_and_capacity(line_mapper, 0)
}

/// Creates a new count_unique_impl with a cardinality hint of `capacity` and a custom
/// Creates a new [`HashingLineCounter`] with a cardinality hint of `capacity` and a custom
/// `line_mapper` function which will be applied to each read line before counting.
///
/// Note that it is best to leave `capacity` unset unless you have a near-perfect idea of your
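
The `std::mem::forget` advice in the doc comment above can be illustrated with a plain `HashMap`. This is a sketch, not code from the crate: `count_and_leak` is a hypothetical helper, and it only makes sense when the process exits right after counting, because the map's memory is deliberately never freed.

```rust
use std::collections::HashMap;

/// Counts distinct lines, then skips the map's destructor.
/// Assumption: the caller terminates soon afterwards, so the operating
/// system reclaims the memory instead of the allocator freeing each entry.
fn count_and_leak(lines: &[&str]) -> usize {
    let mut counts: HashMap<Vec<u8>, u64> = HashMap::new();
    for line in lines {
        *counts.entry(line.as_bytes().to_vec()).or_insert(0) += 1;
    }
    let unique = counts.len();
    // Dropping a map with many owned keys frees every key individually;
    // forgetting it avoids that work at the cost of leaking the memory.
    std::mem::forget(counts);
    unique
}
```

In a long-running program the same call would be a memory leak, which is why the doc comment limits the suggestion to applications that terminate immediately.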
19 changes: 14 additions & 5 deletions line_cardinality/src/count_unique_impl/hashing_inexact.rs
@@ -7,10 +7,19 @@ use std::hash::BuildHasher;
use hashbrown::HashTable;

use crate::count_unique_impl::init_hasher_state;
use crate::CountUnique;
use crate::{CountUnique, EmitLines};

use super::RandomState;

/// Calculates the unique count and holds necessary state. Internally, a [`HashTable`] is created that
/// contains an entry for each distinct line in the input. This may be expensive to drop if it contains a large
/// amount of processed data, so using [`std::mem::forget`] may be worth considering if your application
/// will terminate immediately after finishing the unique-counting work.
///
/// This implementation also accepts a customizable `line_mapper` function with
/// [`InexactHashingLineCounter::with_line_mapper`]. If provided, this function will be applied to each
/// line before checking if it is unique or not. Note that this also affects the output that will be
/// seen from functions that enumerate internal state, such as [`EmitLines::for_each_line`].
pub struct InexactHashingLineCounter<M>
where
{
@@ -29,12 +38,12 @@ impl Default for InexactHashingLineCounter<()> {

/// Constructors that do not take a custom line mapper
impl InexactHashingLineCounter<()> {
/// Creates a new count_unique_impl.
/// Creates a new [`InexactHashingLineCounter`].
pub fn new() -> Self {
Self::with_capacity(0)
}

/// Creates a new count_unique_impl with a cardinality hint of `capacity`.
/// Creates a new [`InexactHashingLineCounter`] with a cardinality hint of `capacity`.
///
/// Note that it is best to leave `capacity` unset unless you have a near-perfect idea of your
/// data's cardinality lower bound, as it is extremely difficult to gain performance by setting
@@ -55,13 +64,13 @@ impl<M> InexactHashingLineCounter<M>
where
M: for<'a> FnMut(&'a [u8], &'a mut Vec<u8>) -> &'a [u8],
{
/// Creates a new count_unique_impl with a custom `line_mapper` function which will be applied to
/// Creates a new [`InexactHashingLineCounter`] with a custom `line_mapper` function which will be applied to
/// each read line before counting.
pub fn with_line_mapper(line_mapper: M) -> Self {
Self::with_line_mapper_and_capacity(line_mapper, 0)
}

/// Creates a new count_unique_impl with a cardinality hint of `capacity` and a custom
/// Creates a new [`InexactHashingLineCounter`] with a cardinality hint of `capacity` and a custom
/// `line_mapper` function which will be applied to each read line before counting.
///
/// Note that it is best to leave `capacity` unset unless you have a near-perfect idea of your
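
The `line_mapper` bound shown above, `for<'a> FnMut(&'a [u8], &'a mut Vec<u8>) -> &'a [u8]`, lets a mapper either return a sub-slice of the input line or write a transformed copy into the scratch buffer and return that. The sketch below uses the same bound in a standalone function; `ascii_lowercase`, `identity`, and `count_unique_mapped` are hypothetical names and are not part of the crate.

```rust
use std::collections::HashSet;

/// Mapper that writes a lowercased copy of the line into the scratch buffer.
fn ascii_lowercase<'a>(line: &'a [u8], scratch: &'a mut Vec<u8>) -> &'a [u8] {
    scratch.clear();
    scratch.extend(line.iter().map(|b| b.to_ascii_lowercase()));
    scratch.as_slice()
}

/// Mapper that returns the line unchanged and ignores the scratch buffer.
fn identity<'a>(line: &'a [u8], _scratch: &'a mut Vec<u8>) -> &'a [u8] {
    line
}

/// Counts distinct lines after applying `line_mapper` to each one.
/// Assumption: input is newline-delimited with no trailing newline.
fn count_unique_mapped<M>(input: &[u8], mut line_mapper: M) -> usize
where
    M: for<'a> FnMut(&'a [u8], &'a mut Vec<u8>) -> &'a [u8],
{
    let mut scratch = Vec::new();
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    for line in input.split(|&b| b == b'\n') {
        let mapped = line_mapper(line, &mut scratch);
        // Only allocate an owned copy when the mapped line is new.
        if !seen.contains(mapped) {
            seen.insert(mapped.to_vec());
        }
    }
    seen.len()
}
```

With the lowercase mapper, `Foo` and `foo` count as one line; this also shows why any output that enumerates stored lines would display the mapped form rather than the original.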
28 changes: 14 additions & 14 deletions line_cardinality/src/count_unique_impl/hyperloglog.rs
@@ -5,14 +5,22 @@ use std::f64::consts::E;
#[cfg(not(feature = "ahash"))]
use std::hash::BuildHasher;

use crate::{CountUnique, Error};
use crate::{CountUnique, EmitLines, Error};

use super::{init_hasher_state, RandomState};

type Hash = u64;

const DEFAULT_SIZE: usize = 65536;

/// Estimates the unique count and holds necessary state. The estimate is performed using
/// [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog), a state-of-the-art cardinality
/// approximation algorithm. This uses constant memory.
///
/// This implementation also accepts a customizable `line_mapper` function with
/// [`HyperLogLog::with_line_mapper`]. If provided, this function will be applied to each
/// line before checking if it is unique or not. Note that this also affects the output that will be
/// seen from functions that enumerate internal state, such as [`EmitLines::for_each_line`].
pub struct HyperLogLog<M> {
random_state: RandomState,
size: usize,
@@ -58,16 +66,12 @@ impl Default for HyperLogLog<()> {

/// Constructors that do not take a custom line mapper
impl HyperLogLog<()> {
/// Creates a new count_unique_impl.
/// Creates a new [`HyperLogLog`] with 65536 bytes of memory used to store state.
pub fn new() -> Self {
Self::with_capacity(DEFAULT_SIZE).unwrap()
}

/// Creates a new count_unique_impl with a cardinality hint of `capacity`.
///
/// Note that it is best to leave `capacity` unset unless you have a near-perfect idea of your
/// data's cardinality lower bound, as it is extremely difficult to gain performance by setting
/// it, but extremely easy to lose performance.
/// Creates a new [`HyperLogLog`] with `size` bytes of memory used to store state.
pub fn with_capacity(size: usize) -> Result<Self, Error> {
let SizeInfo { bits, shift_bits, mask } = check_size(size)?;
Ok(HyperLogLog {
@@ -88,18 +92,14 @@ impl<M> HyperLogLog<M>
where
M: for<'a> FnMut(&'a [u8], &'a mut Vec<u8>) -> &'a [u8],
{
/// Creates a new count_unique_impl with a custom `line_mapper` function which will be applied to
/// each read line before counting.
/// Creates a new [`HyperLogLog`] with 65536 bytes of memory used to store state and a custom
/// `line_mapper` function which will be applied to each read line before counting.
pub fn with_line_mapper(line_mapper: M) -> Self {
Self::with_line_mapper_and_capacity(line_mapper, DEFAULT_SIZE).unwrap()
}

/// Creates a new count_unique_impl with a cardinality hint of `capacity` and a custom
/// Creates a new [`HyperLogLog`] with `size` bytes of memory used to store state and a custom
/// `line_mapper` function which will be applied to each read line before counting.
///
/// Note that it is best to leave `capacity` unset unless you have a near-perfect idea of your
/// data's cardinality lower bound, as it is extremely difficult to gain performance by setting
/// it, but extremely easy to lose performance.
pub fn with_line_mapper_and_capacity(line_mapper: M, size: usize) -> Result<Self, Error> {
let SizeInfo { bits, shift_bits, mask } = check_size(size)?;
Ok(HyperLogLog {
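
For readers unfamiliar with HyperLogLog, the following is a minimal textbook sketch of the algorithm, with one byte of state per register so that memory stays constant regardless of input size. It is not the crate's implementation: `HllSketch`, its `index_bits` parameter, the use of `DefaultHasher`, and the small-range correction are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Textbook HyperLogLog with 2^index_bits one-byte registers.
struct HllSketch {
    index_bits: u32,
    registers: Vec<u8>,
}

impl HllSketch {
    fn new(index_bits: u32) -> Self {
        // The alpha approximation used in `estimate` assumes at least 128 registers.
        assert!((7..=16).contains(&index_bits));
        HllSketch { index_bits, registers: vec![0; 1usize << index_bits] }
    }

    fn insert(&mut self, line: &[u8]) {
        let mut hasher = DefaultHasher::new();
        line.hash(&mut hasher);
        let hash = hasher.finish();
        // The top bits select a register; the remaining bits determine the rank.
        let index = (hash >> (64 - self.index_bits)) as usize;
        let rest = hash << self.index_bits;
        // Rank is the position of the first 1 bit in the remaining bits.
        let rank = (rest.leading_zeros() + 1).min(64 - self.index_bits + 1) as u8;
        if rank > self.registers[index] {
            self.registers[index] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-i32::from(r))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        // Small-range correction: fall back to linear counting while some
        // registers are still empty and the raw estimate is small.
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

Inserting the same line again never changes a register, so duplicates do not move the estimate. With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%.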
1 change: 1 addition & 0 deletions line_cardinality/src/count_unique_impl/result.rs
@@ -13,6 +13,7 @@ enum Message {
Static(&'static str),
}

/// Contains the cause of an [`Error`]
#[derive(Debug)]
pub enum Cause {
/// IO error
6 changes: 5 additions & 1 deletion line_cardinality/src/lib.rs
@@ -1,11 +1,15 @@
// This file is part of line_cardinality. Copyright © 2024 line_cardinality contributors.
// line_cardinality is licensed under the GNU GPL v3.0 or any later version. See LICENSE file for full text.

//! line_cardinality provides utilities to count unique lines from input data. It can read from a
//! line_cardinality provides utilities to count or estimate unique lines from input data. It can read from a
//! [`BufRead`] (such as stdin) or a file using optimized file reading functions.
//!
//! Note line_cardinality only supports newline (`\n`) delimited input and does not perform any
//! UTF-8 validation: all lines are compared by byte value alone.
//!
//! Examples of counting total distinct lines can be found in [`CountUnique`].
//!
//! Examples of reporting occurrences of each distinct line can be found in [`ReportUnique`].
#![feature(hash_raw_entry)]
