
Compression support #79

Merged 136 commits on Dec 13, 2024

Commits
42c046e
start a huffman encder
KillingSpark Oct 7, 2024
a0d2031
build huffman table according to spec
KillingSpark Oct 8, 2024
246e609
calc max num bits from weights
KillingSpark Oct 8, 2024
35833ba
create valid weight distributions
KillingSpark Oct 10, 2024
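The weight rules these commits implement come from the zstd spec's huff0 description: a used symbol with weight w contributes 2^(w-1), a valid distribution sums to a power of two, and a symbol's code length is max_bits + 1 - w. A minimal sketch of deriving max_bits from a weight list (based on the spec, not this PR's actual code):

```rust
/// Returns the longest code length implied by a huff0 weight distribution,
/// or None if the distribution is invalid. Symbols with weight 0 are unused.
fn max_num_bits(weights: &[u8]) -> Option<u32> {
    let sum: u64 = weights
        .iter()
        .filter(|&&w| w > 0)
        .map(|&w| 1u64 << (w - 1))
        .sum();
    // A valid weight distribution sums to a power of two;
    // its log2 is the maximum number of bits any code uses.
    if sum.is_power_of_two() {
        Some(sum.trailing_zeros())
    } else {
        None
    }
}
```

For example, weights [2, 1, 1] sum to 2 + 1 + 1 = 4 = 2^2, so the longest code is 2 bits (the weight-2 symbol gets a 1-bit code, the others 2-bit codes), which satisfies Kraft's inequality exactly.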
7f5ac83
implement redistributing weights to fit a given maximum weight
KillingSpark Oct 10, 2024
05788ac
fix chosen weight to redistribute
KillingSpark Oct 10, 2024
48efed9
fix huffman code distribution
KillingSpark Oct 11, 2024
bad020d
build huffman table for counts
KillingSpark Oct 11, 2024
0e29503
allow building from a slice of data
KillingSpark Oct 11, 2024
39e612d
start actual huffman encoder
KillingSpark Oct 11, 2024
190f08c
test for prefix free codes, shows it's still broken
KillingSpark Oct 11, 2024
a7cd816
fix broken prefix allocation
KillingSpark Oct 11, 2024
e799da7
test more weight distributions
KillingSpark Oct 11, 2024
f7d32c7
add roundtrip test for huffman coding that's fit to also be fuzzed
KillingSpark Oct 11, 2024
cd794a8
fix roundtrip test
KillingSpark Oct 11, 2024
1980dbf
huffman streams are encoded backwards
KillingSpark Oct 11, 2024
90ede9e
rewrite bit writer to fill bytes starting at the lower bits
KillingSpark Oct 12, 2024
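The convention behind this rewrite is that zstd bitstreams fill each byte starting at the lowest bit. A minimal sketch of such a writer (hypothetical names, not the PR's implementation; a later commit additionally batches bits in a u64 before flushing):

```rust
/// Packs bit groups LSB-first: the first bits written land in the
/// low bits of the first output byte.
struct BitWriter {
    acc: u64,   // pending bits; the lowest bits are the oldest
    count: u32, // number of valid bits in `acc`
    out: Vec<u8>,
}

impl BitWriter {
    fn new() -> Self {
        BitWriter { acc: 0, count: 0, out: Vec::new() }
    }

    fn write_bits(&mut self, bits: u64, n: u32) {
        debug_assert!(n <= 56, "flush keeps count < 8, so n must leave headroom");
        self.acc |= (bits & ((1u64 << n) - 1)) << self.count;
        self.count += n;
        while self.count >= 8 {
            self.out.push((self.acc & 0xFF) as u8);
            self.acc >>= 8;
            self.count -= 8;
        }
    }

    /// Zero-pads the final partial byte and returns the stream.
    fn finish(mut self) -> Vec<u8> {
        if self.count > 0 {
            self.out.push((self.acc & 0xFF) as u8);
        }
        self.out
    }
}
```

Writing 0b101 (3 bits) then 0b11 (2 bits) yields the single byte 0b000_11_101: the first group occupies the lowest bits, as the format requires.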
bfa1c9e
start fuzzing huffman encoder
KillingSpark Oct 12, 2024
289e664
test that max num bits doesnt get too big
KillingSpark Oct 12, 2024
df60a7a
only execute fuzz targets when present
KillingSpark Oct 12, 2024
4c05304
Fix weight redistribution to limit to a specific number of bits
KillingSpark Oct 12, 2024
6d4920d
add two crashes that are already fixed
KillingSpark Oct 12, 2024
2726eaa
expand test to cover a weight distribution of all sizes
KillingSpark Oct 12, 2024
9d07056
cargo fmt
KillingSpark Oct 12, 2024
10ee6e1
no need to gate encoding behind std
KillingSpark Oct 12, 2024
f7a5ad2
mak clippy happy
KillingSpark Oct 12, 2024
ed219cc
more std import shenanigans
KillingSpark Oct 12, 2024
0e2435b
cargo fmt
KillingSpark Oct 12, 2024
02c4789
start fse encoder
KillingSpark Oct 13, 2024
d5e2543
need different table representation for encoding
KillingSpark Oct 13, 2024
96925de
build fse table for encoding
KillingSpark Oct 13, 2024
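The core of FSE table construction is the symbol-spreading walk from the zstd spec: slots are visited with step (size>>1) + (size>>3) + 3, which is odd for table sizes ≥ 16 and therefore hits every slot exactly once. A simplified sketch (it ignores the spec's special end-of-table placement for "less than 1" probabilities, and is not this PR's code):

```rust
/// Spreads each symbol over `probability` slots of a table of size
/// 1 << acc_log. The probabilities must sum to the table size.
fn spread_symbols(probs: &[(u8, usize)], acc_log: u32) -> Vec<u8> {
    assert!(acc_log >= 5, "zstd's minimum accuracy log is 5");
    let size = 1usize << acc_log;
    assert_eq!(probs.iter().map(|&(_, p)| p).sum::<usize>(), size);
    let step = (size >> 1) + (size >> 3) + 3; // odd, hence coprime with size
    let mask = size - 1;
    let mut table = vec![0u8; size];
    let mut pos = 0usize;
    for &(sym, p) in probs {
        for _ in 0..p {
            table[pos] = sym;
            pos = (pos + step) & mask;
        }
    }
    // After `size` steps with an odd step we are back at slot 0,
    // having written every slot exactly once.
    assert_eq!(pos, 0);
    table
}
```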
22bfdb3
start test for fse table creation, fix edgecase for 0 probability sym…
KillingSpark Oct 13, 2024
9039fd9
table generation now matches
KillingSpark Oct 13, 2024
4ca250d
first happy fse roundtrip
KillingSpark Oct 13, 2024
b97f962
cargo fmt
KillingSpark Oct 13, 2024
55b2918
unnecessary import
KillingSpark Oct 13, 2024
fd9c968
more roundtrips
KillingSpark Oct 13, 2024
3221f28
fuzz fse encoder
KillingSpark Oct 13, 2024
2f4e5d0
minimum acc_log is 5
KillingSpark Oct 13, 2024
78ce10a
implement fse table encoding
KillingSpark Oct 13, 2024
8ebd0db
misaligned should never report 8
KillingSpark Oct 13, 2024
b7b7d0b
encode huffman tables according to spec
KillingSpark Oct 15, 2024
561d7d5
start compression mode based solely on compressing literals
KillingSpark Oct 15, 2024
91d46cc
add test for a simple frame encode/decode cycle
KillingSpark Oct 15, 2024
9c3e610
fix blocksize for compressed blocks, a block may only contain 1<<18 -…
KillingSpark Oct 16, 2024
fc48aa5
add support for RLE blocks
KillingSpark Oct 16, 2024
e13886f
start fuzzing compression interop
KillingSpark Oct 16, 2024
cb8d4b7
added support for literals less than 5
KillingSpark Oct 16, 2024
402d180
fix single stream encoding
KillingSpark Oct 16, 2024
454ad7b
add compression and decoding to encoding fuzzer
KillingSpark Oct 16, 2024
fc91d6e
add compression and decoding to encoding fuzzer
KillingSpark Oct 16, 2024
1106501
huffman weights may only use a fse compression with 6 bits
KillingSpark Oct 17, 2024
eeb940f
implement decreasing the fse probabilities to fit max_log
KillingSpark Oct 17, 2024
61d5fe3
remove table modes from empty sequence sections
KillingSpark Oct 17, 2024
0176e0f
check in decoder if a sequence section with 0 sequences contains noth…
KillingSpark Oct 17, 2024
3d259ea
not using single segment frames and a window size allows inefficient …
KillingSpark Oct 17, 2024
45d5fba
switch back to huffman encoding literals
KillingSpark Oct 17, 2024
1a19804
improve interop fuzz
KillingSpark Oct 17, 2024
facb5d5
make window size big enough to fit whole 128kb block into it
KillingSpark Oct 18, 2024
932a12a
extend encodecorpus to test compression
KillingSpark Oct 18, 2024
3df96dd
if compressing a block makes it larger just encode it as a raw block …
KillingSpark Oct 18, 2024
79e1c99
raise the first fse probability to avoid 0 num bit states not the last
KillingSpark Oct 18, 2024
5ca4eaa
cargo fmt
KillingSpark Oct 18, 2024
443cac8
make clippy happy
KillingSpark Oct 18, 2024
aaad836
bitwriter can now work on a &mut Vec as output
KillingSpark Oct 20, 2024
e95fd06
4x huffman encoding doesnt use 4 vecs anymore
KillingSpark Oct 20, 2024
a8ab25e
add compression support to zstd binary
KillingSpark Oct 20, 2024
79c2491
optimize bitwriter to collect bits in a u64 before flushing to the ou…
KillingSpark Oct 20, 2024
0f96d67
make source a generic read impl
KillingSpark Oct 20, 2024
39aac8e
make the target a generic write impl
KillingSpark Oct 20, 2024
c83918b
todo
KillingSpark Oct 20, 2024
eeb4e08
remove unnecessary code from hot function
KillingSpark Oct 21, 2024
aea6a90
remove one += from hot function
KillingSpark Oct 21, 2024
a355e30
more efficient check for hot path
KillingSpark Oct 21, 2024
00e61d6
dry common code
KillingSpark Oct 21, 2024
6b26b2d
employ bufreader in zstd binary
KillingSpark Oct 23, 2024
8493d58
start match finding algorithm
KillingSpark Oct 28, 2024
262c034
start implementing a skip list for use in the match generator
KillingSpark Oct 29, 2024
e175cd9
use hashmap to do find indexes of matches of a minimum length
KillingSpark Nov 17, 2024
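A toy sketch of this indexing idea (a hypothetical helper, not the PR's matcher, and the PR's actual minimum match length isn't stated here): record where every `min_len`-byte window occurs so earlier occurrences of the same bytes can be looked up directly.

```rust
use std::collections::HashMap;

/// Maps each `min_len`-byte window of `data` to the positions where it starts.
fn build_index(data: &[u8], min_len: usize) -> HashMap<&[u8], Vec<usize>> {
    let mut index: HashMap<&[u8], Vec<usize>> = HashMap::new();
    if min_len > 0 && data.len() >= min_len {
        for start in 0..=data.len() - min_len {
            index
                .entry(&data[start..start + min_len])
                .or_default()
                .push(start);
        }
    }
    index
}
```

A match finder can then take the earlier positions for the current window as candidates and extend each one to find the longest match.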
a6024dc
only return None if the whole last data slice has been processed
KillingSpark Nov 17, 2024
219dbe2
matcher can now also accept data that doesnt need matches
KillingSpark Nov 17, 2024
dc8068a
start working on encoding the sequences
KillingSpark Nov 17, 2024
14bc05d
implement translation tables for ll/ml/of tables to codes and additio…
KillingSpark Nov 17, 2024
c9f4f03
implement FSE encoding sequence codes
KillingSpark Nov 17, 2024
8457c25
default tables
KillingSpark Nov 18, 2024
44fcb62
offset in last window slice is calculated differently
KillingSpark Nov 18, 2024
845008d
properly encode number of sequences
KillingSpark Nov 18, 2024
3f34638
cargo fmt
KillingSpark Nov 18, 2024
5b0a142
matcher now correctly returns a last literals sequence for the last u…
Nov 21, 2024
e4fcb81
add debug assertions that check that matches are correctly generated
Nov 21, 2024
37ec1e8
simplify offset calc, doesn't need a special case for the last slice
Nov 21, 2024
fc2a5cb
offsets are encoded as +3 to allow 1,2,3 to represent the last 3 offsets
Nov 21, 2024
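This offset convention comes from the zstd format: a real match distance is stored as distance + 3, leaving the values 1, 2 and 3 free to refer to the three most recently used offsets. A sketch:

```rust
/// What a sequence's offset field can refer to.
enum Offset {
    /// Index 0..=2 into the history of the three most recent offsets.
    Repeat(usize),
    /// An actual match distance.
    New(usize),
}

/// Maps an offset to the value encoded in the sequence section.
fn offset_value(o: Offset) -> usize {
    match o {
        Offset::Repeat(i) => {
            assert!(i < 3);
            i + 1 // 1, 2, 3 select a repeated offset
        }
        Offset::New(distance) => distance + 3,
    }
}
```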
90e0e62
fix error in default table impl
Nov 21, 2024
a5774de
fix edgecases around small data slices or few sequences
Nov 21, 2024
25315c9
remove warning
Nov 21, 2024
7d32f52
fix doc test
Nov 22, 2024
70e1501
fix fuzzing
Nov 22, 2024
677ddf4
make clippy happy
Nov 22, 2024
741a3f9
if literals get bigger by encoding them with a huffman tree, just wri…
KillingSpark Nov 22, 2024
fd0ae1f
expand nostd_io module to include write_all, read_to_end and impl Wri…
KillingSpark Nov 22, 2024
b3f20f6
use slices instead of fixed size arrays as key for match generator
Nov 25, 2024
247d870
use more efficient method of windowing slices
Nov 25, 2024
0530eb6
cargo fmt
Nov 25, 2024
54939d4
disable clippy lint for collapsible if. I think it's clearer this way…
Nov 28, 2024
f9c2484
encoding now actually streams the data instead of reading the source …
KillingSpark Nov 28, 2024
1690bd9
delete residual eprintln
KillingSpark Nov 28, 2024
48fac5a
make clippy happy
KillingSpark Nov 28, 2024
afc8cc7
remove decoding check from zstd binary
KillingSpark Nov 28, 2024
cb07e1c
unused import
KillingSpark Nov 28, 2024
d81c76b
zstd binary reports compression progress now
KillingSpark Nov 28, 2024
7cbeeac
this definitely still needs work...
KillingSpark Nov 28, 2024
421d1c1
No need for a hashtable actually
KillingSpark Nov 29, 2024
8b42cb6
make clippy happy
KillingSpark Nov 29, 2024
699c075
better key function
Nov 29, 2024
899410d
cargo fmt
Nov 29, 2024
79e45c4
even better key function
Nov 29, 2024
f98a837
fix multiply overflow
KillingSpark Nov 29, 2024
7d23961
reenable test
KillingSpark Nov 29, 2024
869a7bf
better poly in key function
KillingSpark Nov 29, 2024
d17af02
improve match length finding
Dec 2, 2024
feb6f5f
inline makes a significant difference
Dec 2, 2024
228353b
document the matcher trait and make it slightly easier to use
Dec 2, 2024
9fbe7dd
cargo fmt
Dec 2, 2024
8988bc3
fix hash collision in test
Dec 2, 2024
b0c65fa
also reuse suffix store allocations
Dec 2, 2024
e99c64f
fix test and enable niche optimization for Option<NonZeroUsize>
Dec 2, 2024
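The niche optimization referenced here can be checked directly: because NonZeroUsize can never hold zero, the compiler uses the all-zero bit pattern to represent None, so the Option costs no space over a plain usize.

```rust
use core::mem::size_of;
use core::num::NonZeroUsize;

/// True when Option<NonZeroUsize> occupies no more space than usize.
fn option_is_free() -> bool {
    size_of::<Option<NonZeroUsize>>() == size_of::<usize>()
}
```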
124793c
std -> core import
Dec 2, 2024
7819c14
small optimization around fse encoding
Dec 4, 2024
28c3084
doc and comments in fse encoder
Dec 12, 2024
61829d7
doc and comments in huff0 encoder
Dec 13, 2024
b9029f0
more doc comments
Dec 13, 2024
a56a40b
doc and comments in match generator
Dec 13, 2024
4499aac
more doc and slightly better Matcher api
Dec 13, 2024
1237041
add convenience functions for compressing data
Dec 13, 2024
99dee17
update readme and changelog
Dec 13, 2024
6c2bcd7
cargo fmt
Dec 13, 2024
Files changed
2 changes: 1 addition & 1 deletion Changelog.md
@@ -33,4 +33,4 @@ This document records the changes made between versions, starting with version 0
* Added convenience functions to FrameDecoder to decode multiple frames from a buffer (https://github.com/philipc)

# After 0.7.3

* Add initial compression support
14 changes: 9 additions & 5 deletions Readme.md
@@ -9,8 +9,7 @@
A pure Rust implementation of the Zstandard compression algorithm, as defined in [this document](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md).

This crate contains a fully operational implementation of the decompression portion of the standard.

-*Work has started on a compressor, but it has not reached a point where the compressor provides any real function.* (CONTRIBUTORS WELCOME)
+It also provides a compressor which is usable, but it does not yet reach the speed, ratio or configurability of the original zstd library.

This crate is currently actively maintained.

@@ -19,9 +18,14 @@ This crate is currently actively maintained.
Feature complete on the decoder side. In terms of speed it is still behind the original C implementation which has a rust binding located [here](https://github.com/gyscos/zstd-rs).

On the compression side:
-- [x] Support for generating raw, uncompressed frames
-- [ ] Support for generating RLE compressed blocks
-- [ ] Support for generating compressed blocks at any compression level
+- Support for generating compressed blocks at any compression level
+  - [x] Uncompressed
+  - [x] Fastest (roughly level 1)
+  - [ ] Default (roughly level 3)
+  - [ ] Better (roughly level 7)
+  - [ ] Best (roughly level 11)
+- [ ] Checksums
+- [ ] Dictionaries

## Speed

8 changes: 8 additions & 0 deletions fuzz/Cargo.toml
@@ -31,3 +31,11 @@ path = "fuzz_targets/encode.rs"
[[bin]]
name = "interop"
path = "fuzz_targets/interop.rs"

[[bin]]
name = "huff0"
path = "fuzz_targets/huff0.rs"

[[bin]]
name = "fse"
path = "fuzz_targets/fse.rs"
19 changes: 16 additions & 3 deletions fuzz/fuzz_targets/encode.rs
@@ -4,8 +4,21 @@ extern crate ruzstd;
use ruzstd::encoding::{FrameCompressor, CompressionLevel};

fuzz_target!(|data: &[u8]| {
-let mut content = data;
-let mut compressor = FrameCompressor::new(data, CompressionLevel::Uncompressed);
 let mut output = Vec::new();
-compressor.compress(&mut output);
+let mut compressor = FrameCompressor::new(data, &mut output, CompressionLevel::Uncompressed);
+compressor.compress();
+
+let mut decoded = Vec::with_capacity(data.len());
+let mut decoder = ruzstd::FrameDecoder::new();
+decoder.decode_all_to_vec(&output, &mut decoded).unwrap();
+assert_eq!(data, &decoded);
+
+let mut output = Vec::new();
+let mut compressor = FrameCompressor::new(data, &mut output, CompressionLevel::Fastest);
+compressor.compress();
+
+let mut decoded = Vec::with_capacity(data.len());
+let mut decoder = ruzstd::FrameDecoder::new();
+decoder.decode_all_to_vec(&output, &mut decoded).unwrap();
+assert_eq!(data, &decoded);
});
8 changes: 8 additions & 0 deletions fuzz/fuzz_targets/fse.rs
@@ -0,0 +1,8 @@
#![no_main]
#[macro_use] extern crate libfuzzer_sys;
extern crate ruzstd;
use ruzstd::fse::round_trip;

fuzz_target!(|data: &[u8]| {
round_trip(data);
});
8 changes: 8 additions & 0 deletions fuzz/fuzz_targets/huff0.rs
@@ -0,0 +1,8 @@
#![no_main]
#[macro_use] extern crate libfuzzer_sys;
extern crate ruzstd;
use ruzstd::huff0::round_trip;

fuzz_target!(|data: &[u8]| {
round_trip(data);
});
21 changes: 19 additions & 2 deletions fuzz/fuzz_targets/interop.rs
@@ -33,10 +33,19 @@ fn encode_zstd(data: &[u8]) -> Result<Vec<u8>, std::io::Error> {

 fn encode_ruzstd_uncompressed(data: &mut dyn std::io::Read) -> Vec<u8> {
 let mut input = Vec::new();
 let mut output = Vec::new();
 data.read_to_end(&mut input).unwrap();
-let mut compressor = ruzstd::encoding::FrameCompressor::new(&input, ruzstd::encoding::CompressionLevel::Uncompressed);
-compressor.compress(&mut output);
+let mut compressor = ruzstd::encoding::FrameCompressor::new(input.as_slice(), &mut output, ruzstd::encoding::CompressionLevel::Uncompressed);
+compressor.compress();
 output
 }
+
+fn encode_ruzstd_compressed(data: &mut dyn std::io::Read) -> Vec<u8> {
+let mut input = Vec::new();
+let mut output = Vec::new();
+data.read_to_end(&mut input).unwrap();
+let mut compressor = ruzstd::encoding::FrameCompressor::new(input.as_slice(), &mut output, ruzstd::encoding::CompressionLevel::Fastest);
+compressor.compress();
+output
+}

@@ -69,4 +78,12 @@ fuzz_target!(|data: &[u8]| {
decoded, data,
"Decoded data did not match the original input during compression"
);
// Compressed encoding
let mut input = data;
let compressed = encode_ruzstd_compressed(&mut input);
let decoded = decode_zstd(&compressed).unwrap();
assert_eq!(
decoded, data,
"Decoded data did not match the original input during compression"
);
});
67 changes: 62 additions & 5 deletions src/bin/zstd.rs
@@ -1,10 +1,14 @@
extern crate ruzstd;
use std::fs::File;
use std::io::BufReader;
use std::io::Read;
use std::io::Seek;
use std::io::SeekFrom;
use std::io::Write;
use std::time::Instant;

use ruzstd::encoding::CompressionLevel;
use ruzstd::encoding::FrameCompressor;
use ruzstd::frame::ReadFrameHeaderError;
use ruzstd::frame_decoder::FrameDecoderError;

@@ -18,11 +22,7 @@ struct StateTracker {
old_percentage: i8,
}

-fn main() {
-    let mut file_paths: Vec<_> = std::env::args().filter(|f| !f.starts_with('-')).collect();
-    let flags: Vec<_> = std::env::args().filter(|f| f.starts_with('-')).collect();
-    file_paths.remove(0);
-
+fn decompress(flags: &[String], file_paths: &[String]) {
if !flags.contains(&"-d".to_owned()) {
eprintln!("This zstd implementation only supports decompression. Please add a \"-d\" flag");
return;
@@ -128,6 +128,63 @@
}
}

struct PercentPrintReader<R: Read> {
total: usize,
counter: usize,
last_percent: usize,
reader: R,
}

impl<R: Read> Read for PercentPrintReader<R> {
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
let new_bytes = self.reader.read(buf)?;
self.counter += new_bytes;
let progress = self.counter * 100 / self.total;
if progress > self.last_percent {
self.last_percent = progress;
eprint!("\r");
eprint!("{} % done", progress);
}
Ok(new_bytes)
}
}

fn main() {
let mut file_paths: Vec<_> = std::env::args().filter(|f| !f.starts_with('-')).collect();
let flags: Vec<_> = std::env::args().filter(|f| f.starts_with('-')).collect();
file_paths.remove(0);

if flags.is_empty() {
for path in file_paths {
let start_instant = Instant::now();
let file = std::fs::File::open(&path).unwrap();
let input_len = file.metadata().unwrap().len() as usize;
let file = PercentPrintReader {
reader: BufReader::new(file),
total: input_len,
counter: 0,
last_percent: 0,
};
let mut output = Vec::new();
let mut encoder = FrameCompressor::new(file, &mut output, CompressionLevel::Fastest);
encoder.compress();
println!(
"Compressed {path:} from {} to {} ({}%) took {}ms",
input_len,
output.len(),
if input_len == 0 {
0
} else {
output.len() * 100 / input_len
},
start_instant.elapsed().as_millis()
);
}
} else {
decompress(&flags, &file_paths);
}
}

fn do_something(data: &[u8], s: &mut StateTracker) {
//Do something. Like writing it to a file or to stdout...
std::io::stdout().write_all(data).unwrap();
7 changes: 7 additions & 0 deletions src/decoding/block_decoder.rs
@@ -447,6 +447,13 @@ impl BlockDecoder {
vprintln!("Executing sequences");
execute_sequences(workspace)?;
} else {
if !raw.is_empty() {
return Err(DecompressBlockError::DecodeSequenceError(
DecodeSequenceError::ExtraBits {
bits_remaining: raw.len() as isize * 8,
},
));
}
workspace.buffer.push(&workspace.literals_buffer);
workspace.sequences.clear();
}
2 changes: 1 addition & 1 deletion src/decoding/decodebuffer.rs
@@ -285,7 +285,7 @@
amount: usize,
}

-impl<'a> Drop for DrainGuard<'a> {
+impl Drop for DrainGuard<'_> {
fn drop(&mut self) {
if self.amount != 0 {
self.buffer.drop_first_n(self.amount);
1 change: 1 addition & 0 deletions src/decoding/ringbuffer.rs
@@ -362,6 +362,7 @@ impl RingBuffer {
unsafe { copy_bytes_overshooting(src, dst, len - after_tail) }
}
} else {
#[allow(clippy::collapsible_else_if)]
if self.head + start > self.cap {
// Continuous read section and destination section:
//