Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hardcoded filter3x3 function to benchmark current generic filter function against #644

Closed
theotherphil opened this issue May 26, 2024 · 19 comments

Comments

@theotherphil
Copy link
Contributor

No description provided.

@ripytide
Copy link
Member

It seems #646 adds an implementation for 3x3, 5x5, 7x7, this might be closable then. Or is there something I'm missing with this issue as compared to #643?

@theotherphil
Copy link
Contributor Author

This issue is to add an implementation of filter3x3 along the lines of what we had when this library was first written - a raw loop that averages the pixel values at 9 fixed offsets from the current coordinate - to check that we don’t accidentally end up with an implementation that’s slower than that for this use case.

@cospectrum
Copy link
Contributor

The filter has become 4 times slower

@cospectrum
Copy link
Contributor

cospectrum commented May 27, 2024

malloc for each pixel 😮‍💨

@ripytide
Copy link
Member

Not if the compiler optimizes it out, which I'm curious about seeing as it was the same speed as the previous implementation according my benchmarking. I'm hoping the Pixel trait will get compile-time length info at some point which would allow us to remove the heap allocations.

One method would be if image switches to rgb's 0.9 release which I'm working on atm.

@theotherphil
Copy link
Contributor Author

@cospectrum which versions are you comparing?

@cospectrum
Copy link
Contributor

@cospectrum which versions are you comparing?

master and 0.25

@theotherphil
Copy link
Contributor Author

Ouch. I’ll take a look today. @ripytide has benchmarks showing no impact at #608 (comment) so something is odd here.

@theotherphil
Copy link
Contributor Author

theotherphil commented May 27, 2024

Comparing v0.25.0 and current master (5e00bd1) on a 2021 MacBook Pro with an M1 Max chip.

The function signatures and set of benchmarks have both changed across this range so I've

  • Created a filter5x5 and filter7x7 function in version 0.25 which are just copy-pasted from filter3x3 with a different kernel size
  • Added the following benchmarks
// For v0.25
    #[bench]
    fn bench_filter3x3_i32_filter(b: &mut Bencher) {
        let image = gray_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1,
            -2, 0, 2,
            -1, 0, 1
        ];

        b.iter(|| {
            let filtered: ImageBuffer<Luma<i16>, Vec<i16>> =
                filter3x3::<_, _, i16>(&image, &kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter5x5_i32_filter(b: &mut Bencher) {
        let image = gray_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3,
            -2, 0, 2, 3, 5,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9
        ];

        b.iter(|| {
            let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
            filter5x5::<_, _, i16>(&image, &kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter7x7_i32_filter(b: &mut Bencher) {
        let image = gray_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3, -2, 1,
            -2, 0, 2, 3, 5, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1
        ];

        b.iter(|| {
            let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
            filter7x7::<_, _, i16>(&image, &kernel);
            black_box(filtered);
        });
    }
// For master
    #[bench]
    fn bench_filter3x3_i32_filter(b: &mut Bencher) {
        let image = gray_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1,
            -2, 0, 2,
            -1, 0, 1
        ];
        let kernel = Kernel::new(&kernel, 3, 3);

        b.iter(|| {
            let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
            filter_clamped::<_, _, i16>(&image, kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter5x5_i32_filter(b: &mut Bencher) {
        let image = gray_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3,
            -2, 0, 2, 3, 5,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9
        ];
        let kernel = Kernel::new(&kernel, 5, 5);

        b.iter(|| {
            let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
            filter_clamped::<_, _, i16>(&image, kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter7x7_i32_filter(b: &mut Bencher) {
        let image = gray_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3, -2, 1,
            -2, 0, 2, 3, 5, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1
        ];
        let kernel = Kernel::new(&kernel, 7, 7);

        b.iter(|| {
            let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
            filter_clamped::<_, _, i16>(&image, kernel);
            black_box(filtered);
        });
    }

Results:

// Master
filter::benches::bench_filter3x3_i32_filter                                      ... bench:   2,113,838.52 ns/iter (+/- 146,344.38)
filter::benches::bench_filter5x5_i32_filter                                      ... bench:   5,102,450.00 ns/iter (+/- 1,087,624.48)
filter::benches::bench_filter7x7_i32_filter                                      ... bench:   9,306,050.00 ns/iter (+/- 1,542,055.76)

// v0.25
filter::benches::bench_filter3x3_i32_filter                                      ... bench:   1,800,742.73 ns/iter (+/- 97,078.87)
filter::benches::bench_filter5x5_i32_filter                                      ... bench:   2,032,997.93 ns/iter (+/- 444,828.73)
filter::benches::bench_filter7x7_i32_filter                                      ... bench:   4,564,652.15 ns/iter (+/- 1,259,277.64)

So that's 1.17x, 2.55x and 2.04x slowdowns for 3x3, 5x5 and 7x7 kernels on 500x500 grayscale images.

Master now has benchmarks for 3x3, 5x5 and 7x7 images as standard, although with the image size reduced to 300x300, so these comparisons should be easier in future.

@ripytide
Copy link
Member

@theotherphil could you push your modified v0.25 branch so I can use that for my benchmark?

@theotherphil
Copy link
Contributor Author

This branch has the new benchmarks plus filter5x5 and filter7x7: https://github.com/image-rs/imageproc/tree/v0.25.0-branch

@theotherphil
Copy link
Contributor Author

I'll add some for RGB images too, in case the regressions are larger than than for grayscale.

@theotherphil
Copy link
Contributor Author

RGB:

// v0.25
    #[bench]
    fn bench_filter3x3_i32_filter_rgb(b: &mut Bencher) {
        let image = rgb_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1,
            -2, 0, 2,
            -1, 0, 1
        ];

        b.iter(|| {
            let filtered: ImageBuffer<Rgb<i16>, Vec<i16>> =
                filter3x3::<_, _, i16>(&image, &kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter5x5_i32_filter_rgb(b: &mut Bencher) {
        let image = rgb_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3,
            -2, 0, 2, 3, 5,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9
        ];

        b.iter(|| {
            let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
            filter5x5::<_, _, i16>(&image, &kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter7x7_i32_filter_rgb(b: &mut Bencher) {
        let image = rgb_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3, -2, 1,
            -2, 0, 2, 3, 5, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1
        ];

        b.iter(|| {
            let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
            filter7x7::<_, _, i16>(&image, &kernel);
            black_box(filtered);
        });
    }
// Master
    #[bench]
    fn bench_filter3x3_i32_filter_rgb(b: &mut Bencher) {
        let image = rgb_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1,
            -2, 0, 2,
            -1, 0, 1
        ];
        let kernel = Kernel::new(&kernel, 3, 3);

        b.iter(|| {
            let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
            filter_clamped::<_, _, i16>(&image, kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter5x5_i32_filter_rgb(b: &mut Bencher) {
        let image = rgb_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3,
            -2, 0, 2, 3, 5,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9,
            -1, 0, 1, 6, 9
        ];
        let kernel = Kernel::new(&kernel, 5, 5);

        b.iter(|| {
            let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
            filter_clamped::<_, _, i16>(&image, kernel);
            black_box(filtered);
        });
    }

    #[bench]
    fn bench_filter7x7_i32_filter_rgb(b: &mut Bencher) {
        let image = rgb_bench_image(500, 500);
        #[rustfmt::skip]
        let kernel: Vec<i32> = vec![
            -1, 0, 1, 2, 3, -2, 1,
            -2, 0, 2, 3, 5, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1,
            -1, 0, 1, 6, 9, -2, 1
        ];
        let kernel = Kernel::new(&kernel, 7, 7);

        b.iter(|| {
            let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
            filter_clamped::<_, _, i16>(&image, kernel);
            black_box(filtered);
        });
    }

Results:

// Master
filter::benches::bench_filter3x3_i32_filter_rgb                                  ... bench:  19,248,650.10 ns/iter (+/- 2,161,878.78)
filter::benches::bench_filter5x5_i32_filter_rgb                                  ... bench:  26,113,808.30 ns/iter (+/- 1,566,443.61)
filter::benches::bench_filter7x7_i32_filter_rgb                                  ... bench:  33,759,604.10 ns/iter (+/- 819,419.97)

// v0.25
filter::benches::bench_filter3x3_i32_filter_rgb                                  ... bench:   2,414,054.20 ns/iter (+/- 51,805.40)
filter::benches::bench_filter5x5_i32_filter_rgb                                  ... bench:   5,215,018.75 ns/iter (+/- 1,126,537.41)
filter::benches::bench_filter7x7_i32_filter_rgb                                  ... bench:  10,263,666.60 ns/iter (+/- 364,188.05)

The regressions are much larger for RGB images.

@ripytide
Copy link
Member

0.25

test filter::benches::bench_bilateral_filter                                          ... bench:  75,295,576 ns/iter (+/- 211,489)
test filter::benches::bench_box_filter                                                ... bench:   2,070,778 ns/iter (+/- 12,442)
test filter::benches::bench_filter3x3_i32_filter                                      ... bench:   3,099,127 ns/iter (+/- 24,920)
test filter::benches::bench_filter3x3_i32_filter_rgb                                  ... bench:   3,375,630 ns/iter (+/- 131,017)
test filter::benches::bench_filter5x5_i32_filter                                      ... bench:   2,390,178 ns/iter (+/- 92,977)
test filter::benches::bench_filter5x5_i32_filter_rgb                                  ... bench:   9,743,942 ns/iter (+/- 164,547)
test filter::benches::bench_filter7x7_i32_filter                                      ... bench:   4,318,360 ns/iter (+/- 12,310)
test filter::benches::bench_filter7x7_i32_filter_rgb                                  ... bench:  16,803,183 ns/iter (+/- 749,961)
test filter::benches::bench_gaussian_f32_stdev_1                                      ... bench:     287,123 ns/iter (+/- 3,185)
test filter::benches::bench_gaussian_f32_stdev_10                                     ... bench:   1,384,883 ns/iter (+/- 10,495)
test filter::benches::bench_gaussian_f32_stdev_3                                      ... bench:     527,646 ns/iter (+/- 3,742)
test filter::benches::bench_horizontal_filter                                         ... bench:     904,917 ns/iter (+/- 8,832)
test filter::benches::bench_separable_filter                                          ... bench:     666,254 ns/iter (+/- 2,835)
test filter::benches::bench_vertical_filter                                           ... bench:     940,404 ns/iter (+/- 4,227)

Master

test filter::benches::bench_box_filter                                                ... bench:   2,064,973 ns/iter (+/- 3,673)
test filter::benches::bench_filter_clamped_gray_3x3                                   ... bench:   1,160,258 ns/iter (+/- 42,470)
test filter::benches::bench_filter_clamped_gray_5x5                                   ... bench:   2,610,301 ns/iter (+/- 5,058)
test filter::benches::bench_filter_clamped_gray_7x7                                   ... bench:   4,679,867 ns/iter (+/- 16,102)
test filter::benches::bench_filter_clamped_parallel_gray_3x3                          ... bench:     295,250 ns/iter (+/- 138,320)
test filter::benches::bench_filter_clamped_parallel_gray_5x5                          ... bench:     601,325 ns/iter (+/- 53,330)
test filter::benches::bench_filter_clamped_parallel_gray_7x7                          ... bench:   1,075,725 ns/iter (+/- 44,041)
test filter::benches::bench_gaussian_f32_stdev_1                                      ... bench:     295,152 ns/iter (+/- 1,174)
test filter::benches::bench_gaussian_f32_stdev_10                                     ... bench:   1,383,488 ns/iter (+/- 5,924)
test filter::benches::bench_gaussian_f32_stdev_3                                      ... bench:     527,800 ns/iter (+/- 2,275)
test filter::benches::bench_horizontal_filter                                         ... bench:     908,493 ns/iter (+/- 9,717)
test filter::benches::bench_separable_filter                                          ... bench:     670,450 ns/iter (+/- 17,449)
test filter::benches::bench_vertical_filter                                           ... bench:     942,569 ns/iter (+/- 6,466)

@theotherphil
Copy link
Contributor Author

@ripytide are your Master benchmarks using a 300x300 image or 500x500?

@ripytide
Copy link
Member

Whatever the master branch is currently using, which looks like 300x300

@theotherphil
Copy link
Contributor Author

Ah, so your master and 0.25 benchmarks aren’t comparing like for like as I confusingly switched from 500x500 to 300x300 images when I added the 5x5 and 7x7 filter benchmarks.

@theotherphil
Copy link
Contributor Author

theotherphil commented May 27, 2024

After reverting the use of filter_pixel, the current implementation of filter_clamped is still 3x slower than a simple hardcoded implementation for grayscale images and 3x3 kernels: #658

@theotherphil
Copy link
Contributor Author

Benchmarks added. Created #664 to fix the performance issues with the current filter implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants