Improve algorithm to count digits in Long #413

Open
wants to merge 1 commit into
base: develop

@Egorand Egorand commented Nov 14, 2024

Copies the PR merged into Okio: square/okio#1548.

The algorithm is based on "Down Another Rabbit Hole" by Romain Guy.

TLDR: this algorithm improves the performance of calculating the number of digits in a Long number by 40%, based on Romain's benchmarks.
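
For readers who have not seen the blog post, the idea can be sketched roughly as follows: estimate the digit count from the bit length of the value, then correct the estimate with a single table lookup. This is a minimal, self-contained sketch for non-negative inputs only. The names `largestValueWithDigits` and `countDigitsSketch` are illustrative, and the table contents are an assumption rather than a copy of the table in this PR.

```kotlin
// Assumed table: entry d holds the largest non-negative Long with d decimal digits.
// Entry 0 is -1 so that every non-negative input ends up with at least 1 digit.
val largestValueWithDigits: LongArray = LongArray(20).also { table ->
    table[0] = -1L
    var powerOfTen = 1L
    for (digits in 1..18) {
        powerOfTen *= 10
        table[digits] = powerOfTen - 1 // 9, 99, ..., 999_999_999_999_999_999
    }
    table[19] = Long.MAX_VALUE // 10^19 - 1 does not fit in a Long
}

// Counts the decimal digits of a non-negative Long (illustrative sketch).
fun countDigitsSketch(v: Long): Int {
    require(v >= 0) { "this sketch handles non-negative values only" }
    // Bit length times 10/32 (0.3125, slightly above log10(2)) gives a guess
    // that is either the exact digit count or one too small.
    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
    // One comparison against the table fixes the "one too small" case.
    return guess + (if (v > largestValueWithDigits[guess]) 1 else 0)
}
```

For example, `countDigitsSketch(0L)` yields 1 and `countDigitsSketch(Long.MAX_VALUE)` yields 19; negative inputs would have to be handled by the caller.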

@@ -135,6 +107,34 @@ public fun Sink.writeDecimalLong(long: Long) {
}
}

private fun countDigitsIn(v: Long): Int {
    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
    return guess + (if (v > DigitCountToLargestValue[guess]) 1 else 0)

Collaborator

@fzhinkin fzhinkin Nov 14, 2024

IIRC from when I read Romain's blog post, extending DigitCountToLargestValue's length to the next power of two (32 in this case) and replacing DigitCountToLargestValue[guess] with DigitCountToLargestValue[guess.and(0x1f)] can win a few extra percent of performance on the JVM (as it should optimize out the bounds checks performed on array access).

Author

DigitCountToLargestValue is actually slightly different from the table used in the blog post:

private val PowersOfTen = longArrayOf(
    0,
    10,
    100,
    1000,
    10000,
    100000,
    1000000,
    10000000,
    100000000,
    1000000000,
    10000000000,
    100000000000,
    1000000000000,
    10000000000000,
    100000000000000,
    1000000000000000,
    10000000000000000,
    100000000000000000,
    1000000000000000000
)

The main reason is that the original table doesn't work when the input is Long.MAX_VALUE, as it's bigger than 10^18 (last value in the array), but 10^19 is outside of the Long range.

I wonder if the one in the PR performs better? Worth benchmarking them against each other?

Collaborator

What I meant is that loads from the DigitCountToLargestValue table are compiled into code that checks whether the index is within the array's bounds before performing the load.
However, if the compiler can prove that indices are always in bounds, it will skip generating the check.
By expanding the table to a power-of-two length (filling the meaningless cells with, say, -1) and then explicitly truncating the index's most significant bits (i.e., taking the remainder of dividing the index by the table's length), we can hint to the compiler that the index is always in bounds, and it will generate faster code: https://gist.github.com/fzhinkin/42997a2cfc18a437f88e9c31bef969c9
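
A minimal sketch of this suggestion, written for illustration and not taken from the linked gist: the table is padded to 32 entries and the index is masked with `0x1f`, so it is provably in the range 0..31. The names `paddedDigitTable` and `countDigitsMasked` are hypothetical, the table contents are assumed, and only non-negative inputs are handled. Whether a given runtime actually drops the bounds check would need to be confirmed by benchmarking or by inspecting the generated code.

```kotlin
// Assumed table, padded to a power-of-two length (32). Cells 20..31 are never
// reached for valid guesses and are filled with -1, as suggested above.
val paddedDigitTable: LongArray = LongArray(32) { -1L }.also { table ->
    var powerOfTen = 1L
    for (digits in 1..18) {
        powerOfTen *= 10
        table[digits] = powerOfTen - 1 // largest value with `digits` digits
    }
    table[19] = Long.MAX_VALUE
}

// Counts the decimal digits of a non-negative Long using a masked table index.
fun countDigitsMasked(v: Long): Int {
    require(v >= 0) { "this sketch handles non-negative values only" }
    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
    // guess is at most 19 here, so the mask never changes it; its only purpose is to
    // let the compiler see that the index is always below the table length of 32.
    return guess + (if (v > paddedDigitTable[guess and 0x1f]) 1 else 0)
}
```

The result is the same as with the unpadded table; only the shape of the index expression differs.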

BTW I checked and on Android the power-of-two array + truncation doesn't remove the bounds check. It just adds an extra instruction. See https://godbolt.org/z/jdTzMcxbf

@fzhinkin
Collaborator

@Egorand thanks for opening the PR!

@fzhinkin
Collaborator

We have a benchmark on writeDecimalLong performance (this one), but it writes the same value over and over again, so the old implementation might have an advantage.

So I drafted a benchmark that writes a batch of different values:

@State(Scope.Benchmark)
open class DecimalLongWriteOnlyBenchmark : BufferRWBenchmarkBase() {
    val rng = Random(42)
    val limits = longArrayOf(
        0L,
        10L,
        100L,
        1000L,
        10000L,
        100000L,
        1000000L,
        10000000L,
        100000000L,
        1000000000L,
        10000000000L,
        100000000000L,
        1000000000000L,
        10000000000000L,
        100000000000000L,
        1000000000000000L,
        10000000000000000L,
        100000000000000000L,
        1000000000000000000L,
        Long.MAX_VALUE
    )

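    // Draw 10 random values from each range [limits[i - 1], limits[i]), so every digit
    // count from 1 to 19 is equally represented, then shuffle to avoid a predictable order.
    // TODO: It might be better to have values following a Zipf distribution
    val values = (1 ..< limits.size).asSequence()
        .flatMap {
            val lb = limits[it - 1]
            val up = limits[it]

            generateSequence { rng.nextLong(lb, up) }.take(10)
        }
        .toList()
        .shuffled(rng)
        .toLongArray()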

    override fun padding(): ByteArray {
        return with(Buffer()) {
            for (value in values) {
                writeDecimalLong(value)
                writeByte(' '.code.toByte())
            }
            readByteArray()
        }
    }

    @Benchmark
    fun benchmark() {
        val sz = buffer.size
        for (value in values) {
            buffer.writeDecimalLong(value)
            buffer.writeByte(' '.code.toByte())
        }
        buffer.skip(buffer.size - sz)
    }
}

For some reason, code using the old implementation (from the develop branch) outperforms code using the new one (from this PR); results were collected on a MacBook with an Apple Silicon M3 CPU, JDK 17.0.12:

# results for the benchmark built from develop branch
Benchmark                                (minGap)   Mode  Cnt       Score      Error  Units
DecimalLongWriteOnlyBenchmark.benchmark       128  thrpt   15  387634.472 ± 2489.095  ops/s
# result for the benchmark built from this PR:
Benchmark                                (minGap)   Mode  Cnt       Score     Error  Units
DecimalLongWriteOnlyBenchmark.benchmark       128  thrpt   15  362477.693 ± 869.341  ops/s

Throughput with the PR build is about 6.5% lower in this benchmark, so it's worth checking what's causing the regression.

@Egorand
Author

Egorand commented Nov 15, 2024

> For some reason, code using the old implementation (from the develop) outperforms code using the new one (from this PR)

That's interesting! @romainguy - wonder if you could share your benchmarks for comparison, and whether you have thoughts on what could be causing the results.

I'll find some time to dig deeper and investigate!

@romainguy

I don't have the original benchmark, but it was run on Android rather than the JVM, so the runtime and hardware were different. However, I used a dataset with a Zipf distribution to be somewhat realistic and to avoid favoring well-predicted branches.

@fzhinkin's trick is something I've used in the past (it works great in C++ but for other reasons) and it's definitely worth a try.
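
To make the Zipf-distributed dataset mentioned in this comment more concrete, here is a rough sketch of one way such inputs could be generated. It is not the original benchmark data. It assumes that the digit count d in 1..19 is drawn with probability proportional to 1/d^exponent and that the value is then drawn uniformly among the values with d digits. The function name `zipfDigitValues` and its parameters are hypothetical.

```kotlin
import kotlin.math.pow
import kotlin.random.Random

// Generates `count` non-negative longs whose digit counts follow a Zipf-like law:
// P(digits = d) is proportional to 1 / d^exponent for d in 1..19 (an assumption).
fun zipfDigitValues(count: Int, rng: Random, exponent: Double = 1.0): LongArray {
    // lowerBounds[d] is the smallest value with d digits; 0 counts as a 1-digit value.
    val lowerBounds = LongArray(20)
    var powerOfTen = 1L
    for (digits in 2..19) {
        powerOfTen *= 10
        lowerBounds[digits] = powerOfTen
    }

    // cumulative[d - 1] is the total weight of the digit counts 1..d.
    val cumulative = DoubleArray(19)
    var total = 0.0
    for (digits in 1..19) {
        total += 1.0 / digits.toDouble().pow(exponent)
        cumulative[digits - 1] = total
    }

    return LongArray(count) {
        val u = rng.nextDouble() * total
        // Pick the first digit count whose cumulative weight exceeds u.
        var digits = 1
        while (digits < 19 && cumulative[digits - 1] <= u) digits++
        val from = lowerBounds[digits]
        val until = if (digits == 19) Long.MAX_VALUE else lowerBounds[digits + 1]
        rng.nextLong(from, until) // `until` is exclusive
    }
}
```

Such an array could stand in for `values` in a benchmark like the one drafted earlier in this thread, for example `zipfDigitValues(190, Random(42))`; short numbers then dominate the input, and long ones still appear occasionally.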

@fzhinkin
Collaborator

> For some reason, code using the old implementation (from the develop) outperforms code using the new one (from this PR); results collected on MacBook w/ AS M3 CPU, JDK 17.0.12:

Worth mentioning: on an Android device, the version from this PR performs slightly better.
