Copy & Paste

an unoriginal site
A curated content list

jmiguel

9-Minute Read

From: Richard Startin’s Blog

Counting over Range Predicates

Mar 27, 2022 | javaroaringRSS

This post follows on from my last post about selecting objects satisfying a range predicate, and instead looks at how to count the objects. If you can select objects, you can count them too, but it’s a simpler problem so resources can be saved with a specialised solution.

Suppose you want to count how many of the Transaction objects below satisfy complex filters, and don’t care about details about the objects themselves.

  public static final class Transaction {
    private final int quantity;
    private final long price;
    private final long timestamp;

    public Transaction(int quantity, long price, long timestamp) {
      this.quantity = quantity;
      this.price = price;
      this.timestamp = timestamp;
    }

    public int getQuantity() {
      return quantity;
    }

    public long getPrice() {
      return price;
    }

    public long getTimestamp() {
      return timestamp;
    }
  }

Once again, this sounds like a task for a database, but imagine you don’t have one to rely on.

You might use the streams API:

    transactions.stream()
    .filter(transaction -> transaction.quantity >= qty && transaction.price <= price
                    && transaction.timestamp >= begin && transaction.timestamp <= end)
            .count();

This will inspect every transaction to check if it matches the constraints, and will count the number of matches. For 1M transactions, with parameters which select about 250 transactions, this takes about 10ms on my laptop:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
stream1 avgt 1 5 9721.806835 27.935884 us/op 100 1 1000000

It took less time to actually select the objects, which, without any knowledge of the implementation, suggests that the objects are probably selected in order to be counted. The transactions are the same as those in the previous post, and sorted by time. This means reordering the conditions to the time filter is applied first is as effective as it was last time by reducing the number of branch misses.

    transactions.stream()
    .filter(transaction -> transaction.timestamp >= begin && transaction.timestamp <= end
                    && transaction.quantity >= qty && transaction.price <= price)
            .count();

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
stream1 avgt 1 5 9721.806835 27.935884 us/op 100 1 1000000
stream2 avgt 1 5 3283.291345 44.566458 us/op 100 1 1000000

Something easy to ignore in a Stream API benchmark is how much better the performance is when there is only one implementation of the filter, which allows the filter to be inlined. This makes the code in the benchmark artificially simple, when really the whole point of the API is that many filters would be used, which is a triumph of modularity over efficiency (but you should use the right tool for the job and know when such a difference matters). Artificially polluting the filter in the benchmark setup shows how artificial the result for stream2 was:

    @Setup(Level.Trial)
    public void setup() {
      super.setup();
      transactions.stream()
          .filter(transaction -> transaction.quantity < 0 && transaction.price > 1
              && transaction.timestamp >= 1 && transaction.timestamp <= Long.MAX_VALUE - 1)
          .count();
      transactions.stream()
          .filter(transaction -> transaction.quantity < 10 && transaction.price > 1000
              && transaction.timestamp >= -1 && transaction.timestamp <= Long.MAX_VALUE - 1)
          .count();
    }

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
stream1 avgt 1 5 9721.806835 27.935884 us/op 100 1 1000000
stream2 avgt 1 5 3283.291345 44.566458 us/op 100 1 1000000
stream2Polluted avgt 1 5 7801.823477 41.229256 us/op 100 1 1000000

Results

Inlining or no inlining, it’s inefficient to use linear search if the transactions are sorted, which they are (by time). As far as I’m aware there’s no way to communicate to the Stream API that a collection is already sorted and can be skipped over, and it doesn’t know how to crack filter predicates to exploit this anyway. So avoiding the Stream API and writing old fashioned Java code is effective. In the benchmark data, the time range predicate selects about 10% of the data, so the other two predicates should only apply to this much data. Assume the timestamps are unique for simplicity’s sake:

    int first = Collections.binarySearch(transactions, new Transaction(0, 0, begin),
            Comparator.comparingLong(Transaction::getTimestamp));
    first = first < 0 ? -first - 1 : first;
    int last = Collections.binarySearch(transactions, new Transaction(0, 0, end),
            Comparator.comparingLong(Transaction::getTimestamp));
    last = last < 0 ? -last - 1 : last;
    long count = 0;
    for (int i = first; i <= last; i++) {
      Transaction transaction = transactions.get(i);
      if (transaction.quantity >= qty && transaction.price <= price) {
        count++;
      }
    }

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
stream1 avgt 1 5 9721.806835 27.935884 us/op 100 1 1000000
stream2 avgt 1 5 3283.291345 44.566458 us/op 100 1 1000000
stream2Polluted avgt 1 5 7801.823477 41.229256 us/op 100 1 1000000
binarySearch avgt 1 5 786.482285 1.994657 us/op 100 1 1000000

Sorting is not included in the benchmark time, but data is often naturally sorted by time. It’s not a lot more code but goes a lot faster, and easier to measure because it’s so simple. JMH’s perfnorm profiler shows that there is still a good number of branches missed, despite shedding about 90% of the work. This means the instructions per cycle (IPC) is low:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
binarySearch avgt 1 5 786.482285 1.994657 us/op 100 1 1000000
binarySearch:IPC avgt 1 1 0.755879 NaN insns/clk 100 1 1000000
binarySearch:branch-misses avgt 1 1 50739.339319 NaN #/op 100 1 1000000
binarySearch:branches avgt 1 1 434330.541115 NaN #/op 100 1 1000000

This can be fixed by complicating the filter in the scan somewhat.

    long count = 0;
    for (int i = first; i <= last; i++) {
      Transaction transaction = transactions.get(i);
      count += (Math.min(1, Math.max(transaction.quantity - qty, 0))
        + Math.min(1, Math.max(price - transaction.price, 0))) >>> 1;
    }

If the transaction’s quantity is greater than the threshold, Math.max(transaction.quantity - threshold, 0) will be positive, wrapping this in Math.min(1, x) means that the count will be one if the predicate holds and zero if it doesn’t. This can be added to the other condition and divided by two so the count will be incremented by one when both conditions hold, and by zero when they don’t.

Math.max and Math.min are special and get compiled to conditional move instructions which aren’t speculated so can’t miss. They are annotated as @IntrinsicCandidate:

    @IntrinsicCandidate
    public static int max(int a, int b) {
        return (a >= b) ? a : b;
    }

    @IntrinsicCandidate
    public static int min(int a, int b) {
        return (a <= b) ? a : b;
    }

This means if you have a hot loop, they should be preferred to inlining the equivalent logic, which may or may not be compiled to the same code. They’re also more readable, in my opinion. The branch misses basically disappear, and the instructions per cycle is now much higher:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
binarySearch avgt 1 5 786.482285 1.994657 us/op 100 1 1000000
binarySearch:IPC avgt 1 1 0.755879 NaN insns/clk 100 1 1000000
binarySearch:branch-misses avgt 1 1 50739.339319 NaN #/op 100 1 1000000
binarySearch:branches avgt 1 1 434330.541115 NaN #/op 100 1 1000000
binarySearchBranchFreeScan avgt 1 5 299.725900 2.414697 us/op 100 1 1000000
binarySearchBranchFreeScan:IPC avgt 1 1 3.278388 NaN insns/clk 100 1 1000000
binarySearchBranchFreeScan:branch-misses avgt 1 1 524.965594 NaN #/op 100 1 1000000
binarySearchBranchFreeScan:branches avgt 1 1 403042.075820 NaN #/op 100 1 1000000

However, I am fairly certain that no analytical database written in Java implements counting over predicates like this. This is because fusing the predicate with aggregation is hostile to modularity, though generating code specialised to the problem is a good option if the implementation needs to be both fast and modular. It also won’t work with arbitrary numbers of predicates and needs to be padded to the next power of 2 so a matching result can be shifted down to one or zero in order to avoid integer division. If the JIT compiler were smarter and could inline the predicate evaluation into the count operator, I imagine it could transform modular code into this form, but it would probably get it wrong as often as it gets it right. So this is probably an artificially good result, but it is the best so far.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
stream1 avgt 1 5 9721.806835 27.935884 us/op 100 1 1000000
stream2 avgt 1 5 3283.291345 44.566458 us/op 100 1 1000000
stream2Polluted avgt 1 5 7801.823477 41.229256 us/op 100 1 1000000
binarySearch avgt 1 5 786.482285 1.994657 us/op 100 1 1000000
binarySearchBranchFreeScan avgt 1 5 299.725900 2.414697 us/op 100 1 1000000

Results

In my last post I used RangeBitmap - which is the data structure used in Apache Pinot‘s range index - to select transactions matching the predicate. Combined with binary search to apply the time range filter, it was much faster than any other implementation I measured.

I have recently implemented the ability to count values matching a predicate, instead of producing a bitmap of their offsets. The API is symmetric with the selection API:

  long index(int qty, long price, long begin, long end) {
    RoaringBitmap inTimeRange = timestampIndex.between(begin, end);
    RoaringBitmap matchesQuantity = quantityIndex.gte(qty, inTimeRange);
    return priceIndex.lteCardinality(price, matchesQuantity);
  }

  long binarySearchThenIndex(int qty, long price, long begin, long end) {
    int first = Collections.binarySearch(transactions, new Transaction(0, 0, begin),
            Comparator.comparingLong(Transaction::getTimestamp));
    int last = Collections.binarySearch(transactions, new Transaction(0, 0, end),
            Comparator.comparingLong(Transaction::getTimestamp));
    RoaringBitmap inTimeRange = RoaringBitmap.bitmapOfRange(first, last + 1);
    RoaringBitmap matchesQuantity = quantityIndex.gte(qty, inTimeRange);
    return priceIndex.lteCardinality(price, matchesQuantity);
  }

RangeBitmap is likely to be suboptimal for counting because the data structure is optimised for producing a sorted output set, which isn’t required for counting, but if you have a data stucture to support selection it ought to be able to provide a count without materialising intermediate results. For selection, just using an index for all three attributes and combining the results was a lot faster than binarySearch but for counting it’s a little disappointing at the moment. I haven’t tried to optimise the counting methods yet so there is probably a lot of low-hanging fruit. Cutting the range down with the binary search first produces a good result, even better than the hard to modularise binarySearchBranchFreeScan:

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: minPrice Param: minQuantity Param: size
stream1 avgt 1 5 9721.806835 27.935884 us/op 100 1 1000000
stream2 avgt 1 5 3283.291345 44.566458 us/op 100 1 1000000
stream2Polluted avgt 1 5 7801.823477 41.229256 us/op 100 1 1000000
binarySearch avgt 1 5 786.482285 1.994657 us/op 100 1 1000000
binarySearchBranchFreeScan avgt 1 5 299.725900 2.414697 us/op 100 1 1000000
index avgt 1 5 666.951907 14.861610 us/op 100 1 1000000
binarySearchThenIndex avgt 1 5 78.191544 17.903472 us/op 100 1 1000000

Results

RangeBitmap was designed for use in Apache Pinot, so is compressed and supports zero-copy mapping to and from disk. There are more details about how it works in depth in RangeBitmap - How range indexes work in Apache Pinot.

The benchmark used in this post is not particularly scientific but is here. If you want to run it, you will get different numbers, but the rank of each benchmark score should not change.


Original article: https://richardstartin.github.io/posts/range-counts

Richard Startin’s Blog

Recent Posts

Categories

About

This place is just where I copy and paste articles I find interesting and can helpful for me in the future. No original content here!