2023-03-14
Pete Tucker, Kristin Tufte, Vassilis Papadimos, David Maier [Tuc+02].
I rate this 6.2.
A README.md, but formatted in beautiful, unreadable, dead-tree-sized, extra-hyphenated LaTeX, with a few single-use acronyms (SUAs) to keep it confusing.
XMark is a benchmark for XML query processing. This paper presents the Niagara Extension to XMark (NEXMark).
eBay scenario. New people registering, new items submitted for auction, bids continuously arriving for items. Static files on disk for category information.
Changes to XMark:
bids are absolute, not relative increases.
auction now has no closing price, a new closed_auction does.
closed_item only contains closing date and buyerid.
stream-in, stream-out queries. Continuous queries that take a stream as an input and output a stream.
NOT addressed by this benchmark: triggers, ad-hoc queries.
8 queries, Q5 to Q8 are window queries.
Test processing and parse speed of the stream system. Reference for the rest.
I would call this a filter. Outputs itemid and price where itemid is one of 5 specific numbers.
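A minimal sketch of that filter in Python, assuming bids arrive as dicts with `itemid` and `price` keys. The five ids below are placeholders, not the constants from the paper, and `select_bids` is a hypothetical name.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Placeholder item ids; the benchmark's actual constants are not reproduced here.
WANTED_ITEMS = frozenset({1, 2, 3, 4, 5})


def select_bids(bids: Iterable[Dict]) -> Iterator[Tuple[int, int]]:
    """Yield (itemid, price) for every bid on one of the wanted items."""
    for bid in bids:
        if bid["itemid"] in WANTED_ITEMS:
            yield bid["itemid"], bid["price"]
```

Because it is a generator, it is stream-in, stream-out: nothing is buffered and each bid is either passed on or dropped.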
Test join functionality. Joins a stream of new items to people, outputs when certain conditions are met. Watch out for closed auctions.
Join static category file to closed_auction stream, calculate average closing price for each category.
Time-based, sliding window, group-by operation. Every minute outputs the item with the most bids in the last hour.
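A rough sketch of the time-based sliding window, assuming bids arrive in timestamp order and timestamps are plain seconds. `HotItems` and its methods are hypothetical names. The caller would invoke `hottest` once a minute to get the "every minute" output.

```python
from collections import Counter, deque


class HotItems:
    """Track which item has the most bids inside a trailing time window."""

    def __init__(self, window_s=3600):
        self.window_s = window_s
        self.bids = deque()       # (timestamp, itemid), oldest first
        self.counts = Counter()   # itemid -> bids currently inside the window

    def add(self, ts, itemid):
        # Assumes bids are added in non-decreasing timestamp order.
        self.bids.append((ts, itemid))
        self.counts[itemid] += 1

    def hottest(self, now):
        # Evict bids that have fallen out of the window (now - window_s, now].
        while self.bids and self.bids[0][0] <= now - self.window_s:
            _, old = self.bids.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]
```

The deque keeps eviction cheap: each bid is appended once and popped once, so the cost per bid stays constant regardless of window length.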
Event-based, sliding window. For each closing auction, output the average selling price of the last 10 auctions by that seller.
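The event-based window can be sketched with a bounded deque per seller. This assumes each closing auction is delivered as a `(seller, price)` pair; `SellerAverage` is a hypothetical name.

```python
from collections import defaultdict, deque


class SellerAverage:
    """Average selling price over each seller's most recent closed auctions."""

    def __init__(self, last_n=10):
        # deque(maxlen=n) silently drops the oldest price once n are stored.
        self.prices = defaultdict(lambda: deque(maxlen=last_n))

    def on_close(self, seller, price):
        window = self.prices[seller]
        window.append(price)
        return sum(window) / len(window)
```

The window advances per event rather than per unit of time, so a slow seller's average may cover months while a busy seller's covers minutes.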
Time-based, fixed window. Every 10 minutes, return highest bid and itemid in the most recent 10 minutes.
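A sketch of the fixed (tumbling) window, assuming integer timestamps in seconds and bids given as `(timestamp, itemid, price)` tuples. It processes a finite batch for illustration; a real stream system would emit each window once it closes. `highest_bid_per_window` is a hypothetical name.

```python
def highest_bid_per_window(bids, window_s=600):
    """Map each window start time to the (itemid, price) of its highest bid."""
    best = {}
    for ts, itemid, price in bids:
        # Every timestamp belongs to exactly one window: [start, start + window_s).
        start = ts - ts % window_s
        if start not in best or price > best[start][1]:
            best[start] = (itemid, price)
    return best
```

Unlike the sliding windows, no bid is ever counted twice, so the state is one entry per open window.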
Time-based, sliding window. People who opened an auction in the last 12 hours.
Work in progress? Does what you think, knobs to turn for each stream.
Unbounded queries never terminate, so total running time can't be used as a metric.
Create output of ideal system (zero latency, perfect accuracy) offline, measure and compare online latency and results.
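One possible way to do that comparison, as a sketch only: the paper's exact metric definitions are not reproduced here. It assumes both runs are reduced to a mapping from a result key to its emission time in seconds; `compare_runs` and the three returned measures are my own choices.

```python
def compare_runs(ideal, observed):
    """Compare an online run against the offline ideal output.

    ideal, observed: dict mapping result key -> emission time in seconds.
    Returns (mean latency of matched results, fraction of ideal results
    produced, number of results the ideal system never emitted).
    """
    matched = [key for key in ideal if key in observed]
    latencies = [observed[key] - ideal[key] for key in matched]
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    recall = len(matched) / len(ideal) if ideal else 1.0
    spurious = len(set(observed) - set(ideal))
    return mean_latency, recall, spurious
```

Latency here is only defined for results that both runs produced, which is why accuracy has to be reported alongside it.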
In nexmark’s github there are some bonus queries lifted from Apache Beam, let’s see what we find.
Partitioned file system stress test.
Session windows. How many bids did a user make while active?
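A sketch of session windowing, assuming bids arrive in timestamp order as `(timestamp, bidder)` pairs and that a session ends after `gap_s` seconds without a bid. The gap value and the name `session_bid_counts` are illustrative assumptions, not taken from the benchmark.

```python
def session_bid_counts(bids, gap_s=10):
    """Return (bidder, session_start, session_end, bid_count) per session."""
    open_sessions = {}  # bidder -> [start, last_seen, count]
    closed = []
    for ts, bidder in bids:
        session = open_sessions.get(bidder)
        if session is not None and ts - session[1] > gap_s:
            # The user was idle for longer than the gap: close the old session.
            closed.append((bidder, session[0], session[1], session[2]))
            session = None
        if session is None:
            open_sessions[bidder] = [ts, ts, 1]
        else:
            session[1] = ts
            session[2] += 1
    # Input exhausted: flush whatever is still open.
    for bidder, session in open_sessions.items():
        closed.append((bidder, session[0], session[1], session[2]))
    return closed
```

Session boundaries depend on the data rather than the clock, so window sizes differ per user.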
Time based, fixed window. Count user bids within processing time window.
Join a stream to a bounded side input.
Complex projection and filter.
Multiple distinct aggregations with filters. How many distinct users bid within each price range.
Multiple distinct aggregations with filters for multiple keys. Distinct users bidding at different price levels, per channel.
Unbounded group aggregation. Bids and prices for each auction, per day.
Deduplicate query, last bid for each bidder.
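Deduplication by key can be sketched as keeping one record per bidder and overwriting it whenever a newer bid shows up. This assumes bids are dicts with `bidder` and `ts` keys; `last_bid_per_bidder` is a hypothetical name.

```python
def last_bid_per_bidder(bids):
    """Return a dict mapping each bidder to their most recent bid."""
    latest = {}
    for bid in bids:
        current = latest.get(bid["bidder"])
        # >= lets a later-arriving bid with an equal timestamp win.
        if current is None or bid["ts"] >= current["ts"]:
            latest[bid["bidder"]] = bid
    return latest
```

State grows with the number of distinct bidders, not with the number of bids.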
TOP-N bids per auction.
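A sketch of top-N per group over a finite batch, assuming bids are `(auction, price)` pairs. The default of 10 and the name `top_n_bids` are illustrative choices.

```python
import heapq
from collections import defaultdict


def top_n_bids(bids, n=10):
    """Return a dict mapping each auction to its n highest prices, descending."""
    by_auction = defaultdict(list)
    for auction, price in bids:
        by_auction[auction].append(price)
    return {auction: heapq.nlargest(n, prices) for auction, prices in by_auction.items()}
```

A streaming version would keep a size-n heap per auction instead of the full list, but the result is the same.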
Filter join. Get bids for specific category.
Regex? Parser? Disk? Not sure what this wants to stress, doesn’t feel like a window.
SPLIT_INDEX speed test? Not sure, not what I’m looking for though.