Moar Subqueries

Collect-First Pattern (“Go Up Before Drilling Down”)

A common problem: you need to both aggregate the full dataset AND filter it based on those aggregation results. But SuperDB streams data — once it’s consumed, it’s gone.

The collect-first pattern solves this by buffering everything into a single record, then using lateral subqueries to derive summaries while keeping access to all the data:

from data.json
| collect(this) | {data: this}
| put top_ten := [
    unnest data
    | aggregate count := count() by table
    | sort -r count
    | head 10
    | values table
  ]
| unnest data
| where table in top_ten
| aggregate count := count() by table, bucket(ts, 1h)
| sort table, bucket

The idea: collect everything first (“go up”), derive what you need (top ten tables), then drill back down into the raw data using those results as a filter.

Tradeoff: This buffers the entire dataset into memory. For large datasets, consider the fork-and-join approach from Subqueries instead, which stays streamable.

Fork

One hassle to this approach is the limit of 2 forks. Nesting forks works, but makes constructing this query a bit more difficult.

Full Sub-Selects

Much slower than pipe-style subqueries because the data file gets re-read each time.

select
(select count(*)
from './moar_subqueries.sup'
where win is not null) as total_games,
...

All Other SQL-Syntax Subqueries

They all take about the same amount of wall-time, but CPU usage is much higher due to re-reading the file each time.


This site uses Just the Docs, a documentation theme for Jekyll.