Subqueries

While there are many different types of subqueries, this document so far is just highlighting some common scenarios that may not have obvious implementations in superdb.

Correlated Subqueries

Let’s start with this simple dataset:

```json lines {“id”:1, “date”:”2025-02-27”, “foo”: 3} {“id”:2, “date”:”2025-02-27”, “foo”: 2} {“id”:3, “date”:”2025-02-28”, “foo”: 5} {“id”:4, “date”:”2025-02-28”, “foo”: 9}

And we want to select the entries with the largest `foo` for each date.

One way to do this in SQL looks like this:

```sql
select id, date, foo
from data
where (date, foo) in
      (select date, max(foo) as max_foo
       from data
       group by date);

Another way, by joining a derived table:

select *
from data d
       join
     (select date, max(foo) as max_foo
      from data
      group by date) max_foo
     on d.date = max_foo.date and
        d.foo = max_foo.max_foo;

In super this can be done with piped operators and a Lateral Subquery:

```mdtest-input data.json {“id”:1, “date”:”2025-02-27”, “foo”: 3} {“id”:2, “date”:”2025-02-27”, “foo”: 2} {“id”:3, “date”:”2025-02-28”, “foo”: 5} {“id”:4, “date”:”2025-02-28”, “foo”: 9}

Here's an example using the piped `where` operator with `unnest ... into`:
```mdtest-command
super -s -c '
  collect(this)
  | {data: this}
  | maxes:=[unnest this.data | foo:=max(foo) by date | values {date,foo}]
  | unnest {this.maxes, this.data} into (
      where {this.data.date, this.data.foo} in this.maxes
      | values this.data
    )' data.json

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

And a simpler example using the piped join operator with from:

super -s -c '
  from data.json
  | inner join (from data.json
                | foo:=max(foo) by date
                | values {date,foo})
    on {left.date,left.foo}={right.date,right.foo}
  | values left
  | sort id'

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

super also supports SQL syntax, and these subqueries work¹:

super -s -c '
  select *
  from "data.json"
  where foo in (select max(foo), date
                from "data.json"
                group by date) '

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

If we save off the max data to a file first, then we can start to see how this could look:

super -s -c '
  select max(foo) as max_foo
  from "data.json"
  group by date' > max.sup

super -s -c '
  select l.id, l.date, l.foo
  from "data.json" l
    join "max.sup" r
    on l.foo==r.max_foo'

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

Fork-and-Join: A Streamable Alternative

The lateral subquery approach above uses collect to buffer the entire input into a single value before iterating. This works well for small datasets, but collect has limits on how large a single value can be. For larger datasets, a fork-and-join approach avoids that limitation by keeping things streamable.

The idea is a self-join: raw data on one side, aggregated data on the other, joined on the matching fields.

super -s -c '
  from data.json
  | inner join (
      from data.json
      | foo := max(foo) by date
    ) on {left.date, left.foo}={right.date, right.foo}
  | values left
  | sort date'

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

This can also use fork to read the input once instead of naming the file twice:

super -s -c '
  from data.json
  | fork
    ( pass )
    ( foo := max(foo) by date )
  | inner join on {left.date, left.foo}={right.date, right.foo}
  | values left
  | sort date'

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

With fork, the data flows through a single unnamed input — one branch passes records through, the other aggregates. The multi-field join key uses the {left.x, left.y}={right.x, right.y} record syntax (see multi-value joins).

The tradeoff: fork-and-join is more verbose, but it avoids the collect size limit and works with streaming pipelines.

A more realistic scenario: find the records with the top score per date, and also pull in user information from a related table.

```mdtest-input scores.json {“id”:1, “date”:”2025-02-27”, “score”: 3, “user_id”: 101} {“id”:2, “date”:”2025-02-27”, “score”: 2, “user_id”: 102} {“id”:3, “date”:”2025-02-28”, “score”: 5, “user_id”: 101} {“id”:4, “date”:”2025-02-28”, “score”: 9, “user_id”: 103}

```mdtest-input users.json
{"user_id": 101, "name": "Moxie"}
{"user_id": 102, "name": "Ziggy"}
{"user_id": 103, "name": "Sprocket"}

First, the basic join returns all records with user names:

super -s -c '
  select s.id, s.date, s.score, s.user_id, u.name
  from "scores.json" s
    join "users.json" u on s.user_id = u.user_id
  order by s.id'

{id:1,date:"2025-02-27",score:3,user_id:101,name:"Moxie"}
{id:2,date:"2025-02-27",score:2,user_id:102,name:"Ziggy"}
{id:3,date:"2025-02-28",score:5,user_id:101,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,user_id:103,name:"Sprocket"}

Filtering to top scores per date using a subquery:

super -s -c '
  select *
  from "scores.json"
  where score in (select max(score), date
                  from "scores.json"
                  group by date)'

{id:1,date:"2025-02-27",score:3,user_id:101}
{id:4,date:"2025-02-28",score:9,user_id:103}

The “obvious” SQL approach with tuple comparison returns empty — this is a known issue (#6326):

super -s -c '
  select s.id, s.date, s.score, s.user_id, u.name
  from "scores.json" s
    join "users.json" u on s.user_id = u.user_id
  where (s.date, s.score) in (
    select date, max(score)
    from "scores.json"
    group by date)'

A derived table approach (subquery in FROM) does work:

super -s -c '
  select s.id, s.date, s.score, s.user_id, u.name
  from "scores.json" s
    join (
      select date, max(score) as max_score
      from "scores.json"
      group by date
    ) m on s.date = m.date and s.score = m.max_score
    join "users.json" u on s.user_id = u.user_id
  order by s.id'

{id:1,date:"2025-02-27",score:3,user_id:101,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,user_id:103,name:"Sprocket"}

The piped approach also works — filter first, then join to get usernames:

super -s -c '
  from "scores.json"
  | where score in (select max(score), date from "scores.json" group by date)
  | inner join (from "users.json") on left.user_id=right.user_id
  | select left.id, left.date, left.score, right.name
  | sort id'

{id:1,date:"2025-02-27",score:3,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,name:"Sprocket"}

as of versions

super --version

Version: v0.2.0

SQL subqueries that reference files re-read the file for each subquery, which increases CPU usage and wall time compared to the piped approach. ↩

Subqueries

Correlated Subqueries

Fork-and-Join: A Streamable Alternative

Subquery with Related Data Join

as of versions