Subqueries
While there are many different types of subqueries, this document so far is just highlighting some common scenarios that may not have obvious implementations in superdb.
Correlated Subqueries
Let’s start with this simple dataset:
```json lines {“id”:1, “date”:”2025-02-27”, “foo”: 3} {“id”:2, “date”:”2025-02-27”, “foo”: 2} {“id”:3, “date”:”2025-02-28”, “foo”: 5} {“id”:4, “date”:”2025-02-28”, “foo”: 9}
And we want to select the entries with the largest `foo` for each date.
One way to do this in SQL looks like this:
```sql
select id, date, foo
from data
where (date, foo) in
(select date, max(foo) as max_foo
from data
group by date);
Another way, by joining a derived table:
select *
from data d
join
(select date, max(foo) as max_foo
from data
group by date) max_foo
on d.date = max_foo.date and
d.foo = max_foo.max_foo;
In super this can be done with piped operators and a Lateral Subquery:
```mdtest-input data.json {“id”:1, “date”:”2025-02-27”, “foo”: 3} {“id”:2, “date”:”2025-02-27”, “foo”: 2} {“id”:3, “date”:”2025-02-28”, “foo”: 5} {“id”:4, “date”:”2025-02-28”, “foo”: 9}
Here's an example using the piped `where` operator with `unnest ... into`:
```mdtest-command
super -s -c '
collect(this)
| {data: this}
| maxes:=[unnest this.data | foo:=max(foo) by date | values {date,foo}]
| unnest {this.maxes, this.data} into (
where {this.data.date, this.data.foo} in this.maxes
| values this.data
)' data.json
{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}
And a simpler example using the piped join operator with from:
super -s -c '
from data.json
| inner join (from data.json
| foo:=max(foo) by date
| values {date,foo})
on {left.date,left.foo}={right.date,right.foo}
| values left
| sort id'
{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}
super also supports SQL syntax, and these subqueries work1:
super -s -c '
select *
from "data.json"
where foo in (select max(foo), date
from "data.json"
group by date) '
{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}
If we save off the max data to a file first, then we can start to see how this could look:
super -s -c '
select max(foo) as max_foo
from "data.json"
group by date' > max.sup
super -s -c '
select l.id, l.date, l.foo
from "data.json" l
join "max.sup" r
on l.foo==r.max_foo'
{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}
Fork-and-Join: A Streamable Alternative
The lateral subquery approach above uses collect to buffer the entire input into a single value before iterating. This works well for small datasets, but collect has limits on how large a single value can be. For larger datasets, a fork-and-join approach avoids that limitation by keeping things streamable.
The idea is a self-join: raw data on one side, aggregated data on the other, joined on the matching fields.
super -s -c '
from data.json
| inner join (
from data.json
| foo := max(foo) by date
) on {left.date, left.foo}={right.date, right.foo}
| values left
| sort date'
{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}
This can also use fork to read the input once instead of naming the file twice:
super -s -c '
from data.json
| fork
( pass )
( foo := max(foo) by date )
| inner join on {left.date, left.foo}={right.date, right.foo}
| values left
| sort date'
{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}
With fork, the data flows through a single unnamed input — one branch passes records through, the other aggregates. The multi-field join key uses the {left.x, left.y}={right.x, right.y} record syntax (see multi-value joins).
The tradeoff: fork-and-join is more verbose, but it avoids the collect size limit and works with streaming pipelines.
Subquery with Related Data Join
A more realistic scenario: find the records with the top score per date, and also pull in user information from a related table.
```mdtest-input scores.json {“id”:1, “date”:”2025-02-27”, “score”: 3, “user_id”: 101} {“id”:2, “date”:”2025-02-27”, “score”: 2, “user_id”: 102} {“id”:3, “date”:”2025-02-28”, “score”: 5, “user_id”: 101} {“id”:4, “date”:”2025-02-28”, “score”: 9, “user_id”: 103}
```mdtest-input users.json
{"user_id": 101, "name": "Moxie"}
{"user_id": 102, "name": "Ziggy"}
{"user_id": 103, "name": "Sprocket"}
First, the basic join returns all records with user names:
super -s -c '
select s.id, s.date, s.score, s.user_id, u.name
from "scores.json" s
join "users.json" u on s.user_id = u.user_id
order by s.id'
{id:1,date:"2025-02-27",score:3,user_id:101,name:"Moxie"}
{id:2,date:"2025-02-27",score:2,user_id:102,name:"Ziggy"}
{id:3,date:"2025-02-28",score:5,user_id:101,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,user_id:103,name:"Sprocket"}
Filtering to top scores per date using a subquery:
super -s -c '
select *
from "scores.json"
where score in (select max(score), date
from "scores.json"
group by date)'
{id:1,date:"2025-02-27",score:3,user_id:101}
{id:4,date:"2025-02-28",score:9,user_id:103}
The “obvious” SQL approach with tuple comparison returns empty — this is a known issue (#6326):
super -s -c '
select s.id, s.date, s.score, s.user_id, u.name
from "scores.json" s
join "users.json" u on s.user_id = u.user_id
where (s.date, s.score) in (
select date, max(score)
from "scores.json"
group by date)'
A derived table approach (subquery in FROM) does work:
super -s -c '
select s.id, s.date, s.score, s.user_id, u.name
from "scores.json" s
join (
select date, max(score) as max_score
from "scores.json"
group by date
) m on s.date = m.date and s.score = m.max_score
join "users.json" u on s.user_id = u.user_id
order by s.id'
{id:1,date:"2025-02-27",score:3,user_id:101,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,user_id:103,name:"Sprocket"}
The piped approach also works — filter first, then join to get usernames:
super -s -c '
from "scores.json"
| where score in (select max(score), date from "scores.json" group by date)
| inner join (from "users.json") on left.user_id=right.user_id
| select left.id, left.date, left.score, right.name
| sort id'
{id:1,date:"2025-02-27",score:3,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,name:"Sprocket"}
as of versions
super --version
Version: v0.2.0
-
SQL subqueries that reference files re-read the file for each subquery, which increases CPU usage and wall time compared to the piped approach. ↩