Subqueries

While there are many different types of subqueries, this document so far is just highlighting some common scenarios that may not have obvious implementations in superdb.

Correlated Subqueries

Let’s start with this simple dataset:

{"id":1, "date":"2025-02-27", "foo": 3}
{"id":2, "date":"2025-02-27", "foo": 2}
{"id":3, "date":"2025-02-28", "foo": 5}
{"id":4, "date":"2025-02-28", "foo": 9}

And we want to select the entries with the largest foo for each date.

One way to do this in SQL looks like this:

select id, date, foo
from data
where (date, foo) in
      (select date, max(foo) as max_foo
       from data
       group by date);

Another way, by joining a derived table:

select *
from data d
       join
     (select date, max(foo) as max_foo
      from data
      group by date) max_foo
     on d.date = max_foo.date and
        d.foo = max_foo.max_foo;

In super this can be done with piped operators and a Lateral Subquery:

{"id":1, "date":"2025-02-27", "foo": 3}
{"id":2, "date":"2025-02-27", "foo": 2}
{"id":3, "date":"2025-02-28", "foo": 5}
{"id":4, "date":"2025-02-28", "foo": 9}

Here’s an example using the piped where operator with unnest ... into:

super -s -c '
  collect(this)
  | {data: this}
  | maxes:=[unnest this.data | foo:=max(foo) by date | values {date,foo}]
  | unnest {this.maxes, this.data} into (
      where {this.data.date, this.data.foo} in this.maxes
      | values this.data
    )' data.json

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

And a simpler example using the piped join operator with from:

super -s -c '
  from data.json
  | inner join (from data.json
                | foo:=max(foo) by date
                | values {date,foo})
    on {left.date,left.foo}={right.date,right.foo}
  | values left
  | sort id'

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

super also supports SQL syntax, and these subqueries work¹:

super -s -c '
  select *
  from "data.json"
  where foo in (select max(foo), date
                from "data.json"
                group by date) '

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

If we save off the max data to a file first, then we can start to see how this could look:

super -s -c '
  select max(foo) as max_foo
  from "data.json"
  group by date' > max.sup

super -s -c '
  select l.id, l.date, l.foo
  from "data.json" l
    join "max.sup" r
    on l.foo==r.max_foo'

{id:1,date:"2025-02-27",foo:3}
{id:4,date:"2025-02-28",foo:9}

A more realistic scenario: find the records with the top score per date, and also pull in user information from a related table.

{"id":1, "date":"2025-02-27", "score": 3, "user_id": 101}
{"id":2, "date":"2025-02-27", "score": 2, "user_id": 102}
{"id":3, "date":"2025-02-28", "score": 5, "user_id": 101}
{"id":4, "date":"2025-02-28", "score": 9, "user_id": 103}

{"user_id": 101, "name": "Moxie"}
{"user_id": 102, "name": "Ziggy"}
{"user_id": 103, "name": "Sprocket"}

First, the basic join returns all records with user names:

super -s -c '
  select s.id, s.date, s.score, s.user_id, u.name
  from "scores.json" s
    join "users.json" u on s.user_id = u.user_id
  order by s.id'

{id:1,date:"2025-02-27",score:3,user_id:101,name:"Moxie"}
{id:2,date:"2025-02-27",score:2,user_id:102,name:"Ziggy"}
{id:3,date:"2025-02-28",score:5,user_id:101,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,user_id:103,name:"Sprocket"}

Filtering to top scores per date using a subquery:

super -s -c '
  select *
  from "scores.json"
  where score in (select max(score), date
                  from "scores.json"
                  group by date)'

{id:1,date:"2025-02-27",score:3,user_id:101}
{id:4,date:"2025-02-28",score:9,user_id:103}

The “obvious” SQL approach with tuple comparison returns empty — this is a known issue (#6326):

super -s -c '
  select s.id, s.date, s.score, s.user_id, u.name
  from "scores.json" s
    join "users.json" u on s.user_id = u.user_id
  where (s.date, s.score) in (
    select date, max(score)
    from "scores.json"
    group by date)'

A derived table approach (subquery in FROM) does work:

super -s -c '
  select s.id, s.date, s.score, s.user_id, u.name
  from "scores.json" s
    join (
      select date, max(score) as max_score
      from "scores.json"
      group by date
    ) m on s.date = m.date and s.score = m.max_score
    join "users.json" u on s.user_id = u.user_id
  order by s.id'

{id:1,date:"2025-02-27",score:3,user_id:101,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,user_id:103,name:"Sprocket"}

The piped approach also works — filter first, then join to get usernames:

super -s -c '
  from "scores.json"
  | where score in (select max(score), date from "scores.json" group by date)
  | inner join (from "users.json") on left.user_id=right.user_id
  | select left.id, left.date, left.score, right.name
  | sort id'

{id:1,date:"2025-02-27",score:3,name:"Moxie"}
{id:4,date:"2025-02-28",score:9,name:"Sprocket"}

as of versions

super --version

Version: v0.2.0

SQL subqueries that reference files re-read the file for each subquery, which increases CPU usage and wall time compared to the piped approach. ↩

Subqueries

Correlated Subqueries

Subquery with Related Data Join

as of versions