LINQ Style Query Commands

Sorting

The @orderby statement sorts the elements from a source by one or more element attributes. The syntax for the @orderby statement is @orderby <attribute>[, <attribute>]. <attribute> can be any julia expression that returns an attribute by which the source elements should be sorted. The default sort order is ascending. By wrapping an <attribute> in a call to descending(<attribute) one can reverse the sort order. The @orderby statement accepts multiple <attribute>s separated by ,s. With multiple sorting attributes, the elements are first sorted by the first attribute. Elements that can't be ranked by the first attribute are then sorted by the second attribute etc.

Example

using Query, DataFrames

df = DataFrame(a=[2,1,1,2,1,3],b=[2,2,1,1,3,2])

x = @from i in df begin
    @orderby descending(i.a), i.b
    @select i
    @collect DataFrame
end

println(x)

# output

6×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 2     │
│ 2   │ 2     │ 1     │
│ 3   │ 2     │ 2     │
│ 4   │ 1     │ 1     │
│ 5   │ 1     │ 2     │
│ 6   │ 1     │ 3     │

Filtering

The @where statement filters a source so that only those elements are returned that satisfy a filter condition. The syntax for the @where statement is @where <condition>. <condition> can be any arbitrary julia expression that evaluates to true or false.

Example

using Query, DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age > 30. && i.children > 2
    @select i
    @collect DataFrame
end

println(x)

# output

1×3 DataFrame
│ Row │ name   │ age     │ children │
│     │ String │ Float64 │ Int64    │
├─────┼────────┼─────────┼──────────┤
│ 1   │ Sally  │ 42.0    │ 5        │

Projecting

The @select statement applies a transformation to each element of the source. The syntax for the @select statement is @select <condition>. <condition> can be any arbitrary julia expression that transforms an element from the source into the desired target format.

Example

The following example transforms each element from the source by squaring it.

using Query

data = [1,2,3]

x = @from i in data begin
    @select i^2
    @collect
end

println(x)

# output

[1, 4, 9]

One of the most common patterns in Query is to transform elements into named tuples with a @select statement. There are two ways to create a named tuples in Query: a) using the standard syntax from julia for named tuples, or b) a special syntax that only works inside Query.jl macros. This special syntax is based on curly brackets {}. An example that highlights all options of this syntax is this:

using Query, DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @select {i.name, Age=i.age}
    @collect DataFrame
end

println(x)

# output

3×2 DataFrame
│ Row │ name   │ Age     │
│     │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1   │ John   │ 23.0    │
│ 2   │ Sally  │ 42.0    │
│ 3   │ Kirk   │ 59.0    │

The elements of the new named tuple are separated by commas ,. One can specify an explicit name for an individual element of a named tuple using the = syntax, where the name of the element is specified as the left argument and the value as the right argument. If the name of the element should be the same as the variable that is passed for the value, one doesn't have to specify a name explicitly, instead the {} syntax automatically infers the name.

Flattening

One can project child elements from the elements of a source by using multiple @from statements. The nested child elements are flattened into one stream of results when multiple @from statements are used. The syntax for any additional @from statement (apart from the initial one that starts a query) is @from <range variable> in <selector>. <range variable> is the name of the range variable to be used for the child elements, and <selector> is a julia expression that returns the child elements.

Example

using DataFrames, Query

source = Dict(:a=>[1,2,3], :b=>[4,5])

q = @from i in source begin
    @from j in i.second
    @select {Key=i.first,Value=j}
    @collect DataFrame
end

println(q)

# output

5×2 DataFrame
│ Row │ Key    │ Value │
│     │ Symbol │ Int64 │
├─────┼────────┼───────┤
│ 1   │ a      │ 1     │
│ 2   │ a      │ 2     │
│ 3   │ a      │ 3     │
│ 4   │ b      │ 4     │
│ 5   │ b      │ 5     │

Joining

The @join statement combines data from two different sources. There are two variants of the statement: an inner join and a group join. The @left_outer_join statement provides a traditional left outer join option.

Inner join

The syntax for an inner join is @join <range variable> in <source> on <left key> equals <right key>. <range variable> is the name of the variable that should reference elements from the right source in the join. <source> is the name of the right source in the join operation. <left key> and <right key> are julia expressions that extract a value from the elements of the left and right source; the statement will then join on equality of these extracted values.

Example

using DataFrames, Query

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,2], d=["John", "Jim","Sally"])

x = @from i in df1 begin
    @join j in df2 on i.a equals j.c
    @select {i.a,i.b,j.c,j.d}
    @collect DataFrame
end

println(x)

# output

2×4 DataFrame
│ Row │ a     │ b       │ c     │ d      │
│     │ Int64 │ Float64 │ Int64 │ String │
├─────┼───────┼─────────┼───────┼────────┤
│ 1   │ 2     │ 2.0     │ 2     │ John   │
│ 2   │ 2     │ 2.0     │ 2     │ Sally  │

Group join

The syntax for a group join is @join <range variable> in <source> on <left key> equals <right key> into <group variable>. <range variable> is the name of the variable that should reference elements from the right source in the join. <source> is the name of the right source in the join operation. <left key> and <right key> are julia expressions that extract a value from the elements of the left and right source; the statement will then join on equality of these extracted values. <group variable> is the name of the variable that will hold all the elements from the right source that are joined to a given element from the left source.

Example

using DataFrames, Query

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,2], d=["John", "Jim","Sally"])

x = @from i in df1 begin
    @join j in df2 on i.a equals j.c into k
    @select {t1=i.a,t2=length(k)}
    @collect DataFrame
end

println(x)

# output

3×2 DataFrame
│ Row │ t1    │ t2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 0     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 0     │

Left outer join

They syntax for a left outer join is @left_outer_join <range variable> in <source> on <left key> equals <right key>. <range variable> is the name of the variable that should reference elements from the right source in the join. <source> is the name of the right source in the join operation. <left key> and <right key> are julia expressions that extract a value from the elements of the left and right source; the statement will then join on equality of these extracted values. For elements in the left source that don't have any corresponding element in the right source, <range variable> is assigned the default value returned by the default_if_empty function based on the element types of <source>. If the right source has elements of type NamedTuple, and the fields of that named tuple are all of type DataValue, then an instance of that named tuple with all fields having NA values will be used.

Example

using Query, DataFrames

source_df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
source_df2 = DataFrame(c=[2,4,2], d=["John", "Jim","Sally"])

q = @from i in source_df1 begin
    @left_outer_join j in source_df2 on i.a equals j.c
    @select {i.a,i.b,j.c,j.d}
    @collect DataFrame
end

println(q)

# output

4×4 DataFrame
│ Row │ a     │ b       │ c       │ d       │
│     │ Int64 │ Float64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┼─────────┤
│ 1   │ 1     │ 1.0     │ missing │ missing │
│ 2   │ 2     │ 2.0     │ 2       │ John    │
│ 3   │ 2     │ 2.0     │ 2       │ Sally   │
│ 4   │ 3     │ 3.0     │ missing │ missing │

Grouping

The @group statement groups elements from the source by some attribute. The syntax for the group statement is @group <element selector> by <key selector> [into <range variable>]. <element selector> is an arbitrary julia expression that determines the content of the group elements. <key selector> is an arbitrary julia expression that returns the values by which the elements are grouped. A @group statement without an into clause ends a query statement, i.e. no further @select statement is needed. When a @group statement has an into clause, the <range variable> sets the name of the range variable for the groups, and further query statements can operate on these groups by referencing that range variable.

Example

This is an example of a @group statement without a into clause:

using DataFrames, Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

x = @from i in df begin
    @group i.name by i.children
    @collect
end

println(x)

# output

Grouping{Int64,String}[["John"], ["Sally", "Kirk"]]

This is an example of a @group statement with an into clause:

using DataFrames, Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

x = @from i in df begin
    @group i by i.children into g
    @select {Key=key(g),Count=length(g)}
    @collect DataFrame
end

println(x)

# output

2×2 DataFrame
│ Row │ Key   │ Count │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 1     │
│ 2   │ 2     │ 2     │

Split-Apply-Combine (a.k.a. dplyr)

Query.jl provides special syntax to summarize data in a Query.Grouping as above. Summarizing here is synonymous to aggregating or collapsing the dataset over a certain grouping variable. Summarizing thus requires an aggregating function like mean, maximum, or any other function that takes a vector and returns a scalar. The special syntax is @select new_var = agg_fun(g.var), where agg_fun is your aggregation function (e.g. mean), g is your grouping, and var is the relevant column that you want to summarize.

Example

using Query, DataFrames, Statistics

df = DataFrame(name=repeat(["John", "Sally", "Kirk"],inner=[1],outer=[2]), 
     age=vcat([10., 20., 30.],[10., 20., 30.].+3), 
     children=repeat([3,2,2],inner=[1],outer=[2]),state=[:a,:a,:a,:b,:b,:b])

x = @from i in df begin
    @group i by i.state into g
    @select {group=key(g),mage=mean(g.age), oldest=maximum(g.age), youngest=minimum(g.age)}
    @collect DataFrame
end

println(x)

# output

2×4 DataFrame
│ Row │ group  │ mage    │ oldest  │ youngest │
│     │ Symbol │ Float64 │ Float64 │ Float64  │
├─────┼────────┼─────────┼─────────┼──────────┤
│ 1   │ a      │ 20.0    │ 30.0    │ 10.0     │
│ 2   │ b      │ 23.0    │ 33.0    │ 13.0     │

Range variables

The @let statement introduces new range variables in a query expression. The syntax for the range statement is @let <range variable> = <value selector>. <range variable> specifies the name of the new range variable and <value selector> is any julia expression that returns the value that should be assigned to the new range variable.

Example

using DataFrames, Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

x = @from i in df begin
    @let count = length(i.name)
    @let kids_per_year = i.children / i.age
    @where count > 4
    @select {Name=i.name, Count=count, KidsPerYear=kids_per_year}
    @collect DataFrame
end

println(x)

# output

1×3 DataFrame
│ Row │ Name   │ Count │ KidsPerYear │
│     │ String │ Int64 │ Float64     │
├─────┼────────┼───────┼─────────────┤
│ 1   │ Sally  │ 5     │ 0.047619    │