Query.jl v0.7.x released

I just released Query.jl version v0.7.1. The v0.7.x series includes a number of smaller improvements and some major new experimental features. This post describes and explains all new features included in this release.

Enable {} everywhere

{} is the syntax for named tuples in Query.jl. In previous versions this syntax only worked in @select statements, and only at the top level (for example, you couldn’t create a named tuple with a field that is itself a named tuple). The new release enables the {} syntax everywhere in queries.

This is especially handy for @group statements. Take the following query as an example:

using DataFrames, Query

df = DataFrame(sex=[:male, :female, :female, :female], age=[21., 30., 45., 34.], children=[2,2,1,2])

@from i in df begin
    @group i.age by {i.sex, i.children} into g
    @select {g.key.sex, g.key.children, age=mean(g)}
    @collect DataFrame
end

Here we have data about four individuals: their sex, their age and how many children they have. We then group this data by sex and children, and compute the average age for each group. The expression by {i.sex, i.children} creates a named tuple as the grouping key, which is handy later on, because we can now access the individual fields of the group key by their name.
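To make the benefit of a named-tuple key concrete, here is a minimal sketch in plain Julia. It assumes Julia 1.x named tuples and the Statistics standard library, and it does not use Query.jl or reflect how @group is implemented. The names people, groups and result are made up for illustration.

```julia
using Statistics

# The same four individuals as above, as a vector of named tuples (illustrative only).
people = [
    (sex = :male,   age = 21.0, children = 2),
    (sex = :female, age = 30.0, children = 2),
    (sex = :female, age = 45.0, children = 1),
    (sex = :female, age = 34.0, children = 2),
]

# Collect ages under a named-tuple key, similar in spirit to `by {i.sex, i.children}`.
groups = Dict{NamedTuple{(:sex, :children),Tuple{Symbol,Int}},Vector{Float64}}()
for p in people
    key = (sex = p.sex, children = p.children)
    push!(get!(groups, key, Float64[]), p.age)
end

# The key's fields are accessible by name, as with g.key.sex in the query.
result = [(sex = k.sex, children = k.children, age = mean(v)) for (k, v) in groups]
```

With a plain tuple as the key, the last line would have to use positional access such as k[1] and k[2], which is harder to read.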

Experimental standalone commands

I also added a number of new standalone commands that enable a user experience inspired by both the method syntax of the LINQ query operators and dplyr. I am still tinkering with the precise details of this API, so things might change going forward (hence “experimental”).

Here is a code example that highlights some of the new features and how they play together with some of the other packages that I’ve created over the last year:

using Dataverse

df = load("http://www.david-anthoff.com/blog/data/exampledata.csv") |>
    @groupby(_.Children) |>
    @select({Children=_.key,Age=mean(_..Age)}) |>
    @orderby(_.Children) |>
    @tee(save("output.csv")) |>
    @where(_.Age>30) |>
    @tee(save("output.feather")) |>
    DataFrame

First, this uses the unregistered Dataverse.jl package (I’m still looking for a better name, ideas welcome!). That package pulls together a set of data packages that play nicely together and adds a number of small experimental features on top of Query.jl.

The query starts out by loading a CSV file from a URL. The load function is from the FileIO.jl package. You can use that function to load files from web addresses or local disk, and it works for CSV, Feather, Excel, Stata, SPSS and SAS files.

The next thing to note is that this query uses the pipe operator |> to build up a data processing pipeline. Essentially you pass your data through a series of manipulation commands and chain those together via the |> operator.
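As a reminder of how the pipe operator behaves in plain Julia, independent of Query.jl: x |> f is just another way to write f(x). The stage function double below is a hypothetical stand-in for a data manipulation command.

```julia
# A hypothetical pipeline stage: doubles every element.
double(v) = map(x -> 2x, v)

# Stages are applied left to right; this is the same as sum(double([1, 2, 3])).
total = [1, 2, 3] |> double |> sum
```

Each standalone query command in the example plays the role of one such stage.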

The @groupby macro call is the first new standalone query command introduced in this version of Query.jl. The argument to the @groupby macro is an anonymous function that selects the key by which the source data should be grouped. The normal Julia syntax for this would be @groupby(i->i.Children). In the example above I use another experimental syntax in the new Query.jl version: when you write an expression that contains the _ symbol, that expression is translated into _ -> your_original_expression. This syntax is just a shortcut for writing anonymous functions. When querying table-like sources, _ typically stands for the current row in query commands. The syntax _.Children here therefore extracts the value of the Children column for each row, and rows are grouped by that value.
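The effect of the _ shorthand can be mimicked in plain Julia by writing the anonymous function out by hand. The sketch below assumes Julia 1.x named tuples as rows and shows only the function that _.Children is described as standing for, not the macro machinery itself. The names rows and getchildren are hypothetical.

```julia
# Hypothetical rows standing in for a table-like source.
rows = [(Name = "a", Children = 2), (Name = "b", Children = 0)]

# The one-argument anonymous function that `_.Children` is shorthand for.
getchildren = row -> row.Children

# Applying it to every row yields the grouping key for each row.
keys_per_row = map(getchildren, rows)
```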

The next command is a @select macro call. It takes an anonymous function that projects each element from the source stream. The @groupby macro creates a stream of groups, so each element that the @select macro sees will be of type Grouping. A Grouping element has a field key that holds the value of the group key for that group. In this example I am accessing that value via the _.key expression in the @select call. Any Grouping element is at the same time an array of the elements that were grouped. In our example, the Grouping element is an array of the rows that make up a single group. Here I want to compute the average age for each group. The problem is that _ will be an array of rows, not an array of the values in the age column, and the mean function needs a vector of values, not a vector of rows. The standard Julia way to achieve this would be to use a generator expression like mean(i.Age for i in _). Because these kinds of summary operations for groups are so common in data analysis, I added another new experimental syntax to Query.jl in this version that simplifies this pattern. The syntax a..b is a shortcut for map(i->i.b,a), and that enables the concise syntax mean(_..Age) that is used in our example.
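The two spellings mentioned above can be compared in plain Julia. This is a sketch that assumes Julia 1.x and the Statistics standard library. The vector group is a hypothetical stand-in for a Grouping element, and the second form writes out the map call that a..b is described as expanding to.

```julia
using Statistics

# A hypothetical group: a vector of rows that share the same key.
group = [(Children = 2, Age = 30.0), (Children = 2, Age = 34.0)]

# Generator form from the text: mean(i.Age for i in _).
avg_generator = mean(i.Age for i in group)

# The expansion given for `group..Age`: map(i -> i.Age, group).
avg_map = mean(map(i -> i.Age, group))
```

Both forms give the same average. The generator avoids allocating an intermediate vector, while the map form is what the shortcut is said to produce.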

The next command @orderby is relatively simple: it takes an anonymous function that extracts the key by which things should be sorted from each source element. So in this example we are sorting the dataset by the Children column at this stage.

The @tee macro is another experimental feature. It mimics the tee shell command. It takes the thing that is piped into it, passes it to the argument of the @tee macro, and then passes the input unmodified to the next stage in the general pipeline. In the example I use this to save the intermediate dataset we have created so far into a CSV file, but then pass the data on to further data manipulation commands.
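The behaviour described here can be imitated with an ordinary higher-order function. The helper tee below is hypothetical and is not the @tee macro itself. It only illustrates the idea of running a side effect and handing the input on unchanged. The vector seen stands in for a file that save would write.

```julia
# Hypothetical helper: returns a function that calls `f` on its input
# for the side effect and then returns the input unchanged.
tee(f) = x -> (f(x); x)

seen = Any[]   # stands in for the file that `save` would write

# The sorted vector is recorded by the tee stage and then flows on to `reverse`.
result = [3, 1, 2] |> sort |> tee(x -> push!(seen, copy(x))) |> reverse
```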

The save function is once again from the FileIO.jl package. It currently supports saving tabular data as CSV and Feather files.

The @where command filters the source dataset: only elements for which the anonymous function that is passed to the @where command returns true are passed on to the next stage. I am using the _ syntax once again here to create the anonymous function.
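For comparison, the same kind of filtering on an in-memory vector of rows can be written with Julia's built-in filter function. This sketch assumes Julia 1.x named tuples as rows and does not use Query.jl. The names rows and kept are made up.

```julia
# Hypothetical per-group results, as produced by an earlier stage.
rows = [(Children = 1, Age = 45.0), (Children = 2, Age = 28.0)]

# Keep only the rows for which the predicate returns true,
# the plain-Julia counterpart of `_.Age > 30`.
kept = filter(r -> r.Age > 30, rows)
```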

The next @tee command just saves the now filtered dataset to a Feather file.

The final line of this query pipes the result into a DataFrame. Note that we never allocated or used any of the table types like DataFrame or DataTable in the query so far. We could have easily concluded the query with a save call and never materialized the data into a DataFrame if we just wanted to load a file, manipulate it and save it back to disk again. But if we intend to use the data for further in-memory manipulation, it is of course convenient to store it in a DataFrame (or any of the other supported table types, see the list in IterableTables.jl).

You can find a bit more information about these new experimental features in the documentation here.

I am really interested in feedback on these new features! They are obviously not complete at this point, but I hope you can get a general feel for the direction, and any comments on that and anything else would be most welcome. I track both the issues over at Query.jl and the discussion on the Julia forum.

One important final point: this new syntax will eventually augment the syntax we already have in Query.jl. The @from macro will not go away! In fact, one can easily combine the two styles, for example in the following way:

using Dataverse

df = load("http://www.david-anthoff.com/blog/data/exampledata.csv") |>
    @query(i, begin
        @group i by i.Children into g
        @select {Children=g.key, Age=mean(g..Age)}
    end) |>
    @where(_.Age>30) |>
    DataFrame

Various smaller changes

Query.jl now uses the iterable tables interface definition in TableTraits.jl. This change should be entirely transparent to users; it just amounts to some reorganization of packages in the background.

The release also includes a number of performance improvements and bug fixes.

Thanks

Thanks to Florian for help with the .. syntax, and to Keno and Steven for pointing me to the @tee name.

This post is being discussed here.

Written by David Anthoff