Experimental Features

Experimental features

The following features are experimental, i.e. they might change significantly in the future. You are advised to only use them if you are prepared to deal with significant changes to these features in future versions of Query.jl. At the same time any feedback on these features would be especially welcome.

The @map, @filter, @groupby, @orderby (and various variants), @groupjoin, @join, @mapmany, @take and @drop commands can be used in standalone versions. Those standalone versions are especially convenient in combination with the pipe syntax in julia. Here is an example that demonstrates their use:

using Query, DataFrames, Statistics

df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])

df2 = df |>
    @groupby(_.a) |>
    @map({a=key(_), b=mean(_.b)}) |>
    @filter(_.b > 5) |>
    @orderby_descending(_.b) |>
    DataFrame

This example makes use of three experimental features: 1) the standalone query commands, 2) the . syntax and 3) the _ anonymous function syntax.

Standalone query operators

All standalone query commands can either take a source as their first argument, or one can pipe the source into the command, as in the above example. For example, one can either write

df = df |> @groupby(_.a)

or

df = @groupby(df, _.a)

both forms are equivalent.

The remaining arguments of each query demand are command specific.

The following discussion will present each command in the version that accepts a source as the first argument.

The @map command

The @map command has the form @map(source, element_selector). source can be any source that can be queried. element_selector must be an anonymous function that accepts one element of the element type of the source and applies some transformation to this single element.

The @filter command

The @filter command has the form @filter(source, filter_condition). source can be any source that can be queried. filter_condition must be an anonymous function that accepts one element of the element type of the source and returns true if that element should be retained, and false if that element should be filtered out.

The @groupby command

There are two versions of the @groupby command. The simple version has the form @groupby(source, key_selector). source can be any source that can be queried. key_selector must be an anonymous function that returns a value for each element of source by which the source elements should be grouped.

The second variant has the form @groupby(source, key_selector, element_selector). The definition of source and key_selector is the same as in the simple variant. element_selector must be an anonymous function that is applied to each element of the source before that element is placed into a group, i.e. this is a projection function.

The return value of @groupby is an iterable of groups. Each group is itself a collection of data rows, and has a key field that is equal to the value the rows were grouped by. Often the next step in the pipeline will be to use @map with a function that acts on each group, summarizing it in a new data row.

The @orderby, @orderby_descending, @thenby and @thenby_descending command

There are four commands that are used to sort data. Any sorting has to start with either a @orderby or @orderby_descending command. @thenby and @thenby_descending commands can only directly follow a previous sorting command. They specify how ties in the previous sorting condition are to be resolved.

The general sorting command form is @orderby(source, key_selector). source can be any source than can be queried. key_selector must be an anonymous function that returns a value for each element of source. The elements of the source are then sorted is ascending order by the value returned from the key_selector function. The @orderby_descending command works in the same way, but sorts things in descending order. The @thenby and @thenby_descending command only accept the return value of any of the four sorting commands as their source, otherwise they have the same syntax as the @orderby and @orderby_descending commands.

The @groupjoin command

The @groupjoin command has the form @groupjoin(outer, inner, outer_selector, inner_selector, result_selector). outer and inner can be any source that can be queried. outer_selector and inner_selector must be an anonymous function that extracts the value from the outer and inner source respectively on which the join should be run. The result_selector must be an anonymous function that takes two arguments, first the element from the outer source, and second an array of those elements from the second source that are grouped together.

The @join command

The @join command has the form @join(outer, inner, outer_selector, inner_selector, result_selector). outer and inner can be any source that can be queried. outer_selector and inner_selector must be an anonymous function that extracts the value from the outer and inner source respectively on which the join should be run. The result_selector must be an anonymous function that takes two arguments. It will be called for each element in the result set, and the first argument will hold the element from the outer source and the second argument will hold the element from the inner source.

The @mapmany command

The @mapmany command has the form @mapmany(source, collection_selector, result_selector). source can be any source that can be queried. collection_selector must be an anonymous function that takes one argument and returns a collection. result_selector must be an anonymous function that takes two arguments. It will be applied to each element of the intermediate collection.

The @take command

The @take command has the form @take(source, n). source can be any source that can be queried. n must be an integer, and it specifies how many elements from the beginning of the source should be kept.

The @drop command

The @drop command has the form @drop(source, n). source can be any source that can be queried. n must be an integer, and it specifies how many elements from the beginning of the source should be dropped from the results.

The _ and __ syntax

This syntax only works in the standalone query commands. Instead of writing a full anonymous function, for example @map(i->i.a), one can write @map(_.a), where _ stands for the current element, i.e. has the same role as the argument of the anonymous function.

If one uses both _ and __, Query will automatically create an anonymous function with two arguments. For example, the result selector in the @join command requires an anonymous function that takes two arguments. This can be written succinctly like this:

using DataFrames, Query

df_parents = DataFrame(Name=["John", "Sally"])
df_children = DataFrame(Name=["Bill", "Joe", "Mary"], Parent=["John", "John", "Sally"])

df_parents |> @join(df_children, _.Name, _.Parent, {Parent=_.Name, Child=__.Name}) |> DataFrame