count
, count_if
, max_by
, min_by
and
approx_distinct
, all of these aggregate functions ignore null values
and return null for no input rows or when all values are null. For example,
sum
returns null rather than zero and avg
does not include null
values in the count. The coalesce
function can be used to convert null into
zero.
Ordering during aggregation
Some aggregate functions such asarray_agg
produce different results
depending on the order of input values. This ordering can be specified by writing
an order by clause within the aggregate function:
Filtering during aggregation
TheFILTER
keyword can be used to remove rows from aggregation processing
with a condition expressed using a WHERE
clause. This is evaluated for each
row before it is used in the aggregation and is supported for all aggregate
functions.
FILTER
to remove nulls from
consideration when using array_agg
WHERE
statement you lose information:
General aggregate functions
any_value
x
, if one exists. x
can be any
valid expression. This allows you to return values from columns that are not
directly part of the aggregation, inluding expressions using these columns,
in a query.
For example, the following query returns the customer name from the name
column, and returns the sum of all total prices as customer spend. The
aggregation however uses the rows grouped by the customer identifier
custkey
a required, since only that column is guaranteed to be unique:
arbitrary
x
, if one exists. Identical to any_value.
array_agg
x
elements.
avg
bool_and
TRUE
if every input value is TRUE
, otherwise FALSE
.
bool_or
TRUE
if any input value is TRUE
, otherwise FALSE
.
checksum
count
count_if
TRUE
input values.
This function is equivalent to count(CASE WHEN x THEN 1 END)
.
every
bool_and
.
geometric_mean
listagg
separator
string.
Synopsis:
separator
is not specified, the empty string will be used as separator
.
In its simplest form the function looks like:
1048576
bytes:
WITH COUNT
or WITHOUT COUNT
of omitted non-null values in case that the length of the output of the
function exceeds 1048576
bytes:
'...'
.
This aggregation function can be also used in a scenario involving grouping:
listagg
function does not support window frames.
max
n
largest values of all input values of x
.
max_by
x
associated with the maximum value of y
over all input values.
n
values of x
associated with the n
largest of all input values of y
in descending order of y
.
min
n
smallest values of all input values of x
.
min_by
x
associated with the minimum value of y
over all input values.
n
values of x
associated with the n
smallest of all input values of y
in ascending order of y
.
sum
Bitwise aggregate functions
bitwise_and_agg
bitwise_or_agg
Map aggregate functions
histogram
map_agg
key
/ value
pairs.
map_union
map_union
:
multimap_agg
key
/ value
pairs.
Each key can be associated with multiple values.
Approximate aggregate functions
approx_distinct
count(DISTINCT x)
.
Zero is returned if all input values are null.
This function should produce a standard error of 2.3%, which is the
standard deviation of the (approximately normal) error distribution over
all possible sets. It does not guarantee an upper bound on the error for
any specific input set.
count(DISTINCT x)
.
Zero is returned if all input values are null.
This function should produce a standard error of no more than e
, which
is the standard deviation of the (approximately normal) error distribution
over all possible sets. It does not guarantee an upper bound on the error
for any specific input set. The current implementation of this function
requires that e
be in the range of [0.0040625, 0.26000]
.
approx_most_frequent
buckets
elements approximately.
Approximate estimation of the function enables us to pick up the frequent
values with less memory. Larger capacity
improves the accuracy of
underlying algorithm with sacrificing the memory capacity. The returned
value is a map containing the top elements with corresponding estimated
frequency.
The error of the function depends on the permutation of the values and its
cardinality. We can set the capacity same as the cardinality of the
underlying data to achieve the least error.
buckets
and capacity
must be bigint
. value
can be numeric
or string type.
The function uses the stream summary data structure proposed in the paper
Efficient Computation of Frequent and Top-k Elements in Data Streams
by A. Metwalley, D. Agrawl and A. Abbadi.
approx_percentile
x
at the
given percentage
. The value of percentage
must be between zero and
one and must be constant for all input rows.
x
at each of
the specified percentages. Each element of the percentages
array must be
between zero and one, and the array must be constant for all input rows.
x
using the per-item weight w
at the percentage percentage
. Weights must be
greater or equal to 1. Integer-value weights can be thought of as a replication
count for the value x
in the percentile set. The value of percentage
must be
between zero and one and must be constant for all input rows.
x
using the per-item weight w
at each of the given percentages specified
in the array. Weights must be greater or equal to 1. Integer-value weights can
be thought of as a replication count for the value x
in the percentile
set. Each element of the percentages
array must be between zero and one, and the array
must be constant for all input rows.
approx_set
merge
numeric_histogram
buckets
number of buckets
for all value
s. This function is equivalent to the variant of
numeric_histogram
that takes a weight
, with a per-item weight of 1
.
buckets
number of buckets
for all value
s with a per-item weight of weight
. The algorithm
is based loosely on:
buckets
must be a bigint
. value
and weight
must be numeric.
qdigest_agg
tdigest_agg
Statistical aggregate functions
corr
covar_pop
covar_samp
kurtosis
regr_intercept
y
is the dependent
value. x
is the independent value.
regr_slope
y
is the dependent
value. x
is the independent value.
skewness
stddev
stddev_samp
.
stddev_pop
stddev_samp
variance
var_samp
.
var_pop
var_samp
Lambda aggregate functions
reduce_agg
inputFunction
will be invoked
for each non-null input value. In addition to taking the input value, inputFunction
takes the current state, initially initialState
, and returns the new state.
combineFunction
will be invoked to combine two states into a new state.
The final state is returned: