Except for count, count_if, max_by, min_by, and approx_distinct, all of these aggregate functions ignore null values and return null for no input rows or when all values are null. For example, sum returns null rather than zero, and avg does not include null values in the count. The coalesce function can be used to convert null into zero.
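As a sketch of that pattern (the orders table and price column here are hypothetical):

```sql
-- sum() yields NULL when there are no rows or all values are NULL;
-- coalesce() converts that NULL into zero.
SELECT coalesce(sum(price), 0) AS total_price
FROM orders;
```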
Some aggregate functions, such as array_agg, produce different results depending on the order of input values. This ordering can be specified by writing an ORDER BY clause within the aggregate function:
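For example, a sketch with a hypothetical users table:

```sql
-- The array is built in descending signup_date order.
SELECT array_agg(name ORDER BY signup_date DESC) AS names
FROM users;
```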
The FILTER keyword can be used to remove rows from aggregation processing with a condition expressed using a WHERE clause. This is evaluated for each row before it is used in the aggregation and is supported for all aggregate functions.
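A sketch of the syntax, with a hypothetical payments table:

```sql
-- Only rows with a positive amount enter the sum; count(*) still
-- sees every row because it has no FILTER clause.
SELECT sum(amount) FILTER (WHERE amount > 0) AS positive_total,
       count(*) AS all_rows
FROM payments;
```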
A common and very useful pattern is to use FILTER to remove nulls from consideration when using array_agg:
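A sketch of that pattern (table and column names are illustrative):

```sql
-- NULL names are excluded from the array, while the rows themselves
-- remain available to any other aggregate in the same query.
SELECT array_agg(name) FILTER (WHERE name IS NOT NULL) AS names
FROM people;
```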
Note that if you instead remove those rows with a WHERE clause in the query, you lose information:
Returns an arbitrary non-null value of x, if one exists. x can be any valid expression. This allows you to return values from columns that are not directly part of the aggregation, including expressions using these columns, in a query.
For example, the following query returns the customer name from the name column, and returns the sum of all total prices as customer spend. The aggregation however uses the rows grouped by the customer identifier custkey, as required, since only that column is guaranteed to be unique:
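A sketch of such a query, assuming the TPC-H sample catalog is available; any_value is interchangeable with arbitrary here:

```sql
SELECT sum(o.totalprice) AS spend,
       any_value(c.name) AS name
FROM tpch.tiny.orders o
JOIN tpch.tiny.customer c ON o.custkey = c.custkey
GROUP BY c.custkey
ORDER BY spend;
```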
Returns an arbitrary non-null value of x, if one exists. Identical to any_value.
Returns an array created from the input x elements.
Returns TRUE if every input value is TRUE, otherwise FALSE.
Returns TRUE if any input value is TRUE, otherwise FALSE.
Counts the number of TRUE input values. This function is equivalent to count(CASE WHEN x THEN 1 END).
This is an alias for bool_and.
Returns the concatenated input values, separated by the separator string.
Synopsis:
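A sketch of the synopsis, following the standard LISTAGG syntax (square brackets mark optional parts):

```sql
LISTAGG( expression [, separator] [ON OVERFLOW overflow_behavior] )
    WITHIN GROUP ( ORDER BY sort_item [, ...] )
```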
If separator is not specified, the empty string is used as the separator.
In its simplest form the function looks like:
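For instance, with inline values:

```sql
SELECT listagg(value, ',') WITHIN GROUP (ORDER BY value) AS csv_value
FROM (VALUES 'a', 'c', 'b') t(value);
-- csv_value: 'a,b,c'
```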
The default overflow behaviour is to throw an error if the length of the output of the function exceeds 1048576 bytes:
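A sketch spelling that default out explicitly:

```sql
-- ON OVERFLOW ERROR makes the query fail once the output
-- exceeds the size limit.
SELECT listagg(value, ',' ON OVERFLOW ERROR)
       WITHIN GROUP (ORDER BY value) AS csv_value
FROM (VALUES 'a', 'b', 'c') t(value);
```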
It is also possible to truncate the output, WITH COUNT or WITHOUT COUNT of the omitted non-null values, if the length of the output of the function exceeds 1048576 bytes:
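A sketch of the truncating variant:

```sql
-- On overflow, the output is truncated, the filler string is
-- appended, and the count of omitted non-null values follows,
-- e.g. '...(2)'.
SELECT listagg(value, ',' ON OVERFLOW TRUNCATE '...' WITH COUNT)
       WITHIN GROUP (ORDER BY value) AS csv_value
FROM (VALUES 'a', 'b', 'c') t(value);
```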
If not specified, the truncation filler string defaults to '...'.
This aggregation function can also be used in a scenario involving grouping:
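For example, grouping inline values by an id column:

```sql
SELECT id,
       listagg(value, ',') WITHIN GROUP (ORDER BY value) AS csv_value
FROM (VALUES (100, 'a'), (100, 'b'), (200, 'c')) t(id, value)
GROUP BY id
ORDER BY id;
-- 100 -> 'a,b', 200 -> 'c'
```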
The current implementation of the listagg function does not support window frames.
Returns the n largest values of all input values of x.
Returns the value of x associated with the maximum value of y over all input values.
Returns n values of x associated with the n largest of all input values of y, in descending order of y.
Returns the n smallest values of all input values of x.
Returns the value of x associated with the minimum value of y over all input values.
Returns n values of x associated with the n smallest of all input values of y, in ascending order of y.
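To illustrate the two-argument forms, a sketch with inline data:

```sql
-- Return the name paired with the highest and the lowest score.
SELECT max_by(name, score) AS top_name,
       min_by(name, score) AS bottom_name
FROM (VALUES ('ann', 10), ('bob', 30), ('cat', 20)) t(name, score);
-- top_name: 'bob', bottom_name: 'ann'
```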
Returns a map created from the input key / value pairs.
You can combine multiple maps into one using map_union:
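A minimal sketch with inline maps:

```sql
-- Merges the per-row maps into a single map; when a key occurs in
-- several input maps, the retained value comes from an arbitrary one.
SELECT map_union(m) AS merged
FROM (VALUES map(ARRAY['a'], ARRAY[1]),
             map(ARRAY['b'], ARRAY[2])) t(m);
-- merged: {a=1, b=2}
```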
Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values.
Returns the approximate number of distinct input values. This function provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.

This function should produce a standard error of 2.3%, which is the standard deviation of the (approximately normal) error distribution over all possible sets. It does not guarantee an upper bound on the error for any specific input set.
Returns the approximate number of distinct input values. This function provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.

This function should produce a standard error of no more than e, which is the standard deviation of the (approximately normal) error distribution over all possible sets. It does not guarantee an upper bound on the error for any specific input set. The current implementation of this function requires that e be in the range of [0.0040625, 0.26000].
Computes the top frequent values, up to buckets elements, approximately. Approximate estimation lets the function identify the frequent values using less memory. A larger capacity improves the accuracy of the underlying algorithm at the cost of additional memory. The returned value is a map containing the top elements with their corresponding estimated frequencies.

The error of the function depends on the order of the values and their cardinality. Setting the capacity equal to the cardinality of the underlying data achieves the least error.
buckets and capacity must be bigint. value can be of numeric or string type.
The function uses the stream summary data structure proposed in the paper Efficient Computation of Frequent and Top-k Elements in Data Streams by A. Metwally, D. Agrawal, and A. El Abbadi.
Returns the approximate percentile for all input values of x at the given percentage. The value of percentage must be between zero and one and must be constant for all input rows.
Returns the approximate percentile for all input values of x at each of the specified percentages. Each element of the percentages array must be between zero and one, and the array must be constant for all input rows.
Returns the approximate weighted percentile for all input values of x using the per-item weight w at the percentage percentage. Weights must be greater than or equal to 1. Integer-value weights can be thought of as a replication count for the value x in the percentile set. The value of percentage must be between zero and one and must be constant for all input rows.
Returns the approximate weighted percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array. Weights must be greater than or equal to 1. Integer-value weights can be thought of as a replication count for the value x in the percentile set. Each element of the percentages array must be between zero and one, and the array must be constant for all input rows.
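For instance, a sketch with a hypothetical requests table and latency_ms column:

```sql
-- Median as a single value, and median plus 95th percentile
-- together as an array.
SELECT approx_percentile(latency_ms, 0.5) AS p50,
       approx_percentile(latency_ms, ARRAY[0.5, 0.95]) AS p50_p95
FROM requests;
```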
Computes an approximate histogram with up to buckets number of buckets for all values. This function is equivalent to the variant of numeric_histogram that takes a weight, with a per-item weight of 1.
Computes an approximate histogram with up to buckets number of buckets for all values with a per-item weight of weight. The algorithm is based loosely on the streaming histogram described by Ben-Haim and Tom-Tov in A Streaming Parallel Decision Tree Algorithm.
buckets must be a bigint. value and weight must be numeric.
Returns linear regression intercept of input values. y is the dependent value. x is the independent value.
Returns linear regression slope of input values. y is the dependent value. x is the independent value.
This is an alias for stddev_samp.
This is an alias for var_samp.
Reduces all input values into a single value. inputFunction will be invoked for each non-null input value. In addition to taking the input value, inputFunction takes the current state, initially initialState, and returns the new state. combineFunction will be invoked to combine two states into a new state. The final state is returned:
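For example, summing values per group with two addition lambdas (inline data for illustration):

```sql
SELECT id,
       reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b) AS total
FROM (VALUES (1, 3), (1, 4), (2, 6), (2, 30)) t(id, value)
GROUP BY id;
-- id 1 -> 7, id 2 -> 36
```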