diff --git a/presto-docs/src/main/sphinx/functions.rst b/presto-docs/src/main/sphinx/functions.rst index 7ac3846d4ccd..512cafd8e42b 100644 --- a/presto-docs/src/main/sphinx/functions.rst +++ b/presto-docs/src/main/sphinx/functions.rst @@ -19,6 +19,7 @@ Functions and Operators functions/json functions/datetime functions/aggregate + functions/noisy functions/window functions/array functions/map diff --git a/presto-docs/src/main/sphinx/functions/aggregate.rst b/presto-docs/src/main/sphinx/functions/aggregate.rst index b7a2a39afb2d..1ae441291dc5 100644 --- a/presto-docs/src/main/sphinx/functions/aggregate.rst +++ b/presto-docs/src/main/sphinx/functions/aggregate.rst @@ -2,6 +2,14 @@ Aggregate Functions =================== +.. contents:: + :local: + :backlinks: none + :depth: 1 + +Overview +-------- + Aggregate functions operate on a set of values to compute a single result. Except for :func:`count`, :func:`count_if`, :func:`max_by`, :func:`min_by` and @@ -370,176 +378,6 @@ Approximate Aggregate Functions :func:`numeric_histogram` that takes a ``weight``, with a per-item weight of ``1``. In this case, the total weight in the returned map is the count of items in the bin. -.. function:: noisy_count_gaussian(x, noise_scale) -> bigint - - Counts the non-null values and then adds a random Gaussian noise - with 0 mean and standard deviation of ``noise_scale`` to the true count. - The noisy count is post-processed to be non-negative and rounded to bigint. - - When there are no input rows, this function returns ``NULL``. - - Noise is from a secure random. :: - - SELECT noisy_count_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_count_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_count_gaussian(x, noise_scale, random_seed) -> bigint - - Counts the non-null values and then adds a random Gaussian noise - with 0 mean and standard deviation of ``noise_scale`` to the true count. - The noisy count is post-processed to be non-negative and rounded to bigint. - - When there are no input rows, this function returns ``NULL``. - - Random seed is used to seed the random generator. - This method does not use a secure random. :: - - SELECT noisy_count_gaussian(orderkey, 20.0, 321) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_count_gaussian(orderkey, 20.0, 321) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_count_if_gaussian(x, noise_scale) -> bigint - - Counts the `TRUE` values and then adds a random Gaussian noise - with 0 mean and standard deviation of ``noise_scale`` to the true count. - The noisy count is post-processed to be non-negative and rounded to bigint. - - When there are no input rows, this function returns ``NULL``. - - Noise is from a secure random. :: - - SELECT noisy_count_if_gaussian(orderkey > 10, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_count_if_gaussian(orderkey > 10, 20.0) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_count_if_gaussian(x, noise_scale, random_seed) -> bigint - - Counts the `TRUE` values and then adds a random Gaussian noise - with 0 mean and standard deviation of ``noise_scale`` to the true count. - The noisy count is post-processed to be non-negative and rounded to bigint. - - When there are no input rows, this function returns ``NULL``. - - Random seed is used to seed the random generator. - This method does not use a secure random. :: - - SELECT noisy_count_if_gaussian(orderkey > 10, 20.0, 321) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_count_if_gaussian(orderkey > 10, 20.0, 321) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_sum_gaussian(x, noise_scale) -> double - - Calculates the sum over the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the sum, and the return type is double. - - When there are no input rows, this function returns ``NULL``. - - Noise is from a secure random. :: - - SELECT noisy_sum_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_sum_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_sum_gaussian(x, noise_scale, random_seed) -> double - - Calculates the sum over the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the sum, and the return type is double. - - When there are no input rows, this function returns ``NULL``. - - Random seed is used to seed the random generator. - This method does not use a secure random. :: - - SELECT noisy_sum_gaussian(orderkey, 20.0, 321) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_sum_gaussian(orderkey, 20.0, 321) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_sum_gaussian(x, noise_scale, lower, upper) -> double - - Calculates the sum over the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the sum, and the return type is double. - - Each value is clipped to the range of ``[lower, upper]`` before adding to the sum. - - When there are no input rows, this function returns ``NULL``. - - Noise is from a secure random. :: - - SELECT noisy_sum_gaussian(orderkey, 20.0, 10.0, 50.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_sum_gaussian(orderkey, 20.0, 10.0, 51.0) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_sum_gaussian(x, noise_scale, lower, upper, random_seed) -> double - - Calculates the sum over the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the sum, and the return type is double. - - Each value is clipped to the range of ``[lower, upper]`` before adding to the sum. - - When there are no input rows, this function returns ``NULL``. - - Random seed is used to seed the random generator. - This method does not use a secure random. :: - - SELECT noisy_sum_gaussian(orderkey, 20.0, 10.0, 50.0, 321) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_sum_gaussian(orderkey, 20.0, 10.0, 50.0, 321) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_avg_gaussian(x, noise_scale) -> double - - Calculates the average (arithmetic mean) of all the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the avg, and the return type is double. - - When there are no input rows, this function returns ``NULL``. - - Noise is from a secure random. :: - - SELECT noisy_avg_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_avg_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_avg_gaussian(x, noise_scale, random_seed) -> double - - Calculates the average (arithmetic mean) of all the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the avg, and the return type is double. - - When there are no input rows, this function returns ``NULL``. - - Random seed is used to seed the random generator. - This method does not use a secure random. :: - - SELECT noisy_avg_gaussian(orderkey, 20.0, 321) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_avg_gaussian(orderkey, 20.0, 321) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_avg_gaussian(x, noise_scale, lower, upper) -> double - - Calculates the average (arithmetic mean) of all the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the avg, and the return type is double. - - Each value is clipped to the range of ``[lower, upper]`` before adding to the avg. - - When there are no input rows, this function returns ``NULL``. - - Noise is from a secure random. :: - - SELECT noisy_avg_gaussian(orderkey, 20.0, 10.0, 50.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_avg_gaussian(orderkey, 20.0, 10.0, 51.0) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - -.. function:: noisy_avg_gaussian(x, noise_scale, lower, upper, random_seed) -> double - - Calculates the average (arithmetic mean) of all the input values and then adds random Gaussian noise - with 0 mean and standard deviation of ``noise_scale``. - All values are converted to double before being added to the avg, and the return type is double. - - Each value is clipped to the range of ``[lower, upper]`` before adding to the avg. - - When there are no input rows, this function returns ``NULL``. - - Random seed is used to seed the random generator. - This method does not use a secure random. :: - - SELECT noisy_avg_gaussian(orderkey, 20.0, 10.0, 50.0, 321) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) - SELECT noisy_avg_gaussian(orderkey, 20.0, 10.0, 50.0, 321) FROM tpch.tiny.lineitem WHERE false GROUP BY orderkey; -- (0 row) - Statistical Aggregate Functions ------------------------------- @@ -1108,7 +946,12 @@ equal probability of being chosen. See [Vitter1985]_. -----------------+---------------------------------- 5 | [{idx=1, val=a}, {idx=5, val=e}] - + +Noisy Aggregate Functions +------------------------- + +See :doc:`noisy`. + --------------------------- diff --git a/presto-docs/src/main/sphinx/functions/noisy.rst b/presto-docs/src/main/sphinx/functions/noisy.rst new file mode 100644 index 000000000000..e9e62253b8f8 --- /dev/null +++ b/presto-docs/src/main/sphinx/functions/noisy.rst @@ -0,0 +1,207 @@ +========================= +Noisy Aggregate Functions +========================= + +.. contents:: + :local: + :backlinks: none + :depth: 1 + +Overview +-------- + +Noisy aggregate functions are functions that provide random, noisy +approximations of common aggregations like :func:`sum`, :func:`count`, and +:func:`approx_distinct` as well as sketches like :func:`approx_set`. By +injecting random noise into results, noisy aggregation functions make it more +difficult to determine or confirm the exact data that was aggregated. + +While many of these functions resemble +`differential privacy `_ +mechanisms, neither the values returned by these functions nor the query results +that incorporate these functions are differentially private in general. +See `Limitations`_ below for more details. Users who wish to support a strong +privacy guarantee should discuss with a suitable technical expert first. + +Counts, Sums, and Averages +-------------------------- + +.. function:: noisy_count_gaussian(col, noise_scale[, random_seed]) -> bigint + + Counts the non-``NULL`` values in ``col`` and then adds a normally distributed + random double value with 0 mean and standard deviation of ``noise_scale`` to the true count. + The noisy count is post-processed to be non-negative and rounded to bigint. + + If provided, ``random_seed`` is used to seed the random number generator. Otherwise, + noise is drawn from a secure random. :: + + SELECT noisy_count_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem; -- 60179 (1 row) + SELECT noisy_count_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) + + .. note:: + + Unlike :func:`count`, this function returns ``NULL`` when the (true) count of ``col`` is 0. + + Distinct counting can be performed using ``noisy_count_gaussian(DISTINCT col, ...)``, or with + :func:`noisy_approx_distinct_sfm`. Generally speaking, :func:`noisy_count_gaussian` + returns more accurate results but at a larger computational cost. + +.. function:: noisy_count_if_gaussian(col, noise_scale[, random_seed]) -> bigint + + Counts the ``TRUE`` values in ``col`` and then adds a normally distributed random double + value with 0 mean and standard deviation of ``noise_scale`` to the true count. + The noisy count is post-processed to be non-negative and rounded to bigint. + + If provided, ``random_seed`` is used to seed the random number generator. Otherwise, + noise is drawn from a secure random. :: + + SELECT noisy_count_if_gaussian(orderkey > 10000, 20.0) FROM tpch.tiny.lineitem; -- 50180 (1 row) + SELECT noisy_count_if_gaussian(orderkey > 10000, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row) + + .. note:: + + Unlike :func:`count_if`, this function returns ``NULL`` when the (true) count is 0. + +.. function:: noisy_sum_gaussian(col, noise_scale, lower, upper[, random_seed]) -> double + + Calculates the sum over the input values in ``col`` and then adds a normally distributed + random double value with 0 mean and standard deviation of noise_scale. Each value is clipped to the range + of [``lower``, ``upper``] before adding to the sum. + + If provided, ``random_seed`` is used to seed the random number generator. Otherwise, + noise is drawn from a secure random. + +.. function:: noisy_sum_gaussian(col, noise_scale[, random_seed]) -> double + :noindex: + + Calculates the sum over the input values in ``col`` and then adds a normally + distributed random double value with 0 mean and standard deviation of ``noise_scale``. + + If provided, ``random_seed`` is used to seed the random number generator. Otherwise, + noise is drawn from a secure random. + +.. function:: noisy_avg_gaussian(col, noise_scale, lower, upper[, random_seed]) -> double + + Calculates the average (arithmetic mean) of all the input values in ``col`` and then + adds a normally distributed random double value with 0 mean and standard deviation of ``noise_scale``. + Each value is clipped to the range of [``lower``, ``upper``] before averaging. + + If provided, ``random_seed`` is used to seed the random number generator. Otherwise, + noise is drawn from a secure random. + +.. function:: noisy_avg_gaussian(col, noise_scale[, random_seed]) -> double + :noindex: + + Calculates the average (arithmetic mean) of all the input values in ``col`` and then adds + a normally distributed random double value with 0 mean and standard deviation of ``noise_scale``. + + If provided, ``random_seed`` is used to seed the random number generator. Otherwise, + noise is drawn from a secure random. + + +Approximate Distinct Counting/Sketching +--------------------------------------- + +Noisy approximate distinct counting and sketching (analogous to the deterministic :doc:`hyperloglog`) +is supported via the Sketch-Flip-Merge (SFM) data sketch [Hehir2023]_. + +.. function:: noisy_approx_set_sfm(col, epsilon[, buckets[, precision]]) -> SfmSketch + + Returns an SFM sketch of the input values in ``col``. This is analogous to the + :func:`approx_set` function, which returns a (deterministic) HyperLogLog sketch. + + - ``col`` supports many types, similar to ``HyperLogLog``. + - ``epsilon`` (double) is a positive number that controls the level of noise in + the sketch, as described in [Hehir2023]_. Smaller values of epsilon correspond + to noisier sketches. + - ``buckets`` (int) defaults to 4096. + - ``precision`` (int) defaults to 24. + + + .. note:: + + Unlike :func:`approx_set`, this function returns ``NULL`` when ``col`` is empty. + If this behavior is undesirable, use :func:`coalesce` with :func:`noisy_empty_approx_set_sfm`. + +.. function:: noisy_approx_distinct_sfm(col, epsilon[, buckets[, precision]]) -> bigint + + Equivalent to ``cardinality(noisy_approx_set_sfm(col, epsilon, buckets, precision))``, + this returns the approximate cardinality (distinct count) of the column ``col``. + This is analogous to the (deterministic) :func:`approx_distinct` function. + + .. note:: + + Unlike :func:`approx_distinct`, this function returns ``NULL`` when ``col`` is empty. + +.. function:: noisy_empty_approx_set_sfm(epsilon[, buckets[, precision]]) -> SfmSketch + + Returns an SFM sketch with no items in it. This is analogous to the :func:`empty_approx_set` + function, which returns an empty (deterministic) HyperLogLog sketch. + + - ``epsilon`` (double) is a positive number that controls the level of noise in the sketch, + as described in [Hehir2023]_. Smaller values of epsilon correspond to noisier sketches. + - ``buckets`` (int) defaults to 4096. + - ``precision`` (int) defaults to 24. + +.. function:: cardinality(SfmSketch) -> bigint + + Returns the estimated cardinality (distinct count) of an ``SfmSketch`` object. + +.. function:: merge(SfmSketch) -> SfmSketch + + An aggregator function that returns a merged ``SfmSketch`` of the set union of + individual ``SfmSketch`` objects, similar to ``merge(HyperLogLog)``. :: + + SELECT year, cardinality(merge(sketch)) AS annual_distinct_count + FROM monthly_sketches + GROUP BY 1 + +.. function:: merge_sfm(ARRAY[SfmSketch, ...]) -> SfmSketch + + A scalar function that returns a merged ``SfmSketch`` of the set union of an array + of ``SfmSketch`` objects, similar to :func:`merge_hll`. :: + + SELECT cardinality(merge_sfm(ARRAY[ + noisy_approx_set_sfm(col_1, 5.0), + noisy_approx_set_sfm(col_2, 5.0), + noisy_approx_set_sfm(col_3, 5.0) + ])) AS distinct_count_over_3_cols + FROM my_table + + +Limitations +----------- + +While these functions resemble differential privacy mechanisms, the values returned +by these functions are not differentially private in general. There are several +important limitations to keep in mind if using these functions for +privacy-preserving purposes, including: + +- All noisy aggregate functions return ``NULL`` when aggregating empty sets. + This means a ``NULL`` return value noiselessly indicates the absence of data. +- ``GROUP BY`` clauses used in combination with noisy aggregation functions + reveal non-noisy information: the presence or absence of a group noiselessly + indicates the presence or absence of data. See, e.g., [Wilkins2024]_. +- Functions relying on floating-point noise may be susceptible to inference + attacks such as those identified in [Mironov2012]_ and [Casacuberta2022]_. + +--------------------------- + +.. [Casacuberta2022] Casacuberta, S., Shoemate, M., Vadhan, S., & Wagaman, C. + (2022). `Widespread Underestimation of Sensitivity in Differentially Private + Libraries and How to Fix It. `_ In *Proceedings + of the 2022 ACM SIGSAC Conference on Computer and Communications Security* (pp. 471-484). + +.. [Hehir2023] Hehir, J., Ting, D., & Cormode, G. (2023). `Sketch-Flip-Merge: + Mergeable Sketches for Private Distinct Counting. + `_ In *Proceedings of + the 40th International Conference on Machine Learning* (Vol. 202). + +.. [Mironov2012] Mironov, I. (2012). `On significance of the least significant bits + for differential privacy. `_ + In *Proceedings of the 2012 ACM Conference on Computer and Communications Security* (pp. 650-661). + +.. [Wilkins2024] Wilkins, A., Kifer, D., Zhang, D., & Karrer, B. (2024). `Exact + Privacy Analysis of the Gaussian Sparse Histogram Mechanism. + `_ + *Journal of Privacy and Confidentiality*, 14 (1). diff --git a/presto-docs/src/main/sphinx/language/types.rst b/presto-docs/src/main/sphinx/language/types.rst index 90de0f4b7a25..4fb056e2c78a 100644 --- a/presto-docs/src/main/sphinx/language/types.rst +++ b/presto-docs/src/main/sphinx/language/types.rst @@ -354,6 +354,17 @@ The similarity of any two sets is estimated by comparing their signatures. SetDigests are additive, meaning they can be merged together. +SFM Sketch +----------- + +.. _sfmsketch_type: + +``SfmSketch`` +^^^^^^^^^^^^^ + +The Sketch-Flip-Merge (SFM) data sketch is a noisy, random distinct-counting +sketch similar to :ref:`hyperloglog_type`. See :func:`noisy_approx_set_sfm`. + Quantile Digest ---------------