龙空技术网

PG中的查询:2.统计

帅的数说 72

前言:

此时姐妹们对“排列组合cn1怎么算”可能比较讲究,看官们都需要了解一些“排列组合cn1怎么算”的相关资讯。那么小编在网摘上汇集了一些有关“排列组合cn1怎么算””的相关文章,希望各位老铁们能喜欢,兄弟们快快来学习一下吧!

PG中的查询:2.统计

本节讨论成本优化器的基础:统计。通过示例进行讲解。这里会由很多执行计划,后续会更加详细讨论这些计划如何运行。现在只需要注意每个计划的第一行看到的数字以及行数。这些是行数估计值。

基本统计

pg_class系统表存储着基本关系级别的统计信息。统计信息包括:

1) 关系的行数reltuples

2) 关系大小,以页为单位relpages

3) 关系visibility map中被标记的页的页数relallvisible

SELECT reltuples, relpages, relallvisibleFROM pg_class WHERE relname = 'flights';reltuples | relpages | relallvisible−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−214867 | 2624 | 2624(1 row)

对于没有过滤条件的查询,基数估算值等于reltuples:

EXPLAIN SELECT * FROM flights;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..4772.67 rows=214867 width=63)(1 row)

自动或手动分析期间采集统计信息。基本统计数据是重要信息,在执行某些操作时也会计算处理,例如VACUUM FULL和CLUSTER或CREATE INDEX和REINDEX。系统还会在VACCUM期间更新统计信息。

为采集统计信息,分析器随机select 300*default_statistics_target行数(默认值是100,因此总共为30000行)。此处未考虑表大小,因为总体数据集大小对足以进行精确统计的样本大小没有影响。

从300*default_statistics_target随机页中选择随机行。如果表比预期的样本大小小,分析器读取整个表

大表中,统计数据将不准确。因为分析器不会扫描每一行。即便扫描每一行,统计数据也总会有过期,因为表中数据一直在变化。无论如何,我们不需要统计数据那么精确:高达一个数量级的变化仍然足够准确以产生适当的计划。让我们创建一个禁用自动vacuum的表的副本flights,以便我们可以控制何时进行分析。

CREATE TABLE flights_copy(LIKE flights)WITH (autovacuum_enabled = false);

新表中还没有统计信息:

SELECT reltuples, relpages, relallvisibleFROM pg_class WHERE relname = 'flights_copy';reltuples | relpages | relallvisible−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−−1 | 0 | 0(1 row)

reltuples=-1(PG14及更高版本)帮助我们区分从没采集统计信息的表和空表。通常情况下,新创建的表会立即填充,规划器对新表无感知,因此默认情况下假定该表10页:

EXPLAIN SELECT * FROM flights_copy;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights_copy (cost=0.00..14.10 rows=410 width=170)(1 row)

规划器基于单行宽度计算行个数。宽度通常是在分析期间计算的平均值。但是,这次没有分析数据,因此系统根据列数据类型来估算宽度。从flights表拷贝数据到新表然后执行分析器:

INSERT INTO flights_copy SELECT * FROM flights;INSERT 0 214867ANALYZE flights_copy;

现在统计信息匹配真实行数。该表足够紧凑,分析器可以遍历每一行:

SELECT reltuples, relpages, relallvisibleFROM pg_class WHERE relname = 'flights_copy';reltuples | relpages | relallvisible−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−214867 | 2624 | 0(1 row)

Vacuum后relallvisible值会更新:

VACUUM flights_copy;SELECT relallvisible FROM pg_class WHERE relname = 'flights_copy';relallvisible−−−−−−−−−−−−−−−2624(1 row)

评估index-only扫描代价的时候会用到这个值。

我们保留老的统计信息,插入1倍元组,看下规划器得到的基数是多少:

INSERT INTO flights_copy SELECT * FROM flights;SELECT count(*) FROM flights_copy;count−−−−−−−−429734(1 row)
EXPLAIN SELECT * FROM flights_copy;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights_copy (cost=0.00..9545.34 rows=429734 width=63)(1 row)

尽管pg_class数据已过时,但该估计是准确的:

SELECT reltuples, relpagesFROM pg_class WHERE relname = 'flights_copy';reltuples | relpages−−−−−−−−−−−+−−−−−−−−−−214867 | 2624(1 row)

规划器注意到数据文件的大小不再匹配旧的relpages值,因此reltuples适当缩放以提高准确性。文件大小增加了1倍,因此行数也应该相应调整(假设数据密度不变):

SELECT reltuples *(pg_relation_size('flights_copy') / 8192) / relpagesFROM pg_class WHERE relname = 'flights_copy';?column?−−−−−−−−−−429734(1 row)

这种调整并不总是有效,例如可以删除几行,但估算值不会变化。但当发生较大变化时,这种方法可以让统计数据保持不变,直到analyze。

NULL值

虽然正统主义者看不起,但是NULL值可以方便地表示未知或者不存在的值。但是特殊值需要特殊处理。使用NULL值时需要考虑一些实际的注意事项。布尔逻辑变成三进制,NOT IN构造开始表现的很奇怪。目前尚不清楚NULL值是否被视为低于或者高于常规值(特殊从句NULLS FIRST和NULLS LAST帮助)。聚合函数中使用NULL值也很粗略。因为NULL值实际上根本不是值,规划器需要额外的数据来容纳他们。

除了基本的关系级别统计信息外,分析器还收集关系中每一列的统计信息。此数据存储在pg_statistic系统表中,可以使用pg_stats视图方便地显示。

NULL值的分数是列级别的统计信息。被指定为pg_stats中的null_frac。本例中,一些飞机还没起飞,所以他们的起飞时间是不确定的:

EXPLAIN SELECT * FROM flights WHERE actual_departure IS NULL;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..4772.67 rows=16036 width=63)Filter: (actual_departure IS NULL)(2 rows)

优化器将总行数乘以NULL分数:

SELECT round(reltuples * s.null_frac) AS rowsFROM pg_classJOIN pg_stats s ON s.tablename = relnameWHERE s.tablename = 'flights'AND s.attname = 'actual_departure';rows−−−−−−−16036(1 row)

这与 16348 的真实值足够接近。

Distinct值

一列中distinct值个数存储在pg_stats的n_distinct字段。如果n_distinct为负值,则其绝对值表示不同值的比例。例如,对于-1值,表示这列的值都是唯一的。当不同值的数量达到行数的10%或更多时,分析器将切换到分数模式。此时当修改数据时该比例通常会保持不变。如果不同值的数量计算不准确(因为样本恰好不具有代表性),您可以手动设置此值:

ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...);

在数据均匀分布下,不同值的数量很有用。考虑“column = expression”子句的基数估计。如果在规划阶段表达式值未知,则规划器假定表达式同样可能从列中返回任何值。

EXPLAINSELECT * FROM flights WHERE departure_airport = (SELECT airport_code FROM airports WHERE city = 'Saint Petersburg');QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=30.56..5340.40 rows=2066 width=63)Filter: (departure_airport = $0)InitPlan 1 (returns $0)−> Seq Scan on airports_data ml (cost=0.00..30.56 rows=1 wi...Filter: ((city −>> lang()) = 'Saint Petersburg'::text)(5 rows)

InitPlan节点只执行一次,然后在主计划中使用改制而不是$0。

SELECT round(reltuples / s.n_distinct) AS rowsFROM pg_classJOIN pg_stats s ON s.tablename = relnameWHERE s.tablename = 'flights'AND s.attname = 'departure_airport';rows−−−−−−2066(1 row)如果所有数据均匀分布,则这些统计数据(连同最小值和最大值)足以进行准确的估计。不幸的是,这种估算不适用于非均匀分布,后者更为常见:

SELECT min(cnt), round(avg(cnt)) avg, max(cnt) FROM (

SELECT departure_airport, count(*) cntFROM flights GROUP BY departure_airport) t;min | avg | max−−−−−+−−−−−−+−−−−−−−113 | 2066 | 20875(1 row)
最常见的值

为提高非均匀分布的估算精度,分析器通常收集最常见值及其频率的统计信息。这些值存储在pg_stats的most_common_vals和most_common_freqs中。

以下是最常见飞机类型的此类统计数据示例:

SELECT most_common_vals AS mcv,left(most_common_freqs::text,60) || '...' AS mcfFROM pg_statsWHERE tablename = 'flights' AND attname = 'aircraft_code' \gx−[ RECORD 1 ]−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−mcv | {CN1,CR2,SU9,321,763,733,319,773}mcf | {0.2783,0.27473333,0.25816667,0.059233334,0.038533334,0.0370...

估算“column = expression”的选择性非常简单:规划器只需从most_common_vals数组中获取一个值,然后将其乘以相同位置的频率most_common_freqs。

EXPLAIN SELECT * FROM flights WHERE aircraft_code = '733';QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..5309.84 rows=7957 width=63)Filter: (aircraft_code = '733'::bpchar)(2 rows)
SELECT round(reltuples * s.most_common_freqs[array_position((s.most_common_vals::text::text[]),'733')])FROM pg_classJOIN pg_stats s ON s.tablename = relnameWHERE s.tablename = 'flights'AND s.attname = 'aircraft_code';round−−−−−−−7957(1 row)

这个估算值将接近8263的真实值。

MCV列表也用于不等式的选择性估计:为了找到“column < value”的选择性,规划器搜索most_common_vals所有低于给定值的值,然后将他们的频率相加most_common_freqs。

当不同值数量较少时,公共值统计最有效。MCV数组的最大大小由default_statistics_target控制,该参数与分析期间控制行样本大小的参数相同。

某些情况下,将值(以及数组大小)增加到超出默认值将提供更加准确的统计。可以为每列设置此值:

ALTER TABLE ... ALTER COLUMN ... SET STATISTICS ...;

行样本大小也会增加,但仅限于表。公共值数组存储值本身,并且根据值的不同,可能会占用大量空间。这就是为什么超过1KB的值被排除在分析和统计之外的原因。它可以使pg_statistic大小在控制内,并且不会使规划器超载。无论如何,这么大的值通常是不同的,不包含在most_common_vals内。

直方图

当不同值的数量变得太大而无法将他们全部存储在数组时,系统开始使用直方图表示。直方图使用多个存储桶来存储值。存储桶的数量受相同的default_statistics_target参数限制。每个桶的宽度以这样一种方式选择,即在他们之间均匀分布值(如图上具有大致相同面积的矩形表示)。这种表示使系统能够只存储直方图边界,而不是浪费空间来存储每个桶的频率。直方图不包括MCV列表中的值。边界存储在pg_stats的histogram_bounds字段。任何桶的汇总频率等于1/桶数。

直方图存储为存储桶边界数组:

SELECT left(histogram_bounds::text,60) || '...' AS histogram_boundsFROM pg_stats sWHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no';histogram_bounds−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−{10B,10D,10D,10F,11B,11C,11H,12H,13B,14B,14H,15H,16D,16D,16H...(1 row)

除其他外,直方图与MCV列表一起用于估计“大于”和“小于”操作的选择性。示例:计算为后排座位签发登机牌数量:

EXPLAIN SELECT * FROM boarding_passes WHERE seat_no > '30C';QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on boarding_passes (cost=0.00..157353.30 rows=2943394 ...Filter: ((seat_no)::text > '30C'::text)(2 rows)

截止座位号专门选择在两个桶之间的边缘。此条件的选择性是N/桶数。其中N是具有匹配值的桶数(在截止点右侧)。请记住,直方图没有考虑最常见的值和未定义的值。我们先看看匹配最常见值的分数:

SELECT sum(s.most_common_freqs[array_position((s.most_common_vals::text::text[]),v)])FROM pg_stats s, unnest(s.most_common_vals::text::text[]) vWHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no'AND v > '30C';sum−−−−−−−−0.2127(1 row)

看看最常见值的分数(从直方图中排除):

SELECT sum(s.most_common_freqs[array_position((s.most_common_vals::text::text[]),v)])FROM pg_stats s, unnest(s.most_common_vals::text::text[]) vWHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no';sum−−−−−−−−0.6762(1 row)

seat_no列中没有NULL值:

SELECT s.null_fracFROM pg_stats sWHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no';null_frac−−−−−−−−−−−0(1 row)

间隔正好涵盖49个桶(总共100个),结果估算:

SELECT round( reltuples * (0.2127 -- from most common values+ (1 - 0.6762 - 0) * (49 / 100.0) -- from histogram))FROM pg_class WHERE relname = 'boarding_passes';round−−−−−−−−−2943394(1 row)

真实值是2986429.当截止值不再桶边缘时,该桶的匹配分数是使用线性插值计算的。

较高的default_statistis_target值可能会提高估计精度,但直方图与MCV列表一起已经产生了很好的结果,即使有大量不同值:

SELECT n_distinct FROM pg_statsWHERE tablename = 'boarding_passes' AND attname = 'seat_no';n_distinct−−−−−−−−−−−−461(1 row)

更高的估计精度只有在提高规划质量时才有用。在没有正当理由下增加default_statistics_target可能会减慢分析和计划,但对优化没有影响。另一方面,降低参数(一直降到0)可能会提高分析和计划速度,但也可能导致计划质量地下,因此这种“节省时间”很少被证明是合理的。

非标量数据类型的统计

非标量数据类型的统计信息可能包括非标量值本身的分布数据,还包括组成元素的分布数据。这允许查询非第一范式中的列时进行更准确的计划。

1) 数组most_common_elems和most_common_elem_freqs 包含最常见的元素及其频率。这些统计数据被采集,并用于估计阵列和tsvector数据的选择性。

2) elem_count_histogram数组是不同值的数量的直方图。采集这些数据,并仅用于估计阵列的选择性

3) 对于范围数据类型,直方图用于表示范围长度的分布及下限和上限的分布。然后,这些直方图有助于估计使用这些数据类型的各种操作的选择性。他们未显示在pg_stats中。

这些统计信息也用于PG14中引入的多范围数据类型。

平均字段宽度

pg_stats中的avg_width字段表示一列中平均字段宽度。数据类型比如integer或char(3)的字段宽度是固定的,但是当使用没有设置宽度的数据类型(例如text)时,值可能会因列而异。

SELECT attname, avg_width FROM pg_statsWHERE (tablename, attname) IN ( VALUES('tickets', 'passenger_name'), ('ticket_flights','fare_conditions'));attname | avg_width−−−−−−−−−−−−−−−−−+−−−−−−−−−−−fare_conditions | 8passenger_name | 16(2 rows)

这些信息帮助评估排序或哈希等操作的内存使用。

相关性

pg_stats中correlation字段表示磁盘上物理行序和列值的逻辑序(大于或者小于)之间的关系,值从-1到1。如果值存储有序,相关值接近1。如果值反向有序,相关值接近于-1。数据在磁盘上分布越混乱,值越接近于0。

SELECT attname, correlationFROM pg_stats WHERE tablename = 'airports_data'ORDER BY abs(correlation) DESC;attname | correlation−−−−−−−−−−−−−−+−−−−−−−−−−−−−coordinates |airport_code | −0.21120238city | −0.1970127airport_name | −0.18223621timezone | 0.17961165(5 rows)

coordinates无法收集列的统计信息,因为没有为点数据类型定义比较操作(小于和大于)。相关行用于索引扫描成本估计。

表达式统计

通常来讲,列统计信息只在操作调用列本身时使用,而不是用于以列为参数的表达式。规划器并不知道函数如何影响列的统计信息。因此向“function-call = constant”这样的条件,总是估计在0.5%。

EXPLAIN SELECT * FROM flightsWHERE extract(month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow') = 1;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..6384.17 rows=1074 width=63)Filter: (EXTRACT(month FROM (scheduled_departure AT TIME ZONE ...(2 rows)SELECT round(reltuples * 0.005)FROM pg_class WHERE relname = 'flights';round−−−−−−−1074(1 row)

规划器不能处理标准函数,而对我们来说,很明显1月份的航班比例将约为总航班的1/12:

SELECT count(*) AS total,count(*) FILTER (WHERE extract(month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow') = 1) AS januaryFROM flights;total | january−−−−−−−−+−−−−−−−−−214867 | 16831(1 row)

这就是表达式统计的用武之地。

扩展表达式统计

PG14引入了一种称为扩展表达式统计的特性。扩展表达式统计信息不会自动收集。要手动收集他们,请使用CREATE STATISTICS命令创建扩展统计数据库对象。

CREATE STATISTICS flights_expr ON (extract(month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'))FROM flights;

新的统计信息会提高估算准确性:

ANALYZE flights;EXPLAIN SELECT * FROM flightsWHERE extract(month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow') = 1;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..6384.17 rows=16222 width=63)Filter: (EXTRACT(month FROM (scheduled_departure AT TIME ZONE ...(2 rows)

要使统计信息起作用,统计信息生成命令中的表达式必须与原始查询中表达式相同。扩展统计信息元数据存储在pg_statistic_ext系统表中,而统计数据本身存储在单独的表pg_statistic_ext_data(PG12及更高版本)中。如有必要,与元数据分开存储,以限制用户访问敏感信息。

有些视图以用户友好的形式显示收集的统计信息。可以使用以下命令显示扩展表达式统计信息:

SELECT left(expr,50) || '...' AS expr,null_frac, avg_width, n_distinct,most_common_vals AS mcv,left(most_common_freqs::text,50) || '...' AS mcf,correlationFROM pg_stats_ext_exprs WHERE statistics_name = 'flights_expr' \gx-[ RECORD 1 ]−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−expr | EXTRACT(month FROM (scheduled_departure AT TIME ZO...null_frac | 0avg_width | 8n_distinct | 12mcv | {8,9,3,5,12,4,10,7,11,1,6,2}mcf | {0.12526667,0.11016667,0.07903333,0.07903333,0.078..correlation | 0.095407926

可以使用以下命令ALTER STATISTICS命令更改收集的统计数据量:

ALTER STATISTICS flights_expr SET STATISTICS 42;

表达式索引统计

正如使用常规表一样,建立表达式索引时,系统会收集其统计信息。计划者也可以使用这些统计数据。这很方便,但前提是我们真正关系索引。

DROP STATISTICS flights_expr;CREATE INDEX ON flights(extract(month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow')); => ANALYZE flights;EXPLAIN SELECT * FROM flights WHERE extract(month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow') = 1;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Bitmap Heap Scan on flights (cost=318.42..3235.96 rows=16774 wi...Recheck Cond: (EXTRACT(month FROM (scheduled_departure AT TIME...−> Bitmap Index Scan on flights_extract_idx (cost=0.00..314.2...Index Cond: (EXTRACT(month FROM (scheduled_departure AT TI...(4 rows)

表达式索引统计信息的存储方式与表统计信息相同。例如,这是不同值的数量:

SELECT n_distinct FROM pg_statsWHERE tablename = 'flights_extract_idx';n_distinct−−−−−−−−−−−−12(1 row)

PG11及更高版本中,可以使用ALTER INDEX命令更改索引统计的准确性。您可能需要引用该表达式的列名。例如:

SELECT attname FROM pg_attributeWHERE attrelid = 'flights_extract_idx'::regclass;attname−−−−−−−−−extract(1 row)ALTER INDEX flights_extract_idxALTER COLUMN extract SET STATISTICS 42;
多元统计

PG10引入了同时从多个列收集统计信息的能力,也成为multivariate statistics。需要手动生成必应的扩展统计信息。multivariate statistics分为三种类型。

列之间的功能依赖关系

当一列中的值(完全或部分)由另一列中的值确定时,并且在查询中存在引用两列的条件,则结果基数将被低估。这是一个具有2个条件的示例:

SELECT count(*) FROM flightsWHERE flight_no = 'PG0007' AND departure_airport = 'VKO';count−−−−−−−396(1 row)

估计值明显低于应有值,只有26行:

EXPLAIN SELECT * FROM flightsWHERE flight_no = 'PG0007' AND departure_airport = 'VKO';QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Bitmap Heap Scan on flights (cost=12.03..1238.70 rows=26 width=63)Recheck Cond: (flight_no = 'PG0007'::bpchar)Filter: (departure_airport = 'VKO'::bpchar)−> Bitmap Index Scan on flights_flight_no_scheduled_departure_key(cost=0.00..12.02 rows=480 width=0)Index Cond: (flight_no = 'PG0007'::bpchar)(6 rows)

这就是臭名昭著的correlated predicates问题。规划器期望谓词独立,并计算得到的选择性集合AND作为条件选择性的乘积。应用bitmap heap scan中的departure_airport条件后,为计算flight_no条件的Bitmap索引扫描评估值显著下降。当然,航班号已经明确定义了出发机场,因此第二个条件实际上是多余的。这是扩展功能依赖统计可以帮助改进估计的地方。让我们为两列创建扩展的函数依赖统计信息:

CREATE STATISTICS flights_dep(dependencies)ON flight_no, departure_airport FROM flights;

再次分析,现在使用新的统计数据,估计得到改善:

ANALYZE flights;EXPLAIN SELECT * FROM flightsWHERE flight_no = 'PG0007' AND departure_airport = 'VKO';QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Bitmap Heap Scan on flights (cost=10.56..816.91 rows=276 width=63)Recheck Cond: (flight_no = 'PG0007'::bpchar)Filter: (departure_airport = 'VKO'::bpchar)−> Bitmap Index Scan on flights_flight_no_scheduled_departure_key(cost=0.00..10.49 rows=276 width=0)Index Cond: (flight_no = 'PG0007'::bpchar)(6 rows)

统计信息存储在系统表中,可以使用以下命令显示:

SELECT dependenciesFROM pg_stats_ext WHERE statistics_name = 'flights_dep';dependencies−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−{"2 => 5": 1.000000, "5 => 2": 0.010567}(1 row)

数字2和5是表pg_attribute中的列号。他们旁边的值表示函数依赖程度。从0(独立)到1(第2列中的值完全由第一列中的值定义)

不同值的多元数量

对于来自多列的值的不同组合数量的统计将显著提高GROUP BY对多列操作的基数。在此示例中,规划器将离开和抵达机场的对数估计值作为机场总数的平方。然而真正成对数量要低的多。因为并非每2个机场都通过直飞航班连接:

SELECT count(*) FROM (SELECT DISTINCT departure_airport, arrival_airport FROM flights) t;count−−−−−−−618(1 row)EXPLAINSELECT DISTINCT departure_airport, arrival_airport FROM flights;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−HashAggregate (cost=5847.01..5955.16 rows=10816 width=8)Group Key: departure_airport, arrival_airport−> Seq Scan on flights (cost=0.00..4772.67 rows=214867 width=8)(3 rows)

让我们为不同值的数量创建一个统计信息:

CREATE STATISTICS flights_nd(ndistinct)ON departure_airport, arrival_airport FROM flights;ANALYZE flights;EXPLAINSELECT DISTINCT departure_airport, arrival_airport FROM flights;QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−HashAggregate (cost=5847.01..5853.19 rows=618 width=8)Group Key: departure_airport, arrival_airport−> Seq Scan on flights (cost=0.00..4772.67 rows=214867 width=8)(3 rows)

统计信息存储在系统表:

SELECT n_distinctFROM pg_stats_ext WHERE statistics_name = 'flights_nd';n_distinct−−−−−−−−−−−−−−−{"5, 6": 618}(1 row)

多元最常见值列表

当值分布不均匀时,仅功能依赖数据可能不够,因为估计值将根据特定的值对而变化。考虑这个例子,计划者错误地估计了波音737从Sheremetyevo机场起飞的航班数量:

SELECT count(*) FROM flightsWHERE departure_airport = 'SVO' AND aircraft_code = '733'count−−−−−−−2037(1 row)EXPLAIN SELECT * FROM flightsWHERE departure_airport = 'SVO' AND aircraft_code = '733';QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..5847.00 rows=733 width=63)Filter: ((departure_airport = 'SVO'::bpchar) AND (aircraft_cod...(2 rows)

我们可以使用多元MCV列表统计来改进估计:

CREATE STATISTICS flights_mcv(mcv)ON departure_airport, aircraft_code FROM flights;ANALYZE flights;EXPLAIN SELECT * FROM flightsWHERE departure_airport = 'SVO' AND aircraft_code = '733';QUERY PLAN−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Seq Scan on flights (cost=0.00..5847.00 rows=2077 width=63)Filter: ((departure_airport = 'SVO'::bpchar) AND (aircraft_cod...(2 rows)

现有系统表有频率数据供规划器使用:

SELECT values, frequencyFROM pg_statistic_ext stxJOIN pg_statistic_ext_data stxd ON stx.oid = stxd.stxoid,pg_mcv_list_items(stxdmcv) mWHERE stxname = 'flights_mcv'AND values = '{SVO,773}';values | frequency−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−{SVO,773} | 0.005733333333333333(1 row)

一个multivariate most common值列表存储default_statistics_target 值,就像一个常规 MCV 列表一样。如果参数是在列级别定义,则使用最大值。与扩展表达式统计信息一样,您可以更改列表大小(PG13及更高版本):

ALTER STATISTICS ... SET STATISTICS ...;

在这些示例中,仅为2列收集了multivariate statistics,但是可以根据需要为任意多的列收集他们。还可以将不同类型的统计信息收集到单个扩展统计信息对象中。为此,只需在创建对象时列出逗号分隔的所需统计类型。如果没有定义特定的统计类型,系统将一次收集所有可用的统计信息。

PG14在做multivariate和表达式统计时,不仅允许使用列名,也可以使用任意表达式。

原文

标签: #排列组合cn1怎么算 #排列组合cn1怎么算的 #cn1排列组合怎么计算