I am analyzing the "explanation plan" about the following instruction
SELECT * FROM friends WHERE SUBSTR(activity,1,2) = '49';
and Oracle SQL Developer tells me that it has a cardinality of 1513 and cost of 1302.
How are these calculations performed? Could be reproduced with an instruction (calculate with a select and obtain de same value)?
CodePudding user response:
The cardinality generated by an explain plan can be based on many factors, but in your code Oracle is probably just guessing that the SUBSTR
expression will return 1% of all rows from the table.
For example, we can recreate your cardinality estimate by creating a simple table with 151,300 rows:
drop table friends;
create table friends(activity varchar2(100));
create index friends_idx on friends(activity);
insert into friends select level from dual connect by level <= 1513 * 100;
begin
dbms_stats.gather_table_stats(user, 'FRIENDS', no_invalidate => false);
end;
/
The resulting explain plan estimates the query will return 1% of the table, or 1513 rows:
explain plan for SELECT * FROM friends WHERE SUBSTR(activity,1,2) = '49';
select * from table(dbms_xplan.display);
Plan hash value: 3524934291
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1513 | 9078 | 72 (6)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| FRIENDS | 1513 | 9078 | 72 (6)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(SUBSTR("ACTIVITY",1,2)='49')
The above code is the simplest explanation, but there are potentially dozens of other weird things that are going on with your query. Running EXPLAIN PLAN FOR SELECT...
and then SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
is often enough to investigate the cardinality. Pay special attention to the "Note" section for any unexpected gotchas.
Not all of these cardinality rules and features are documented. But if you have a lot of free time, and want to understand the math behind it all, run some 10053 trace files and read Jonathan Lewis' blog and book. His book also explains how the "cost" is generated, but the calculations are so complicated that it's not worth worrying about.
Why doesn't Oracle calculate a perfect cardinality estimate?
It's too expensive to calculate actual cardinalities before running the queries. To create an always-perfect estimate for the SUBSTR
operation, Oracle would have to run something like the below query:
SELECT SUBSTR(activity,1,2), COUNT(*)
FROM friends
GROUP BY SUBSTR(activity,1,2);
For my sample data, the above query returns 99 counts, and determines that the cardinality estimate should be 1111 for the original query.
But the above query has to first read all the data from FRIENDS.ACTIVITY, which requires either an index fast full scan or a full table scan. Then the data has to be sorted or hashed to get the counts per group (which is likely an O(N*LOG(N)) operation). If the table is large, the intermediate results won't fit in memory and must be written and then read from disk.
Pre-calculating the cardinality would be more work than the actual query itself. The results could perhaps be saved, but storing those results could take up a lot of space, and how does the database know that the predicate will ever be needed again? And even if the pre-calculated cardinalities were stored, as soon as someone modifies the table those values may become worthless.
And this whole effort assumes that the functions are deterministic. While SUBSTR
works reliably, what if there was a custom function like DBMS_RANDOM.VALUE
? These problems are both theoretically impossible (the halting problem), and very difficult in practice. Instead, the optimizer relies on guesses like DBA_TABLES.NUM_ROWS (from when the statistics were last gathered) * 0.01 for "complex" predicates.
Dynamic Sampling
Dynamic sampling, also known as dynamic statistics, will pre-run parts of your SQL statement to create a better estimate. You can set the amount of data to be sampled, and by setting the value to 10, Oracle will effectively run the whole thing ahead of time to determine the cardinality. This feature can obviously be pretty slow, and there are lots of weird edge cases and other features I'm not discussing here, but for your query it can create a perfect estimate of 1,111 rows:
EXPLAIN PLAN FOR SELECT /* dynamic_sampling(10) */ * FROM friends WHERE SUBSTR(activity,1,2) = '49';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3524934291
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1111 | 6666 | 72 (6)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| FRIENDS | 1111 | 6666 | 72 (6)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(SUBSTR("ACTIVITY",1,2)='49')
Note
-----
- dynamic statistics used: dynamic sampling (level=10)
Dynamic Reoptimization
Oracle can keep track of the number of rows at run-time and adjust the plan accordingly. This feature doesn't help you with your simple sample query. But if the table was used as part of a join, when the cardinality estimates become more important, Oracle will build multiple versions of the explain plan and use the one depending on the actual cardinality.
In the below explain plan, you can see the estimate is still the same old 1513. But if the actual number is much lower at run time, Oracle will disable the HASH JOIN
operation meant for a large number of rows, and will switch to the NESTED LOOPS
operation that is better suited for a smaller number of rows.
EXPLAIN PLAN FOR
SELECT *
FROM friends friends1
JOIN friends friends2
ON friends1.activity = friends2.activity
WHERE SUBSTR(friends1.activity,1,2) = '49';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(format => ' adaptive'));
Plan hash value: 215764417
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1530 | 18360 | 143 (5)| 00:00:01 |
| * 1 | HASH JOIN | | 1530 | 18360 | 143 (5)| 00:00:01 |
|- 2 | NESTED LOOPS | | 1530 | 18360 | 143 (5)| 00:00:01 |
|- 3 | STATISTICS COLLECTOR | | | | | |
| * 4 | TABLE ACCESS FULL | FRIENDS | 1513 | 9078 | 72 (6)| 00:00:01 |
|- * 5 | INDEX RANGE SCAN | FRIENDS_IDX | 1 | 6 | 168 (2)| 00:00:01 |
| 6 | TABLE ACCESS FULL | FRIENDS | 151K| 886K| 70 (3)| 00:00:01 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("FRIENDS1"."ACTIVITY"="FRIENDS2"."ACTIVITY")
4 - filter(SUBSTR("FRIENDS1"."ACTIVITY",1,2)='49')
5 - access("FRIENDS1"."ACTIVITY"="FRIENDS2"."ACTIVITY")
Note
-----
- this is an adaptive plan (rows marked '-' are inactive)
Expression Statistics
Expression statistics tells Oracle to gather additional types of statistics. We can force Oracle to gather statistics on the SUBSTR
expression, and then those statistics can be used for more accurate estimates. In the below example, the final estimate is actually only slightly different. Expression statistics alone don't work well here, but that was just bad luck in this case.
SELECT dbms_stats.create_extended_stats(extension => '(SUBSTR(activity,1,2))', ownname => user, tabname => 'FRIENDS')
FROM DUAL;
begin
dbms_stats.gather_table_stats(user, 'FRIENDS');
end;
/
EXPLAIN PLAN FOR SELECT * FROM friends WHERE SUBSTR(activity,1,2) = '49';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3524934291
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1528 | 13752 | 72 (6)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| FRIENDS | 1528 | 13752 | 72 (6)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(SUBSTR("ACTIVITY",1,2)='49')
Expression Statistics and Histograms
With the addition of a histogram, we're finally creating something pretty similar to what your teacher described. When the expression statistics are gathered, a histogram will save information about the number of unique values in up to 255 different ranges or buckets. In our case, since there are only 99 unique rows, the histogram will perfectly estimate the number of rows for '49' as '1111'.
--(There are several ways to gather histograms. Instead of directly forcing it, I prefer to call the query
-- multiple times so that Oracle will register the need for a histogram, and automatically create one.)
SELECT * FROM friends WHERE SUBSTR(activity,1,2) = '49';
SELECT * FROM friends WHERE SUBSTR(activity,1,2) = '49';
begin
dbms_stats.gather_table_stats(user, 'FRIENDS');
end;
/
EXPLAIN PLAN FOR SELECT * FROM friends WHERE SUBSTR(activity,1,2) = '49';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 3524934291
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1111 | 9999 | 72 (6)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| FRIENDS | 1111 | 9999 | 72 (6)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(SUBSTR("ACTIVITY",1,2)='49')
Summary
Oracle will not automatically pre-run all predicates to perfectly estimate cardinalities. But there are several mechanisms we can use to get Oracle to do something very similar for a small number of queries that we care about.
The situation gets even more complicated when you consider bind variables - what if the value '49' changes frequently? (Adaptive Cursor Sharing can help with that.) Or what if a huge amount of rows are modified, how do we update statistics quickly? (Online Statistics Gathering and Incremental Statistics can help with that.)
The optimizer doesn't really optimize. There's only enough time to satisfice.