Suppose I have tables A
and B
, both with over 100 million rows containing 8 years worth of data. They both have columns row_id
(=unique row identifier, e.g. row number), person_id
, dt
(=datetime). A row in A
represents a time when a user opened an email message. A row in B
represents a time a user visited a specific website.
I wish to create a table C
, which is exactly A
with a few (let's say 10) columns C_i
appended. These columns C_i
must measure, after a user opened an email (i.e. in a row in A), how many times (in a time-interval of i
hours after opening the email) he visited the website (i.e. a count of a few rows in B
). How can I efficiently construct C
?
My attempt, in Oracle PL/SQL, for adding just one column C_1
:
SELECT
any_value(A.id),
any_value(A.person_id),
any_value(A.dt),
count(1) response_1h
FROM A, B
WHERE
A.person_id = B.person_id AND
0 <= B.dt - A.dt AND B.dt-A.dt<=1/24
GROUP BY A.row_id;
I am getting an error (not for this but for a similar query), I suspect because of joining tables that are too large:
ORA-01652: unable to extend temp segment by 256 in tablespace TEMP_DEV 01652. 00000 - "unable to extend temp segment by %s in tablespace %s" *Cause: Failed to allocate an extent of the required number of blocks for a temporary segment in the tablespace indicated. *Action: Use ALTER TABLESPACE ADD DATAFILE statement to add one or more files to the tablespace indicated.
Doing it this way for 10 new columns seems far from optimal. Would anyone care to tell me how this is done properly?
CodePudding user response:
You can count the website access with conditional aggregation. Below examples looks at the ten hours after each message and count conditionally.
select
a.id, a.person_id, a.dt,
count(case when b.dt <= a.dt interval '1' hour then 1 end) as response_1h,
count(case when b.dt <= a.dt interval '2' hour then 1 end) as response_2h,
count(case when b.dt <= a.dt interval '3' hour then 1 end) as response_3h,
count(case when b.dt <= a.dt interval '4' hour then 1 end) as response_4h,
count(case when b.dt <= a.dt interval '5' hour then 1 end) as response_5h,
count(case when b.dt <= a.dt interval '6' hour then 1 end) as response_6h,
count(case when b.dt <= a.dt interval '7' hour then 1 end) as response_7h,
count(case when b.dt <= a.dt interval '8' hour then 1 end) as response_8h,
count(case when b.dt <= a.dt interval '9' hour then 1 end) as response_9h,
count(b.dt) as response_10h
from a
left join b on b.person_id = a.person_id
and b.dt > a.dt
and b.dt <= a.dt interval '10' hour
group by a.id, a.person_id, a.dt
order by a.id;
The same with a lateral join instead:
select *
from a
outer apply
(
select
count(case when b.dt <= a.dt interval '1' hour then 1 end) as response_1h,
count(case when b.dt <= a.dt interval '2' hour then 1 end) as response_2h,
count(case when b.dt <= a.dt interval '3' hour then 1 end) as response_3h,
count(case when b.dt <= a.dt interval '4' hour then 1 end) as response_4h,
count(case when b.dt <= a.dt interval '5' hour then 1 end) as response_5h,
count(case when b.dt <= a.dt interval '6' hour then 1 end) as response_6h,
count(case when b.dt <= a.dt interval '7' hour then 1 end) as response_7h,
count(case when b.dt <= a.dt interval '8' hour then 1 end) as response_8h,
count(case when b.dt <= a.dt interval '9' hour then 1 end) as response_9h,
count(b.dt) as response_10h
from b
where b.person_id = a.person_id
and b.dt > a.dt
and b.dt <= a.dt interval '10' hour
)
order by a.id;
Look at the explain plans to see whether one is supposed to run faster than the other.
For either query you want a composite index on table b in order to find its matching rows quickly. As the order of the columns in the index should matter, I'd add both possible indexes:
create index idx1 on b (person_id, dt);
create index idx2 on b (dt, person_id);
You can always drop an index that is not being used.
As to ORA-01652: I cannot answer this. Maybe you can just increase the size of the temp tablespace and be done with the problem. I don't know.