JOIN with LIKE is taking too long-CodePudding

I have a query that is taking to long (over 6 hours). I'm trying to find for each ID the length of the teams that he is part of (the team are combination of all the ID's in the team separated by ',' ) so that later I will be able to filter on the Maximum team length (which is the most recent one ). I'm trying to do it with a JOIN with LIKE condition and it takes a lot of time.

This is the first query:

CREATE TABLE TEMP AS
SELECT I.ID ID,E.TEAM TEAM,LENGTH(E.TEAM ) LEN
FROM TEAM_TABEL I JOIN TEAM _TABEL E
ON E.TEAM LIKE '%'|| I.ID||'%'
;

This is the second query that I used to get the maximum length:

CREATE TABLE TEMP2 AS
SELECT ID,MAX(LEN) MAX_LEN
FROM TEMP
GROUP BY ID;

Is There a more efficient way to do it so it will not take so long?

An example for dataset creation:

CREATE TABLE TEAM_TABEL (
ID VARCHAR2(20),
TEAM  VARCHAR2(1000) 
);

INSERT INTO RAW_DATA (ID,TEAM)
SELECT 'x1','x1,x3,x5' FROM DUAL
UNION ALL SELECT 'x2','x2,x3,x5' FROM DUAL
UNION ALL SELECT 'x1','x1,x2,x3,x5' FROM DUAL
UNION ALL SELECT 'x4','x4' FROM DUAL
UNION ALL SELECT 'x5','x1,x2,x3,x' FROM DUAL
UNION ALL SELECT 'z3','z3' FROM DUAL

CodePudding user response：

You have a table consisting of two columns: ID and team. It sounds like for every team you simply want the longest team string. This is one result row per ID, which translates to GROUP BY id in SQL.

To get the team with the maximum length you can use Oracle's KEEP FIRST/LAST.

select id, max(team) keep (dense_rank last order by length(team)) as max_length_team
from mytable
group by id
order by id;

CodePudding user response：

As indicated in the comments, I think the best thing to do is redesign the table structure. And if that's not possible, the most realistic thing to improve your query is to create a special index on your column so that you can use contains like described here.

However, I think it is possible that something like the following will be faster. (If this does help you, note that I borrowed the keep strategy from Thorsten Kettner's answer.)

SELECT t1.id, max(t2.team) keep (dense_rank last order by length(t2.team)) as max_length_team
FROM team_table t1 INNER JOIN (
 SELECT ID, regexp_substr(team, '[^,] ', 1, LEVEL) as team_breakdown, team
 FROM team_table
 CONNECT BY regexp_substr(team, '[^,] ', 1, LEVEL) IS NOT NULL
) t2
  ON t1.id = t2.team_breakdown
GROUP BY t1.id

Or maybe:

SELECT team_breakdown, max(team) keep (dense_rank last order by length(team)) as max_length_team
FROM (
 SELECT ID, regexp_substr(team, '[^,] ', 1, LEVEL) as team_breakdown, team
 FROM team_table
 CONNECT BY regexp_substr(team, '[^,] ', 1, LEVEL) IS NOT NULL
)
WHERE team_breakdown IN (SELECT id FROM team_table) -- to make it functionally equivalent to the join... depending on what you know about your data, this might not be necessary
GROUP BY team_breakdown

Without knowing a lot of statistics about your actual data, I'm not confident that this will necessarily be faster, but I think it might be.

The idea in both the examples above is to break your list into rows, so that your DB can act as if the data is better normalized.

Each comma-delimited list will need to be scanned once to break it into rows and there is overhead here, but it seems possible that your like join is scanning each list many times. It might be faster to do column comparisons, notwithstanding the overhead of creating them.

If this doesn't improve things in memory, you could use the subquery above to actually temporarily store the data in a real table or a materialized view with a proper index. If this query takes 6 hours as you say, it might be worth doing something like that (if you really need to keep the delimited lists.)