I have a reviews
table as follows:
r_id | my_comment |
---|---|
1 | Boxes with the TID 823 cannot exceed 40 kg |
2 | Parcel with the marking tid 63157 must not make the weight go over 31 k.g |
3 | Envelopes with TID 104124 and TID 92341 cant excel above 94.477kg |
4 | TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg |
5 | Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG |
I am trying to match 2 things. The TID and the weight (kg). As you can see there are a 3 things to keep in mind
- Weight is always in kg and is written case insensitively and in 2 ways,
kg
andk.g
and written in 2 ways<weight> <kg or k.g>
<weight><kg or k.g>
(one with space, one without space) - TID is written case insensitively and can be written in 2 ways
TID<id>
orTID <id>
(one with a space, one without a space. - Some comments have multiple TID and weights. I am making the assumption that the first appearance of the TID is tied to the first appearance of the weight and the the second appearance of the TID is for the second appearance of the weight. I've only gone up to 2 instances of the TID/weight but I would like it to dynamically work for any amount of instances.
So I am able to extract the TID
and the weight if the comment only has 1 weight and 1 TID. However, if it has multiple, I fail to do so. So I want to separate the multiple into different rows.
This is my desired output
r_id | tid | weight | my_comment |
---|---|---|---|
1 | 823 | 40 | Boxes with the TID 823 cannot exceed 40 kg |
2 | 63157 | 31 | Parcel with the marking tid 63157 must not make the weight go over 31 k.g |
3 | 104124 | 94.477 | Envelopes with TID 104124 and TID 92341 can't excel above 94.477kg |
3 | 92341 | Envelopes with TID 104124 and TID 92341 can't excel above 94.477kg | |
4 | 38204 | 45.4 | TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg |
4 | 8242602 | 92 | TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg |
5 | 94514 | 52 | Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG |
5 | 51 | Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG |
SQL to create table/dummy data:
CREATE TABLE reviews(
r_id number(3) NOT NULL,
my_comment VARCHAR(255) NOT NULL
);
INSERT INTO reviews (r_id, my_comment) VALUES (1, 'Boxes with the TID 823 cannot exceed 40 kg');
INSERT INTO reviews (r_id, my_comment) VALUES (2, 'Parcel with the marking tid 63157 must not make the weight go over 31 k.g');
INSERT INTO reviews (r_id, my_comment) VALUES (3, 'Envelopes with TID 104124 and TID 92341 cant excel above 94.477kg');
INSERT INTO reviews (r_id, my_comment) VALUES (4, 'TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg');
INSERT INTO reviews (r_id, my_comment) VALUES (5, 'Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG');
In my attempt, I am able to extract the tid and weight, but only the first instance and not able to split it into rows.
SELECT
r_id,
REGEXP_SUBSTR (
REGEXP_SUBSTR (my_comment, '(tid).*?[0-9] ', 1, 1, 'i'),
'[0-9] '
) as "tid",
REGEXP_SUBSTR (
REGEXP_SUBSTR (my_comment, '(cannot exceed|go over| excel above).*?[0-9] ?(kg|k.g)', 1, 1, 'i'),
'[0-9] '
) as "weight"
FROM reviews;
CodePudding user response:
I am able to extract the tid and weight, but only the first instance and not able to split it into rows.
Your query, modified:
- I didn't do much with what you already wrote as you seem to be satisfied with extracted
tid
andweight
- what I did change, was
regexp_substr
'soccurrence
parameter (was1
, now it iscolumn_value
)
- what I did change, was
- in order to get split data,
cross join
was added which "loops" throughmy_comment
as many times as there's greatest occurrence betweentid
andkg
(in any form)- for example, if there are 2
tid
and 1kg
, it'll "loop" 2 times - it is also used to avoid duplicates you'd get if you used only
connect by level
clause
- for example, if there are 2
You did tag the question as Oracle 10; I don't have it any more, but I know it doesn't support regexp_count
function. If that's really the case (you never answered Koen's question), then it won't work and you'll have to calculate number of tid
/weight
occurrences using some other way. I hope you aren't on 10g, though.
I ran this code in SQL*Plus. BREAK
is here just to nicely distinguish r_id
and my_comment
values, doesn't have any other purpose.
SQL> break on r_id on my_comment
SQL> SELECT r_id,
2 my_comment,
3 REGEXP_SUBSTR (REGEXP_SUBSTR (my_comment,
4 '(tid).*?[0-9] ',
5 1,
6 COLUMN_VALUE,
7 'i'),
8 '[0-9] ') AS "tid",
9 REGEXP_SUBSTR (
10 REGEXP_SUBSTR (
11 my_comment,
12 '(cannot exceed|go over| excel above).*?[0-9] ?(kg|k.g)',
13 1,
14 COLUMN_VALUE,
15 'i'),
16 '[0-9] ') AS "weight"
17 FROM reviews
18 CROSS JOIN
19 TABLE (
20 CAST (
21 MULTISET (
22 SELECT LEVEL
23 FROM DUAL
24 CONNECT BY LEVEL <= GREATEST (REGEXP_COUNT (my_comment, 'tid' , 1, 'i'),
25 REGEXP_COUNT (my_comment, '(kg|k.g)', 1, 'i')))
26 AS SYS.odcinumberlist));
which results in
R_ID MY_COMMENT tid weight
----- ------------------------------------------------------------------------- ------- -------
1 Boxes with the TID 823 cannot exceed 40 kg 823 40
2 Parcel with the marking tid 63157 must not make the weight go over 31 k.g 63157 31
3 Envelopes with TID 104124 and TID 92341 cant excel above 94.477kg 104124 94
92341
4 TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg 38204 45
8242602 92
5 Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG 94514 52
51
8 rows selected.
SQL>