I have a dataframe that looks like this:
id  col1
1   ACC 12-34-11-123-122-A
2   ACC TASKS 12-34-11-123-122-B
3   ABB 12-34-11-123-122-C
I want to extract the code from the first and second lines (12-34-11-123-122-A, 12-34-11-123-122-B), which have ACC before them.
I found this answer and this is my attempt:
F.regexp_extract(F.col("col_1"), r'(.)(ACC)(\s+)(\b\d{2}\-\d{2}\-\d{2}\-\d{3}\-[A-Z0-9]{0,3}\b)', 4)
I have to add the second group (ACC) because the ABB code has the same format.
How can I fix my regex to extract both ACC and ACC TASKS from this dataframe?
CodePudding user response:
You may use this regex:
(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})
Here (\bACC(?:\s+TASKS)?) matches ACC or ACC TASKS before matching a given pattern.
For your Python code (the code is in capture group 2, since this regex has only two groups):
F.regexp_extract(F.col("col_1"), r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})', 2)
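As a quick sanity check, here is a minimal sketch that runs this answer's pattern (written with \s+ for one or more whitespace characters) against the three sample values from the question. It uses Python's built-in re module rather than Spark, on the assumption that this particular pattern behaves the same in both engines; the helper name extract_code is hypothetical.

```python
import re

# Pattern from the answer; group 1 is the ACC / ACC TASKS prefix, group 2 is the code.
PATTERN = re.compile(
    r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})'
)

# Sample values copied from the question's dataframe.
ROWS = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]


def extract_code(value: str) -> str:
    """Return capture group 2 of the first match, or '' when nothing matches
    (regexp_extract also yields an empty string when the regex does not match)."""
    match = PATTERN.search(value)
    return match.group(2) if match else ""


codes = [extract_code(value) for value in ROWS]
print(codes)  # ['12-34-11-123-122', '12-34-11-123-122', '']
```

Note that the match stops before the trailing -A / -B, because [A-Z0-9]{0,3} consumes 122 and the pattern ends there. If the suffix is wanted as well, the pattern would need to be extended, for example by appending -[A-Z] inside group 2.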
CodePudding user response:
With your shown samples, please try the following regex.
^ACC(?:\s+TASKS)?\s+\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3}(?=-[A-Z]$)
Explanation: a detailed breakdown of the above regex.
^ACC(?:\s+TASKS)?\s+       ##Matching ACC at the start of the value, optionally followed by spaces (1 or more occurrences) and TASKS
                           ##(this non-capturing group is optional), followed by spaces (1 or more occurrences).
\d{2}(?:-\d{2}){2}-\d{3}-  ##Matching 2 digits followed by a non-capturing group which matches 2 occurrences of - followed by 2 digits;
                           ##the non-capturing group is further followed by -, 3 digits, and another -.
[A-Z0-9]{0,3}(?=-[A-Z]$)   ##Matching capital A to Z or 0-9, from 0 to 3 occurrences, then making sure this is
                           ##followed by a dash and a capital A to Z at the end of the line/value.
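The anchored regex above has no capture group, so its whole match (group 0) includes the ACC or ACC TASKS prefix. The sketch below checks that with Python's re module on the question's sample values, then shows a hypothetical variant, ANCHORED_GROUPED, that wraps the code part in a capture group so the code alone can be pulled out. Both patterns are written with \s+ as the explanation describes; the variant and the helper names are assumptions for illustration, not part of the original answer.

```python
import re

# The answer's pattern: anchored at the start, lookahead for a final "-<letter>".
ANCHORED = re.compile(
    r'^ACC(?:\s+TASKS)?\s+\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3}(?=-[A-Z]$)'
)

# Hypothetical variant: identical, but with a capture group around the code.
ANCHORED_GROUPED = re.compile(
    r'^ACC(?:\s+TASKS)?\s+(\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3})(?=-[A-Z]$)'
)

ROWS = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]


def whole_match(value: str) -> str:
    """Group 0 of the answer's pattern, or '' when there is no match."""
    match = ANCHORED.search(value)
    return match.group(0) if match else ""


def code_only(value: str) -> str:
    """Group 1 of the grouped variant, or '' when there is no match."""
    match = ANCHORED_GROUPED.search(value)
    return match.group(1) if match else ""


whole = [whole_match(value) for value in ROWS]
codes = [code_only(value) for value in ROWS]
print(whole)  # ['ACC 12-34-11-123-122', 'ACC TASKS 12-34-11-123-122', '']
print(codes)  # ['12-34-11-123-122', '12-34-11-123-122', '']
```

With a variant like this, the group index passed to F.regexp_extract would be 1 rather than 0.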