PySpark regex to extract string with two conditions

I have a dataframe that looks like this:

id  col1
1   ACC 12-34-11-123-122-A
2   ACC TASKS 12-34-11-123-122-B
3   ABB 12-34-11-123-122-C

I want to extract the codes from the first and second rows (12-34-11-123-122-A, 12-34-11-123-122-B), which are preceded by ACC.

I found this answer and this is my attempt:

F.regexp_extract(F.col("col1"), r'(.)(ACC)(\s+)(\b\d{2}\-\d{2}\-\d{2}\-\d{3}\-[A-Z0-9]{0,3}\b)', 4)

I have to add the second group (ACC) because the ABB code has the same format.

How can I fix my regex to extract both ACC and ACC TASKS from this dataframe?

CodePudding user response：

You may use this regex:

(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})

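Here (\bACC(?:\s+TASKS)?) matches ACC or ACC TASKS before the code pattern. The pattern has only two capturing groups: group 1 is the prefix and group 2 is the code, so the group index to extract is 2.

For your Python code:

F.regexp_extract(F.col("col1"), r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})', 2)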
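
As a quick sanity check outside Spark, the two-group pattern can be tried against the three sample values with Python's re module. This is only a sketch under assumptions: Spark's regexp_extract uses Java regex rather than Python's engine, although the constructs used here (\b, \s+, (?:...), {m,n}) behave the same way for this input. The helper name extract_code is hypothetical, and it returns an empty string when nothing matches, as regexp_extract does.

```python
import re

# Two capturing groups: group 1 is the "ACC" / "ACC TASKS" prefix, group 2 is the code.
PATTERN = re.compile(
    r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})'
)


def extract_code(value: str, group: int = 2) -> str:
    """Hypothetical helper: return the requested group, or '' when there is no match."""
    match = PATTERN.search(value)
    return match.group(group) if match else ""


# Sample values copied from the question's dataframe.
samples = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]

results = [extract_code(s) for s in samples]
print(results)  # ['12-34-11-123-122', '12-34-11-123-122', '']
```

Note that the captured value stops at 122, because [A-Z0-9]{0,3} does not cross the last dash. If the trailing letter should be part of the result (an assumption about the intended output), one option is to add (?:-[A-Z])? at the end of group 2.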

CodePudding user response：

With your shown samples, please try the following regex.

^ACC(?:\s+TASKS)?\s+\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3}(?=-[A-Z]$)

Explanation of the regex above:

^ACC(?:\s+TASKS)?\s+        ##Match ACC at the start of the value, optionally followed by whitespace (1 or more occurrences)
                            ##and TASKS (an optional non-capturing group), then whitespace (1 or more occurrences).
\d{2}(?:-\d{2}){2}-\d{3}-   ##Match 2 digits, then a non-capturing group that matches a dash followed by 2 digits, twice;
                            ##the group is followed by a dash, 3 digits and another dash.
[A-Z0-9]{0,3}(?=-[A-Z]$)    ##Match 0 to 3 characters from A-Z or 0-9, then check with a lookahead that they are
                            ##followed by a dash and a single capital letter at the end of the value.
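
The anchored pattern has no capturing groups, so only the whole match (group 0) is available, and that match includes the ACC or ACC TASKS prefix. The sketch below checks this with Python's re module as a stand-in for Spark's Java regex engine. The second pattern is an adaptation that is not part of the answer: it wraps the code in a capturing group so the code can be extracted on its own. The names whole_match and code_only are hypothetical.

```python
import re

# Pattern from the answer, unchanged: no capturing groups, only group 0 exists.
ANCHORED = re.compile(
    r'^ACC(?:\s+TASKS)?\s+\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3}(?=-[A-Z]$)'
)

# Adaptation (assumption, not in the answer): capture only the code part as group 1.
ANCHORED_GROUPED = re.compile(
    r'^ACC(?:\s+TASKS)?\s+(\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3})(?=-[A-Z]$)'
)


def whole_match(value: str) -> str:
    """Return the full match (group 0), or '' when the value does not match."""
    match = ANCHORED.search(value)
    return match.group(0) if match else ""


def code_only(value: str) -> str:
    """Return just the code (group 1 of the adapted pattern), or ''."""
    match = ANCHORED_GROUPED.search(value)
    return match.group(1) if match else ""


print(whole_match("ACC TASKS 12-34-11-123-122-B"))  # ACC TASKS 12-34-11-123-122
print(code_only("ACC TASKS 12-34-11-123-122-B"))    # 12-34-11-123-122
print(repr(code_only("ABB 12-34-11-123-122-C")))    # ''
```

In PySpark, the unchanged pattern would be used with group index 0 in F.regexp_extract, while the adapted pattern would be used with index 1. The lookahead only checks for the trailing dash and letter, so neither variant includes them in the result.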