I have a column in my dataframe with the values below:
A123R221343
A12323Q123213
L122F898
There are always two letters in the text: the first character is a letter, and the second letter can be at the 4th, 5th, 6th or 7th position.
I would like to derive a new column in pyspark containing only the digits between the two letters:
123
12323
122
I tried the regex [A-Za-z].*[A-Za-z]
and [\d].*[A-Za-z]
but both return the letters as well, which I do not want. I'm completely new to regex.
CodePudding user response:
Using [A-Za-z].*[A-Za-z] matches everything from the first occurrence of [A-Za-z] to the last occurrence of [A-Za-z], so both letters are part of the match.
Using [\d].*[A-Za-z] does the same, except that it starts at the first digit and does not make sure that there is a character A-Za-z before it.
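To see what the two attempted patterns actually return, here is a small sketch using Python's built-in re module on the sample values from the question. Using re instead of Spark is an assumption made for illustration only; for these simple patterns the behaviour should be the same as in Spark's regex functions.

```python
import re

samples = ["A123R221343", "A12323Q123213", "L122F898"]

# Greedy .* runs from the first letter up to the LAST letter,
# so both letters end up inside the match.
full = [re.search(r"[A-Za-z].*[A-Za-z]", s).group(0) for s in samples]
print(full)  # ['A123R', 'A12323Q', 'L122F']

# Starting with a digit drops the leading letter but keeps the trailing one.
from_digit = [re.search(r"[\d].*[A-Za-z]", s).group(0) for s in samples]
print(from_digit)  # ['123R', '12323Q', '122F']
```

In both cases the whole match is returned, which is why the letters show up in the result.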
What you can do is capture only digits in a capture group between 2 matches:
[A-Za-z](\d+)[A-Za-z]
See a regex demo
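As a rough sketch of how this could be applied in pyspark: regexp_extract from pyspark.sql.functions takes a column, a pattern and a group index, so passing 1 returns only the captured digits. The column names "value" and "digits" and the helper add_digits_column are hypothetical names chosen for this example, and the pattern is first checked with Python's re module so it can be tried without a Spark session.

```python
import re

# Pattern from the answer: a letter, one or more digits (captured), a letter.
PATTERN = r"[A-Za-z](\d+)[A-Za-z]"

samples = ["A123R221343", "A12323Q123213", "L122F898"]

# Quick check of the capture group with plain Python, no Spark required.
extracted = [re.search(PATTERN, s).group(1) for s in samples]
print(extracted)  # ['123', '12323', '122']


def add_digits_column(df, src="value", dst="digits"):
    """Return df with a new column dst holding capture group 1 of PATTERN.

    src and dst are placeholder column names (assumptions for this sketch).
    regexp_extract gives an empty string for rows where nothing matches.
    """
    # Imported here so the pattern check above runs without PySpark installed.
    from pyspark.sql import functions as F

    return df.withColumn(dst, F.regexp_extract(F.col(src), PATTERN, 1))
```

With an existing dataframe this would be called as df = add_digits_column(df, "value", "digits"). If the first character is guaranteed to be the first letter, anchoring the pattern with ^ would make that assumption explicit.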