Extract strings from pandas df using regex-CodePudding

I need help with regex for Python Pandas dataframe. Testing strings would be:

s = pd.Series(['xslF345X03/was-form4_163347386959085.xml', 'xslF345X03/wf-form4_163347386959085.xmlasdf', 'xslF345/X03/wf-form4_163347386959085.xml'])

I would like to:

extract starting from the last '/' till the '.xml' at the end
extract only when the string ends with '.xml'

so that I get something like this:

xslF345X03/was-form4_163347386959085.xml      Extract: /was-form4_163347386959085.xml
xslF345X03/wf-form4_163347386959085.xmlasdf   Do not extract because the ending is not .xml
xslF345/X03/wf-form4_163347386959085.xml      Extract starting from the last '/' character: /wf-form4_163347386959085.xml

I figured I need following pandas code to extract using regex:

s.str.extract(...)

Thank you in advance :-)

CodePudding user response：

Use str.extract:

>>> s.str.extract(r'.*/(.*\.xml)$')
                               0
0  was-form4_163347386959085.xml
1                            NaN
2   wf-form4_163347386959085.xml

CodePudding user response：

To extract starting from the last '/' character (including the /) till end of .xml, use str.extract(), as follows:

s.str.extract(r'(/(?!.*/).*\.xml)$')

Regex Demo

Regex Details:

( - Start of capturing group for str.extract()

/ - match symbol / literally

(?!.*/) - Negative lookahead regex to assert no further symbol / after it (i.e. to ensure the symbol / is the last one)

.* - match zero or more characters

\. - match a dot literally (escape to avoid confusion with regex meta-character)

xml - match string xml literally

) - End of capturing group for str.extract()

$ - assert at end of line (to ensure the .xml is at the end)

Result:

                                0
0  /was-form4_163347386959085.xml
1                             NaN
2   /wf-form4_163347386959085.xml

CodePudding user response：

You can check with str.endswith then pass to the np.where

np.where(s.str.endswith('.xml'),s.str.rsplit('/',n=1).str[-1],np.nan)
Out[99]: 
array(['was-form4_163347386959085.xml', nan,
       'wf-form4_163347386959085.xml'], dtype=object)