I need help with regex for Python Pandas dataframe. Testing strings would be:
s = pd.Series(['xslF345X03/was-form4_163347386959085.xml', 'xslF345X03/wf-form4_163347386959085.xmlasdf', 'xslF345/X03/wf-form4_163347386959085.xml'])
I would like to:
- extract starting from the last '/' till the '.xml' at the end
- extract only when the string ends with '.xml'
so that I get something like this:
xslF345X03/was-form4_163347386959085.xml Extract: /was-form4_163347386959085.xml
xslF345X03/wf-form4_163347386959085.xmlasdf Do not extract because the ending is not .xml
xslF345/X03/wf-form4_163347386959085.xml Extract starting from the last '/' character: /wf-form4_163347386959085.xml
I figured I need following pandas code to extract using regex:
s.str.extract(...)
Thank you in advance :-)
CodePudding user response:
Use str.extract
:
>>> s.str.extract(r'.*/(.*\.xml)$')
0
0 was-form4_163347386959085.xml
1 NaN
2 wf-form4_163347386959085.xml
CodePudding user response:
To extract starting from the last '/' character (including the /
) till end of .xml
, use str.extract()
, as follows:
s.str.extract(r'(/(?!.*/).*\.xml)$')
Regex Details:
(
- Start of capturing group for str.extract()
/
- match symbol / literally
(?!.*/)
- Negative lookahead regex to assert no further symbol / after it (i.e. to ensure the symbol / is the last one)
.*
- match zero or more characters
\.
- match a dot literally (escape to avoid confusion with regex meta-character)
xml
- match string xml
literally
)
- End of capturing group for str.extract()
$
- assert at end of line (to ensure the .xml
is at the end)
Result:
0
0 /was-form4_163347386959085.xml
1 NaN
2 /wf-form4_163347386959085.xml
CodePudding user response:
You can check with str.endswith
then pass to the np.where
np.where(s.str.endswith('.xml'),s.str.rsplit('/',n=1).str[-1],np.nan)
Out[99]:
array(['was-form4_163347386959085.xml', nan,
'wf-form4_163347386959085.xml'], dtype=object)