I have Python strings like the ones below:
"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the following:
a) Extract the characters that appear before and after the 1st dot.
b) The keywords that I want are always found after the last _ symbol.
For example, for the 2nd input string I would like to get only PQRST.GHI as output. It is the part after the last _ and before the 1st ., plus the keyword that follows the 1st .
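So, I tried the following:
for s in strings:
    after_part = s.split('.')[1]    # text between the 1st and 2nd dot
    before_part = s.split('.')[0]   # everything before the 1st dot
    before_part = before_part.split('_')[-1]   # keep only the part after the last _
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)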
Though this works, it is definitely not a nice or elegant way to do it.
Is there a better way to write this, for example with a regex?
I expect my output to be as shown below. As you can see, we get the keywords before and after the 1st dot character:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
CodePudding user response:
You can also do it with rsplit(). Specify maxsplit so that you don't split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['ABCDEF.GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
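To make the two rsplit() steps easier to follow, here is a minimal sketch that applies them one at a time to the 2nd string from the question. The names `tail` and `keyword` are mine, chosen only for illustration.

```python
s = "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"

# Split once from the right on '_': index 1 is everything after the last underscore.
tail = s.rsplit('_', maxsplit=1)[1]
print(tail)     # PQRST.GHI.xlsx

# Split once from the right on '.': index 0 drops only the final extension.
keyword = tail.rsplit('.', maxsplit=1)[0]
print(keyword)  # PQRST.GHI
```

Note that this assumes there is exactly one extension after the keyword, as in the sample strings. A name ending in something like .tar.gz would keep the .tar part.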
CodePudding user response:
Try (regex101):
import re
strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")
for s in strings:
    print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
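If it helps, the pattern `[^.]+_([^.]+\.[^.]+)` can be spelled out with re.VERBOSE so that each part carries a comment. This is only a sketch: the helper `extract_keyword` is a hypothetical name, and returning None for a non-matching string is my assumption about what you would want, since `pat.search(s).group(1)` raises AttributeError when there is no match.

```python
import re

pat = re.compile(r"""
    [^.]+ _      # greedy run of non-dot characters, ending at the last '_' before the first dot
    (            # group 1: the keyword
      [^.]+      #   text between that '_' and the first dot
      \.         #   the first dot
      [^.]+      #   text up to the next dot
    )
""", re.VERBOSE)

def extract_keyword(s):
    """Return the keyword, or None if the string does not match."""
    m = pat.search(s)
    return m.group(1) if m else None

print(extract_keyword("1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"))  # ABCDEF.GHI
print(extract_keyword("no-underscore-or-dot"))  # None
```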
CodePudding user response:
If the strings are in a pandas DataFrame column, you can do:
df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)
Output:
0    ABCDEF.GHI
1     PQRST.GHI
2     JKLMN.OPQ
3       WXY.TUV
Name: text, dtype: object
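For completeness, a self-contained sketch of the same idea. It assumes pandas is installed and that the strings live in a column named `text`. The DataFrame here is built by hand purely for illustration, since the question does not show one.

```python
import pandas as pd

# Hypothetical DataFrame holding the question's strings in a 'text' column.
df = pd.DataFrame({"text": [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]})

# With a single capture group and expand=False, str.extract returns a Series.
keywords = df["text"].str.extract(r"_([^._]+\.[^.]+)", expand=False)
print(keywords.tolist())
# ['ABCDEF.GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
```

Rows that do not match the pattern come back as NaN rather than raising an error.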