I have a list of strings like the ones below:
strings = [
"1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]
I would like to do the following:
a) Identify the ASN, KOR, IND, or TWN keyword and pick whatever comes after it.
b) Identify the .xlsx keyword and pick whatever comes before it.
c) The resulting output should not start with _, but it can have underscores between the words of the output string.
I tried the below (based on inspiration from this post):
regex_output = re.compile(r"\_[ASN|KOR|TWN|IND]{3}\_([a-zA-Z\_]+)")
for s in strings:
    print(regex_output.search(s).group(1))
How can I add a condition so that it only looks for words before .xlsx?
I expect my output to look like this (excluding the .xlsx keyword and the _ symbol before the start of the output string):
ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV
CodePudding user response:
import re
# Take the text after the keyword and before ".xlsx", then strip the surrounding underscores.
result = [re.split(r'\.xlsx', re.split('ASN|KOR|IND|TWN', i)[-1])[0].strip('_') for i in strings]
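Unpacked into steps, the same split-based idea might look like the sketch below. The helper name `extract_suffix` is made up for illustration, and it uses `strip('_')` so that both a leading and a trailing underscore are dropped (the first sample string ends in `_.xlsx`).

```python
import re

# Hypothetical helper, not part of the original answer.
def extract_suffix(name):
    # Keep whatever follows the last country keyword.
    tail = re.split(r"ASN|KOR|IND|TWN", name)[-1]
    # Keep whatever precedes the literal ".xlsx" extension.
    stem = re.split(r"\.xlsx", tail)[0]
    # Drop leading and trailing underscores, keep the inner ones.
    return stem.strip("_")

print(extract_suffix("1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx"))  # ABCDEF_GHI
```

Note that this splits on the keyword anywhere in the name, so it assumes the keyword letters do not also appear later in the part you want to keep.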
CodePudding user response:
Try (regex101):
import re
strings = [
"1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]
pat = re.compile(r"(?:ASN|KOR|IND|TWN)_(.*?)_?\.xlsx")
for s in strings:
    m = pat.search(s)
    if m:
        print(m.group(1))
Prints:
ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV
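If it helps, the same pattern can be wrapped in a small function. The sketch below adds a `$` anchor (an assumption on my part, so that `.xlsx` must be the very end of the name) and returns `None` when nothing matches. `suffix_before_xlsx` is just an illustrative name.

```python
import re

# Same alternation as above; the trailing "$" is an added assumption
# that ".xlsx" is the very end of the file name.
PAT = re.compile(r"(?:ASN|KOR|IND|TWN)_(.*?)_?\.xlsx$")

def suffix_before_xlsx(name):
    """Return the text between the keyword and ".xlsx", or None."""
    m = PAT.search(name)
    return m.group(1) if m else None

print(suffix_before_xlsx("1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx"))  # ABCDEF_GHI
print(suffix_before_xlsx("no_keyword_here.xlsx"))  # None
```

The lazy `.*?` together with the optional `_?` is what keeps a trailing underscore out of the captured group.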
CodePudding user response:
Assuming a DataFrame/Series, you can use str.extract with the regex _(?:ASN|KOR|TWN|IND)_([a-zA-Z_]+[a-zA-Z]):
regex = r"_(?:ASN|KOR|TWN|IND)_([a-zA-Z_]+[a-zA-Z])"
df['out'] = df['col'].str.extract(regex)
output:
                                                 col             out
0     1234_4534_41247612_2462184_ASN_ABCDEF_GHI.xlsx      ABCDEF_GHI
1  1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184...       PQRST_GHI
2  12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_...  JKLMN_abcd_OPQ
3  1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621...             WXY
4  1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621...         WXY_TUV
Regex:
_ # underscore
(?:ASN|KOR|TWN|IND) # any of the words
_ # underscore
([a-zA-Z_]+[a-zA-Z]) # capture letters/underscores, ending on a letter
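For plain Python (no pandas), the commented breakdown can be kept next to the pattern itself with `re.VERBOSE`. This is only a sketch, assuming the capture group is `[a-zA-Z_]+[a-zA-Z]` so that the match always ends on a letter.

```python
import re

# The commented breakdown, written as a verbose pattern.
regex = re.compile(
    r"""
    _                      # underscore
    (?:ASN|KOR|TWN|IND)    # any of the words
    _                      # underscore
    ([a-zA-Z_]+[a-zA-Z])   # capture letters/underscores, ending on a letter
    """,
    re.VERBOSE,
)

m = regex.search("1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx")
print(m.group(1))  # ABCDEF_GHI
```

Because the group must end on a letter, the regex backtracks past a trailing underscore, and the `.` of `.xlsx` stops the match since it is not in the character class.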