Home > OS >  pandas extract keywords after a list of select keywords
pandas extract keywords after a list of select keywords

Time:09-30

I have a list of strings like below

strings = [
    "1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]

I would like to do the below

a) Identify ASN, KOR, IND, TWN keyword and pick whatever comes after it

b) Identify .xlsx keyword and pick whatever comes before it.

c) The resulting output should not start with _ but it can have underscore in between the keywords of output string (but it should not start with _)

I tried the below (Based on inspiration from this post

regex_output = re.compile(r"\_[ASN|KOR|TWN|IND]{3}\_([a-zA-Z\_] )")

for s in strings:
    print(regex_output.search(s).group(1))

How can I put a condition that it should be only looking for words before .xlsx

I expect my output to be like as shown below (exclude .xlsx keyword and _ symbol before the start of output string)

ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV

I expect my output to be like as below

ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV

CodePudding user response:

import re
result =  [re.split('.xlsx', re.split('ASN|KOR|IND|TWN', i)[-1])[0].replace('_', '', 1) for i in strings]

CodePudding user response:

Try (regex101):

import re

strings = [
    "1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]

pat = re.compile(r"(?:ASN|KOR|IND|TWN)_(.*?)_?\.xlsx")

for s in strings:
    m = pat.search(s)
    if m:
        print(m.group(1))

Prints:

ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV

CodePudding user response:

Assuming a DataFrame/Series, you can use str.extract with the regex _(?:ASN|KOR|TWN|IND)_([a-zA-Z_] [a-zA-Z]):

regex = r"_(?:ASN|KOR|TWN|IND)_([a-zA-Z_] [a-zA-Z])"
df['out'] = df['col'].str.extract(regex)

output:

                                                 col             out
0     1234_4534_41247612_2462184_ASN_ABCDEF_GHI.xlsx      ABCDEF_GHI
1  1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184...       PQRST_GHI
2  12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_...  JKLMN_abcd_OPQ
3  1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621...             WXY
4  1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621...         WXY_TUV

Regex:

_                      # underscore
(?:ASN|KOR|TWN|IND)    # any of the words
_                      # underscore
([a-zA-Z_] )           # capture letters/underscore
  • Related