pandas regex: look ahead and behind from the 1st occurrence of a character

I have Python strings like the ones below:

"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"

I would like to do the following:

a) Extract the characters that appear immediately before and after the 1st dot.

b) The keywords that I want are always found after the last _ symbol.

For example, for the 2nd input string I would like to get only PQRST.GHI as output. PQRST comes after the last _ and before the 1st dot, and GHI is the keyword after the 1st dot.

So, I tried the below

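for s in strings:
    # text between the 1st and 2nd dot
    after_part = s.split('.')[1]
    # text before the 1st dot, then keep only what follows the last underscore
    before_part = s.split('.')[0]
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)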

Though this works, it is definitely not a nice or elegant way to do it, and it is not really a regex.

Is there a better way to write this?

I expect my output to look like the below. As you can see, we get the keywords before and after the 1st dot character:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV

CodePudding user response：

You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):

[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['ABCDEF.GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
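
To see why one split per separator is enough, the one-liner can be unpacked on a single sample string. This is only a sketch; the names tail and keyword are illustrative and not part of the answer above.

```python
s = "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"

# Split once from the right on "_": keep everything after the last underscore.
tail = s.rsplit('_', maxsplit=1)[1]        # 'PQRST.GHI.xlsx'

# Split once from the right on ".": drop only the final extension.
keyword = tail.rsplit('.', maxsplit=1)[0]

print(keyword)
```

Note that this keeps everything up to the last dot, so it matches the expected output only when exactly two dots follow the last underscore, as in the sample strings.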

CodePudding user response：

Try:

import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

for s in strings:
    print(pat.search(s).group(1))

Prints:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
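
One caveat with this approach: pat.search(s) returns None when a string does not fit the pattern, so calling .group(1) directly would raise an AttributeError. A minimal sketch of a guard, assuming the pattern is meant to use + quantifiers; the helper name extract_keyword is hypothetical:

```python
import re

# Skip to the last "_" before the first ".", then capture "<word>.<word>".
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

def extract_keyword(s):
    """Return the captured keyword, or None when the pattern does not match."""
    m = pat.search(s)
    return m.group(1) if m else None

print(extract_keyword("1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"))
print(extract_keyword("no_dots_here"))
```

The first call prints ABCDEF.GHI; the second prints None because the string contains no dot.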

CodePudding user response：

You can do:

df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)

Output:

0    ABCDEF.GHI
1     PQRST.GHI
2     JKLMN.OPQ
3       WXY.TUV
Name: text, dtype: object
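
The answer assumes a DataFrame df with a text column already exists. A self-contained sketch that builds one from the sample strings might look like this; the column name text comes from the answer, while the keyword column is an illustrative addition:

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]})

# "_" followed by a run containing no "." or "_" (so the match can only start
# at the last underscore before the first dot), a literal ".", then a run
# containing no ".".
df["keyword"] = df["text"].str.extract(r"_([^._]+\.[^.]+)", expand=False)

print(df["keyword"].tolist())
```

With expand=False and a single capture group, str.extract returns a Series rather than a one-column DataFrame, which is why it can be assigned directly to a new column.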