pandas regex look ahead and behind from a 1st occurrence of character

Time:09-29

I have Python strings like the ones below:

"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"

I would like to do the following:

a) extract the characters that appear immediately before and after the 1st dot

b) the keywords that I want are always found after the last _ symbol

For example, for the 2nd input string I would like to get only PQRST.GHI as output: PQRST comes after the last _ and before the 1st dot, and GHI is the keyword right after the 1st dot.

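So, I tried the following:

for s in strings:
   after_part = s.split('.')[1]               # text between the 1st and 2nd dot
   before_part = s.split('.')[0]              # everything before the 1st dot
   before_part = before_part.split('_')[-1]   # keep only what follows the last _
   expected_keyword = before_part + "." + after_part
   print(expected_keyword)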

Though this works, it is neither nice nor elegant, and it does not use a regex.

Is there a better way to write this?

I expect my output to look like the following. As you can see, we get the keywords before and after the 1st dot character:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV

CodePudding user response:

You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):

[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['ABCDEF.GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
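
To make the two rsplit() calls easier to follow, here is a step-by-step sketch of the same idea. The helper name keyword_from_filename is made up for illustration, and it uses [-1] instead of [1] so that a name without any underscore is returned whole rather than raising an IndexError; both choices are assumptions, not part of the answer above.

```python
def keyword_from_filename(name):
    """Hypothetical helper: keep the text after the last '_' minus the final extension."""
    # Step 1: split once from the right on '_' and keep the last piece,
    # e.g. 'ABCDEF.GHI.xlsx'. With no '_' present this is the whole name.
    tail = name.rsplit('_', maxsplit=1)[-1]
    # Step 2: split once from the right on '.' and keep the left piece,
    # which drops only the final extension, e.g. 'ABCDEF.GHI'.
    return tail.rsplit('.', maxsplit=1)[0]


strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

keywords = [keyword_from_filename(s) for s in strings]
print(keywords)
```

Note that this approach removes the last extension instead of keeping exactly one segment on each side of the 1st dot. For the sample strings the result is the same, but a name with more dots after the last _ (say, a hypothetical "archive_DATA.v1.2.csv") would give DATA.v1.2 here.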

CodePudding user response:

Try this pattern:

import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

for s in strings:
    print(pat.search(s).group(1))

Prints:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
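
One thing to watch for with this approach: search() returns None when a string does not match, so calling .group(1) directly would raise an AttributeError. A minimal sketch of a guarded version follows; the function name extract_keyword and the extra test strings are assumptions added for illustration.

```python
import re

# Greedy [^.]+_ runs up to the last underscore before the 1st dot;
# the group then captures one dot-free segment on each side of that dot.
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")


def extract_keyword(name):
    """Hypothetical wrapper: return the keyword, or None if the name does not match."""
    m = pat.search(name)
    return m.group(1) if m else None


print(extract_keyword("1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"))
print(extract_keyword("no_dots_here"))   # no dot at all, so no match
print(extract_keyword("report.xlsx"))    # no underscore, so no match
```

The first call prints ABCDEF.GHI, and the other two print None instead of failing.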

CodePudding user response:

You can do this in pandas with str.extract:

df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)

Output:

0    ABCDEF.GHI
1     PQRST.GHI
2     JKLMN.OPQ
3       WXY.TUV
Name: text, dtype: object