Extracting a string from between two strings in a DataFrame

Time:11-02

I'm trying to extract a value from my DataFrame. I have a column ['DESC'] that contains strings in the following format:

_000it_ZZZ$$$-

_0780it_ZBZT$$$-

_011it_BB$$$-

_000it_CCCC$$$-

I want to extract the string between 'it_' and '$$$'

I have tried this code, but it does not seem to work:

# initializing substrings
sub1 = "it_"
sub2 = "$$$"
 
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
 
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]

I would appreciate your help

CodePudding user response:

You can use str.extract to get the desired output in your new column.

import pandas as pd
import re

df = pd.DataFrame({
    'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})

pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
               DESC results
0    _000it_ZZZ$$$-     ZZZ
1  _0780it_ZBZT$$$-    ZBZT
2     _011it_BB$$$-      BB
3   _000it_CCCC$$$-    CCCC
4    _000it_123$$$-     123

You can see the regex pattern on Regex101 for more details.
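
If the two delimiters are not fixed, one variant of the same idea is to build the pattern from them with re.escape, so that characters such as $ are treated literally. This is only a sketch: the helper name extract_between is made up for illustration, and it assumes each row contains both delimiters (rows that do not match come back as NaN).

```python
import re

import pandas as pd


def extract_between(series, start, end):
    # Hypothetical helper (not part of pandas): capture the shortest run of
    # characters that sits between the literal strings `start` and `end`.
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    # expand=False returns a Series when there is a single capture group.
    return series.str.extract(pattern, expand=False)


df = pd.DataFrame({
    "DESC": ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
})
df["results"] = extract_between(df["DESC"], "it_", "$$$")
print(df)
```

Here sub1 and sub2 from the question would be passed as start and end, so the delimiters can change without rewriting the regex by hand.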

CodePudding user response:

You could try using a regex pattern. It matches your cases you listed here, but I can't guarantee that it will generalize to all possible patterns.

import re

string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']

The pattern looks for it_ in the string, then captures everything after it up to the last word character, so the trailing non-word characters ($$$-) are left out.
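
To check this against the other sample strings from the question, the same pattern can be run over each of them. This sketch uses only the standard re module; the names samples and results are just for illustration.

```python
import re

# Sample values copied from the question.
samples = ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]

p = re.compile(r"(?<=it_)(.*)(?<!\W)")

# search() returns the first match; group(1) is the captured text.
results = [p.search(s).group(1) for s in samples]
print(results)  # ['ZZZ', 'ZBZT', 'BB', 'CCCC']
```

Because .* is greedy, a value with a word character after the $$$ (for example _000it_ZZZ$$$x) would be captured up to that later character, which is the kind of case where this pattern may not generalize.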
