Home > database >  Pandas slice text to new column with start stop location denoted by regular expression
Pandas slice text to new column with start stop location denoted by regular expression

Time:03-24

thank you for any help in solving this. I'm new to python still, but know VBA for excel.

When extracting start and stop positions to new columns in dataframe on regex conditions, integers are returned and found without a problem. but when trying to slice the text with that code, the following error occurs

TypeError: slice indices must be integers or None or have an index method

I have the following dataframe row of data and lines of code that I have tried after many different variation attempts over the last few days. I feel this is the closest I have been abe to solve it.

|MENU_HINT                         |StartPos  |EndPos 
|AUS / Maitland (AUS) 28th Feb     |4         |22      
df2['StartPos'] = df2["MENU_HINT"].str.find('/')

df2['EndPos'] = df2["MENU_HINT"].apply(lambda x: re.search('d ', x).start() if re.search('\d ', x) else re.search('\d ', x))   1

df2['Track'] = df2['MENU_HINT'].apply(lambda x: x[x.find('/ ')   1:df2['EndPos']])

I am expecting to extract the location name starting after the '/' and exclude the date into a new column from the MENU_HINT column.

I've even tried the below code which i found also provides the end position as an integer. But when either trying to use the column or code itself for the slice, it still provides the same error

df2['Track3'] = df2["MENU_HINT"].apply(lambda x: re.search(
    '\d ', x).start() if re.search('\d ', x) else re.search('\d ', x))   1

I am trying to find the first numeric digit as this field will only have a date that i want truncated.

I'm still in learning mode but can understand concepts well, so I would like to understand why this is occurring as it seems to be a useful knowledge base for me to have as i will come across this type of extraction quite often.

Thank you for your time and efforts in helping me solve this.

Sincerely, Paul

CodePudding user response:

This issue with this line:

df2['Track'] = df2['MENU_HINT'].apply(lambda x: x[x.find('/ ') 1:df2['EndPos']])

is that you are applying the function to column MENU_HINT, which rows will be referred to in the lambda function as x, and then inside the function getting df2['EndPos'], which will return a the whole column not just an integer, and therefore can't be used for indexing. To edit this, you can apply the function to rows instead, as follows:

df['Track3'] = df.apply(lambda x: x["MENU_HINT"][x["MENU_HINT"].find('/ ')   1:x['EndPos']], axis=1)

Note here that I am passing axis=1 to apply function, which will apply the function to the whole rows of the DataFrame, giving me access to any of them.

An alternative way could be to use a regular expression directly to extract the part you want, like this:

df['Track3'] = df["MENU_HINT"].apply(lambda x:re.search(r"[A-Za-z]  / ([A-Za-z | ( | )] )", x).group(1))

Here, I am search for a string that starts with alphabetical characters, followed by a slash, then again alphabetical characters or round brackets. The output of this looks like this:

        MENU_HINT                           Track
   AUS / Maitland (AUS) 28th Feb    Maitland (AUS)
   

CodePudding user response:

If you wish to make the function longer than one line (as lambda functions can restrict in some cases), which would enable a clearer and commented manipulation of the string, then you would do this:

import pandas as pd

# sample data for dataframe
d = {'menu_hint':['AUS / Maitland (AUS) 28th Feb']}

df = pd.DataFrame(d)

print('the old dataframe:')
print(df)

def strip_word(s:str):
    ''' function to strip parts of word '''

    # n is the start position
    n = s.find('/')
    n = n 2 # adjust for spaces

    # m is the end position
    m = s.find(') ')
    m = m 1 # adjust for zero index

    s_new = s[n:m]

    return s_new 



df['Track3'] = df['menu_hint'].apply(strip_word)

print('the new dataframe:')
print(df)

The output:

                       menu_hint
0  AUS / Maitland (AUS) 28th Feb
the new dataframe:
                       menu_hint          Track3
0  AUS / Maitland (AUS) 28th Feb  Maitland (AUS)
  • Related