Thanks for reading and (hopefully) helping out! I'm stumped by pandas apply. I'm applying a regex function that works perfectly well on an ordinary string, but when I use it on a DataFrame it simply returns the original cell value. Here's the function:
def match_pattern(df_cell):
    if type(df_cell) == str:
        result = re.search(r'(?:[0-9]{1,4}\s)(.*)(?=\nName)', df_cell)
        if result:
            print('result.group(1)',result.group(1))
            return result.group(1)
        else:
            print('no result')
            return df_cell
    else:
        return df_cell
Now, this works nicely on a string. For example:
string = '3971 Small Arms Survey\nName'
string2 = 'nothing here'
match_pattern(string) # outputs 'Small Arms Survey' which is what i want
match_pattern(string2) # outputs 'nothing here'
But it does not seem to work when I use it on a DataFrame with apply:
frame = pd.DataFrame(['3971 Small Arms Survey\nName'])
frame2 = frame.apply(lambda x: match_pattern(str(x)))
frame2 # outputs '3971 Small Arms Survey\nName'
I would try other things like iterrows or itertuples, but ultimately this regex function is supposed to be used on every cell of a large DataFrame, and anything slower than apply is hardly feasible.
The print statements in match_pattern() are merely for debugging. In case you're wondering, the print('result.group(1)',result.group(1)) call is triggered in both cases: when applied to 'string' and when applied to the DataFrame. However, the printouts are not the same. In both cases the printout is what the function returns; for the DataFrame that is simply the string that was in the DataFrame to begin with, whereas for the plain string it is the filtered text I want (i.e. group(1) of the regex inside the function).
Many thanks to Wiktor Stribiżew, whose comment answered my question! It turns out it was a simple, dumb error. Using apply on the column of the DataFrame works:
frame = frame[0].apply(match_pattern) # outputs 'Small Arms Survey' for the cell, which is what i want
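To see why the original call returned the cell unchanged, it helps to check what each apply actually passes to the function. By default DataFrame.apply hands over one whole column (a Series) at a time, so str(x) in the lambda stringified the entire Series rather than a single cell. The following is a minimal sketch using the same frame as above; the variable names col_types and cell_types are only illustrative.

```python
import pandas as pd

frame = pd.DataFrame(['3971 Small Arms Survey\nName'])

# DataFrame.apply (default axis=0) calls the function once per column,
# passing the column as a Series.
col_types = frame.apply(lambda col: type(col).__name__)

# Series.apply calls the function once per cell, passing the raw value.
cell_types = frame[0].apply(lambda cell: type(cell).__name__)

print(col_types.iloc[0])   # type name seen by DataFrame.apply
print(cell_types.iloc[0])  # type name seen by Series.apply
```

The first print shows the type the function received from DataFrame.apply, the second the type it received from Series.apply; only in the second case does match_pattern get the bare string it expects.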
CodePudding user response:
You can run apply on the 0th column:
import re
import pandas as pd
def match_pattern(df_cell):
    if isinstance(df_cell, str):
        result = re.search(r'[0-9]{1,4}\s(.*)\nName', df_cell)
        if result:
            print('result.group(1)',result.group(1))
            return result.group(1)
        else:
            print('no result')
            return df_cell
    else:
        return df_cell
frame = pd.DataFrame(['3971 Small Arms Survey\nName'])
frame[0] = frame[0].apply(match_pattern)
# => frame
# 0
# 0 Small Arms Survey
Note that I reduced the regex to [0-9]{1,4}\s(.*)\nName, as all you need is the text captured into Group 1; the non-capturing group and the lookahead do not change what that group captures.
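As a quick check of that simplification, here is a small sketch comparing the two patterns on the sample string from the question. The variable names original and reduced are only illustrative.

```python
import re

text = '3971 Small Arms Survey\nName'

# Pattern from the question: non-capturing group plus lookahead.
original = re.search(r'(?:[0-9]{1,4}\s)(.*)(?=\nName)', text)

# Reduced pattern: same capture group, no extra constructs.
reduced = re.search(r'[0-9]{1,4}\s(.*)\nName', text)

# Group 1 is identical; only the overall match (group 0) differs,
# because the reduced pattern also consumes the trailing '\nName'.
print(original.group(1) == reduced.group(1))
print(repr(original.group(0)), repr(reduced.group(0)))
```

Since the function only ever returns group(1), the difference in the overall match does not matter here.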
Also, IMHO if isinstance(df_cell, str): is a tidier way to check the type of df_cell.
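Since the question mentions running the function on every cell of a large DataFrame, one possible way to extend the per-column fix is to map the function over each column in turn. This is only a sketch: the two-column frame and its column names 'a' and 'b' are made up for illustration, and the debugging prints are left out of match_pattern.

```python
import re
import pandas as pd

def match_pattern(df_cell):
    # Only strings are searched; any other value is returned unchanged.
    if isinstance(df_cell, str):
        result = re.search(r'[0-9]{1,4}\s(.*)\nName', df_cell)
        if result:
            return result.group(1)
    return df_cell

# Hypothetical sample data, not from the question.
frame = pd.DataFrame({
    'a': ['3971 Small Arms Survey\nName', 'nothing here'],
    'b': [42, '12 Another Title\nName'],
})

# DataFrame.apply passes each column as a Series;
# Series.map then calls match_pattern once per cell.
cleaned = frame.apply(lambda col: col.map(match_pattern))
print(cleaned)
```

Depending on the pandas version, DataFrame.applymap (older releases) or DataFrame.map (2.1 and later) should give the same element-wise behaviour in a single call.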