I'm trying to apply several functions from a list of function objects to a specific DataFrame column, but I keep getting this error: "ValueError: Columns must be same length as key"
    possible_message_names = ['x', 'y', 'z']
    path_of_the_directory = r'{}'.format(path_of_the_directory)
    processing_list = [remove_whitespace, convert_to_unicode]
    for root, dirs, files in os.walk(path_of_the_directory):
        print("Normalizing the files in the directory: {}".format(root))
        for individual_file in tqdm(files):
            dataframe = pd.DataFrame(pd.read_excel(os.path.join(root, individual_file)))
            for possible_column_name in possible_message_names:
                if possible_column_name in dataframe.columns:
                    dataframe[possible_column_name] = dataframe[possible_column_name].apply(lambda text: method(text) if type(text) == str else text for method in processing_list)
            dataframe.to_excel('{}\\normalized_{}'.format(root, individual_file), index=False)
Any help is very welcome
P.S. I'm trying to normalize unicode (hence the convert_to_unicode function in the list)
EDIT: I noticed that doing
    dataframe[possible_column_name].apply(lambda text: method(text) if type(text) == str else next for method in processing_list)
instead of
    dataframe[possible_column_name] = dataframe[possible_column_name].apply(lambda text: method(text) if type(text) == str else next for method in processing_list)
avoids the error, but the functions aren't applied to the column this way...
Something like this seems to work:
    # Iterate over the methods the user added to the pipeline and apply each one to the column to be cleaned
    for method in processing_list:
        # Only apply entries that are function objects
        if callable(method):
            dataframe[possible_column_name] = dataframe[possible_column_name].apply(method)
Answer:
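Instead of looping over the methods inside the argument to apply, loop over them in the script and call apply once per method, so that each method runs on every row.
In the original line, the argument to apply is a generator expression that yields one lambda per method, not a single function. Judging by the error, pandas appears to treat it like a list of functions and return one result column per function, which cannot be assigned to a single column.
Looping in the script accumulates the modifications, because each pass works on the output of the previous one.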
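To see what apply receives in the question's line, here is a small pandas-free sketch that builds an argument of the same shape. The two functions are hypothetical stand-ins, not the asker's actual remove_whitespace and convert_to_unicode.

```python
def remove_whitespace(text):
    # Hypothetical stand-in for the asker's function: collapse runs of whitespace.
    return " ".join(text.split())


def to_upper(text):
    # Hypothetical second step, present only so the list has two entries.
    return text.upper()


processing_list = [remove_whitespace, to_upper]

# Same shape as the argument passed to apply() in the question:
# a generator expression whose elements are lambdas.
argument = (
    lambda text: method(text) if type(text) == str else text
    for method in processing_list
)

# A generator is not callable, so it cannot act as the per-cell function.
is_function = callable(argument)

# Consuming it yields one separate lambda per entry in processing_list.
lambdas = list(argument)
print(is_function, len(lambdas))
```

Since the argument is a collection of functions rather than one function, the loop over the methods belongs outside apply: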
    for possible_column_name in possible_message_names:
        if possible_column_name in dataframe.columns:
            for method in processing_list:
                dataframe[possible_column_name] = dataframe[possible_column_name].apply(lambda text: method(text) if type(text) == str else text)
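As a variation on the loop above (not the answer's exact approach), the functions can be chained into one per-cell helper so that apply is called only once per column. This is a minimal sketch: run_pipeline is a hypothetical helper name, the two cleaning functions are assumed stand-ins for the asker's own, and the sample DataFrame is made up for illustration.

```python
from functools import reduce
import unicodedata

import pandas as pd


def remove_whitespace(text):
    # Assumed stand-in for the asker's function: collapse runs of whitespace.
    return " ".join(text.split())


def convert_to_unicode(text):
    # Assumed stand-in: normalize the string to the NFKC Unicode form.
    return unicodedata.normalize("NFKC", text)


def run_pipeline(value, functions):
    # Hypothetical helper: non-string cells (None, numbers) pass through unchanged.
    if not isinstance(value, str):
        return value
    # Feed the output of each function into the next one, in list order.
    return reduce(lambda acc, func: func(acc), functions, value)


processing_list = [remove_whitespace, convert_to_unicode]

# Made-up sample data: two strings, a missing value, and a number.
dataframe = pd.DataFrame({"x": ["  hello   world ", "\ufb01ne", None, 42]})

for possible_column_name in ["x", "y", "z"]:
    if possible_column_name in dataframe.columns:
        dataframe[possible_column_name] = dataframe[possible_column_name].apply(
            lambda value: run_pipeline(value, processing_list)
        )

print(dataframe)
```

Both versions give the same result when every function returns a string. The single-pass version walks the column once instead of once per function, which may matter for large files.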