I've got a pandas DataFrame that contains NumPy arrays in some columns:
import numpy as np, pandas as pd
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
'col2': [np.array([5, 6]), np.array([7, 8])],
'col3': [9, 10]}
df = pd.DataFrame(data)
I need to store a large frame like this one in a CSV file, but the arrays have to be strings that look like this:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
Currently I iterate over every row and every array column of the DataFrame, but that doesn't seem efficient.
This is my current solution:
pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]
for index, row in df.iterrows():
    for column in array_columns:
        # Here 'tuple' is only used to replace brackets with parentheses
        df[column][index] = str(tuple(row[column]))
I tried using apply, although I've heard it's usually not an efficient alternative:
def array_to_str(array):
    return str(tuple(array))
df[array_columns] = df[array_columns].apply(array_to_str)
But my arrays become NaN:
col1 col2 col3
0 NaN NaN 9
1 NaN NaN 10
I tried other similar solutions, but the error:
ValueError: Must have equal len keys and value when setting with an iterable
appeared quite often.
Is there a more efficient way of performing the same operation? My real dataframes can contain many columns and thousands of rows.
CodePudding user response:
Try this:
tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv(index=False)
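A minimal end-to-end sketch of the same idea. DataFrame.apply hands each whole column (a Series) to the function, which is why an inner per-cell apply is needed to reach the individual arrays. The column detection reuses the first-row check from the question and assumes every cell in such a column is an array. The tolist() call is an addition of mine: in NumPy 2 and later the repr of a NumPy scalar includes its type name, so converting to plain Python numbers first keeps the string form as "(1, 2)". csv_text is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [np.array([1, 2]), np.array([3, 4])],
    'col2': [np.array([5, 6]), np.array([7, 8])],
    'col3': [9, 10],
})

# Detect array columns from the first row, as in the question
# (assumption: every cell in such a column is an ndarray).
array_columns = [c for c in df.columns if isinstance(df[c].iloc[0], np.ndarray)]

# Outer apply visits each column; inner apply converts each cell.
# tolist() turns NumPy scalars into plain Python numbers before str().
df[array_columns] = df[array_columns].apply(
    lambda col: col.apply(lambda a: str(tuple(a.tolist())))
)

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Fields that contain a comma are quoted by to_csv under its default quoting, so no manual quoting is needed.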
CodePudding user response:
You need to convert the arrays into tuples to get the correct representation. To do so, you can apply the tuple function to the columns with object dtype.
to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)
to_save.to_csv(index=False)
Output:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
Note: This is dangerous if other columns also have object dtype, e.g. string columns, because tuple would split each string into its characters.
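To make the note concrete: tuple('ab') gives ('a', 'b'), so a string cell would be broken into characters. One possible way around it, as a sketch rather than a definitive fix, is to convert only the cells that really are NumPy arrays. The helper cell_to_tuple and the extra 'name' column are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

def cell_to_tuple(value):
    """Convert ndarray cells to tuples; leave every other value unchanged."""
    if isinstance(value, np.ndarray):
        # tolist() yields plain Python numbers, so the tuple prints as (1, 2)
        return tuple(value.tolist())
    return value

df = pd.DataFrame({
    'col1': [np.array([1, 2]), np.array([3, 4])],
    'name': ['ab', 'cd'],  # hypothetical string column
    'col3': [9, 10],
})

# Only object-dtype columns are visited cell by cell; others pass through.
to_save = df.apply(
    lambda col: col.map(cell_to_tuple) if col.dtype == object else col
)
csv_text = to_save.to_csv(index=False)
print(csv_text)
```

The per-cell isinstance check costs a little extra time, but it keeps string columns intact.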
CodePudding user response:
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
'col2': [np.array([5, 6]), np.array([7, 8])],
'col3': [9, 10]}
df = pd.DataFrame(data)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: tuple(x))
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: ''' "{}" '''.format(x))
col1 col2 col3
0 "(1, 2)" "(5, 6)" 9
1 "(3, 4)" "(7, 8)" 10
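A hedged variant of the above for writing the CSV. Two assumptions on my side: the manual quoting step is probably not wanted when the frame goes through to_csv, because to_csv already quotes fields that contain a comma and would escape hand-added quote characters by doubling them; and applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so the sketch picks whichever one is available. The name elementwise is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [np.array([1, 2]), np.array([3, 4])],
    'col2': [np.array([5, 6]), np.array([7, 8])],
    'col3': [9, 10],
})

cols = ['col1', 'col2']

# DataFrame.map exists from pandas 2.1 on; fall back to applymap before that.
elementwise = df[cols].map if hasattr(pd.DataFrame, 'map') else df[cols].applymap

# Convert each array to the string form of a tuple of plain Python numbers.
df[cols] = elementwise(lambda x: str(tuple(x.tolist())))

# No manual quotes: to_csv quotes the fields that contain commas.
csv_text = df.to_csv(index=False)
print(csv_text)
```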