How to perform operations over arrays in a pandas dataframe efficiently?


I've got a pandas DataFrame that contains NumPy arrays in some columns:

import numpy as np, pandas as pd

data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}

df = pd.DataFrame(data)

I need to store a large DataFrame like this one in a CSV file, and the arrays have to be written as strings that look like this:

col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10

What I'm currently doing is iterating over each column and each row of the DataFrame, but this doesn't seem efficient.

This is my current solution:

pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]

for index, row in df.iterrows():
    for column in array_columns:
        # Here 'tuple' is only used to replace brackets with parentheses
        df[column][index] = str(tuple(row[column]))

I tried using apply, although I've heard it's usually not an efficient alternative:

def array_to_str(array):
    return str(tuple(array))

df[array_columns] = df[array_columns].apply(array_to_str)

But my arrays become NaN:

   col1  col2  col3
0   NaN   NaN     9
1   NaN   NaN    10

I tried other similar solutions, but I kept running into this error:

ValueError: Must have equal len keys and value when setting with an iterable

Is there a more efficient way of performing the same operation? My real dataframes can contain many columns and thousands of rows.

CodePudding user response:

Try this:

tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv()
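
A minimal, self-contained version of the above (using the DataFrame from the question) could look like this; passing index=False to to_csv reproduces the desired layout, and to_csv quotes the tuple strings automatically because they contain commas:

import numpy as np, pandas as pd

data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)

tupcols = ['col1', 'col2']
# Convert each array to a tuple element-wise, then render the tuples as strings
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
print(df.to_csv(index=False))
# col1,col2,col3
# "(1, 2)","(5, 6)",9
# "(3, 4)","(7, 8)",10

As an aside, the NaN values in the question most likely come from df[array_columns].apply(array_to_str): DataFrame.apply passes each whole column to the function, so the result is a Series indexed by column names, which does not align with the row index when assigned back. Applying the conversion element-wise, as above, avoids that.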

CodePudding user response:

You need to convert the arrays into tuples to get the desired representation. To do so, you can apply the tuple function to the columns with object dtype.

to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)

to_save.to_csv(index=False)

Output:

col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10

Note: this would be risky if you have other object-dtype columns, e.g. string columns, since those would be converted to tuples of characters as well.
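
If that is a concern, a minimal sketch of one way to guard against it (assuming, as in the question, that the first element of a column is representative) is to convert only the columns whose values are actually NumPy arrays:

to_save = df.apply(
    lambda x: x.map(tuple)                    # element-wise array -> tuple
    if isinstance(x.iloc[0], np.ndarray)      # only touch true array columns
    else x
)
to_save.to_csv(index=False)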

CodePudding user response:

import numpy as np, pandas as pd

data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}

df = pd.DataFrame(data)
# Convert the arrays to tuples, then wrap each tuple string in literal quotes
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: tuple(x))
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: '"{}"'.format(x))

         col1        col2  col3
0   "(1, 2)"    "(5, 6)"      9
1   "(3, 4)"    "(7, 8)"     10