Problem:
Given a large data set (3 million rows x 6 columns), what's the fastest way to join the values of several columns into one column of a pandas DataFrame, but only on the rows where a mask is true?
My current solution:
import pandas as pd
import numpy as np
# Note: real data will be 3 million rows x 6 columns
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)
msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])   # columns to blank out afterwards: ['d1', 'd2']
column_join = ["d0"]
mask = df['msg'].isin(msg_text_filter)      # rows whose 'msg' matches the filter
df.replace(np.nan, '', inplace=True)        # NaN would break the string join below
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0', 'd1', 'd2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan   # clear d1/d2 on the rows that were joined
print(df)
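For reference, the printed result of the snippet above should look roughly like this: the masked rows ('msg0'/'msg2') get d0, d1 and d2 joined into d0, and d1/d2 cleared.

  time   msg   d0   d1   d2
0    0  msg0  abc  NaN  NaN
1    1  msg1    x    x    x
2    2  msg0   ab  NaN  NaN
3    3  msg2  123  NaN  NaN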
CodePudding user response:
I suspect the slowdown is caused by df[['d0','d1','d2']].agg(''.join, axis=1). I'd recommend pre-computing that array, and then using the mask to do direct assignment instead of using np.where():
aggregated = df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
df.loc[mask, 'd0'] = aggregated[mask]   # .loc avoids chained-indexing pitfalls
This is just speculation though, as I don't particularly feel like generating a bunch of random data to test the performance of this vs np.where(). However, since you're leaving some values alone, it seems like a better idea to use the mask to assign the new values. I'd really only use np.where if you needed to replace values for both boolean results.
That said, you should pre-compute and use the aggregated array in either case:
df['d0'] = np.where(
    mask,
    aggregated,
    df['d0']
)
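To back up that speculation, here is a rough benchmark sketch one could run; the row count, the random values and the use of time.perf_counter are assumptions for illustration, not taken from the question:

import time

import numpy as np
import pandas as pd

# Synthetic string data; size and values are arbitrary.
n = 300_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'msg': rng.choice(['msg0', 'msg1', 'msg2'], size=n),
    'd0': rng.choice(list('abc'), size=n),
    'd1': rng.choice(list('xyz'), size=n),
    'd2': rng.choice(list('123'), size=n),
})
mask = df['msg'].isin(['msg0', 'msg2'])

# The expensive part: build the joined strings once, up front.
start = time.perf_counter()
aggregated = df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
print('agg join       :', time.perf_counter() - start)

# Option 1: np.where over the whole column.
start = time.perf_counter()
df['d0_where'] = np.where(mask, aggregated, df['d0'])
print('np.where       :', time.perf_counter() - start)

# Option 2: masked .loc assignment, leaving unmasked rows untouched.
start = time.perf_counter()
df['d0_loc'] = df['d0']
df.loc[mask, 'd0_loc'] = aggregated[mask]
print('.loc assignment:', time.perf_counter() - start)

# Both options should produce identical results.
print((df['d0_where'] == df['d0_loc']).all())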
CodePudding user response:
IMHO you can save a lot of time by using df[['d0', 'd1', 'd2']].sum(axis=1) instead of df[['d0', 'd1', 'd2']].agg(''.join, axis=1) (summing string columns concatenates them row-wise, which tends to be much faster than applying ''.join to every row). And instead of using np.where I think you could just do:
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
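Putting both ideas together on the sample frame from the question, a minimal end-to-end sketch might look like this (the last step that blanks out d1/d2 is carried over from the question's own code, not something this answer adds):

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

mask = df['msg'].isin(['msg0', 'msg2'])
df.replace(np.nan, '', inplace=True)   # NaN would otherwise poison the concatenation

# Concatenate d0..d2 only for the masked rows, then clear the source columns there.
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
df.loc[mask, ['d1', 'd2']] = np.nan
print(df)

The result should match the np.where version, but the concatenation work is only done for the rows the mask selects.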