Fastest way to join column values in a pandas DataFrame?


Problem:

Given a large dataset (3 million rows x 6 columns), what's the fastest way to join the values of several columns into one column of a pandas DataFrame, but only for the rows where a mask is true?

My current solution:

import pandas as pd
import numpy as np
  
# Note: real data will be 3 million rows x 6 columns.
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)

msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]

mask = df['msg'].isin(msg_text_filter)

df.replace(np.nan, '', inplace=True)
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0','d1','d2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan

print(df)

CodePudding user response:

I suspect the slowdown is caused by df[['d0','d1','d2']].agg(''.join, axis=1), which calls the Python-level ''.join once per row. I'd recommend pre-computing that result, and then using the mask to do a direct assignment instead of using np.where():

aggregated = df[['d0', 'd1', 'd2']].agg(''.join, axis=1)

df.loc[mask, 'd0'] = aggregated[mask]

This is just speculation though, as I haven't generated a bunch of random data to test the performance of this vs np.where(). However, since you're leaving some values alone, it seems like a better idea to use the mask to assign only the new values; I'd really only use np.where if you needed to replace values for both branches of the condition.

That said, you should pre-compute and reuse the aggregated Series in either case:

df['d0'] = np.where(
    mask,
    aggregated,
    df['d0']
)
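
For what it's worth, here is a rough sketch of how one could benchmark the two assignment styles on synthetic data. The row count, the random values, and the timing harness are all made up for illustration; only the masked .loc assignment and the np.where() call come from the discussion above.

import time

import numpy as np
import pandas as pd

# Synthetic data shaped roughly like the question's frame (values are made up).
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'msg': rng.choice(['msg0', 'msg1', 'msg2'], size=n),
    'd0': rng.choice(list('abc'), size=n),
    'd1': rng.choice(list('xyz'), size=n),
    'd2': rng.choice(list('123'), size=n),
})
mask = df['msg'].isin(['msg0', 'msg2'])

# Pre-compute the row-wise join once, outside the timed sections.
# This Python-level row loop is the slow part identified above; it runs only once.
aggregated = df[['d0', 'd1', 'd2']].agg(''.join, axis=1)

t0 = time.perf_counter()
with_where = np.where(mask, aggregated, df['d0'])
t1 = time.perf_counter()

df2 = df.copy()
t2 = time.perf_counter()
df2.loc[mask, 'd0'] = aggregated[mask]
t3 = time.perf_counter()

print(f'np.where:           {t1 - t0:.3f} s')
print(f'masked .loc assign: {t3 - t2:.3f} s')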

CodePudding user response:

IMHO you can save a lot of time by using

df[['d0', 'd1', 'd2']].sum(axis=1)

instead of

df[['d0', 'd1', 'd2']].agg(''.join, axis=1)

And I think instead of using np.where you could just do:

df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
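
Putting both ideas together on the sample frame from the question, a sketch of the full revised flow might look like this (variable names follow the original post; the only changes are the masked .loc assignment and .sum(axis=1)):

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])   # ['d1', 'd2']
mask = df['msg'].isin(msg_text_filter)

# Replace NaN with '' so string concatenation is well defined.
df.replace(np.nan, '', inplace=True)

# Concatenate only the masked rows; on string columns, .sum(axis=1) joins the values.
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)

# Blank out the columns that were folded into d0.
df.loc[mask, columns] = np.nan

print(df)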