How to join multiple dataframe columns based on row index to specified column?


PROBLEM STATEMENT:

I'm trying to join multiple pandas DataFrame columns, based on row index, into a single column that already exists in the DataFrame. The issues seem to happen when a value in a column is read in as np.nan.

EXAMPLE:

Original Data frame

time  msg   d0  d1  d2
0     msg0  a   b   c
1     msg1  x   x   x
2     msg0  a   b   c
3     msg2  1   2   3

What I want, if I were to filter for msg0 and msg2:

time  msg   d0   d1   d2
0     msg0  abc  NaN  NaN
1     msg1  x    x    x
2     msg0  abc  NaN  NaN
3     msg2  123  NaN  NaN

MY ATTEMPT:

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
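
For reference, printing the constructed frame shows the NaN in row 2's d2, which the first table above omits:

print(df)
#   time   msg d0 d1   d2
# 0    0  msg0  a  b    c
# 1    1  msg1  x  x    x
# 2    2  msg0  a  b  NaN
# 3    3  msg2  1  2    3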

mask = df.index[((df['msg'] == "msg0") |
                 (df['msg'] == "msg2"))].tolist()

# Is there a better way to combine all columns after a certain point?
# This works fine here but has issues when importing large data sets:
# 'd0' will be set to NaN too. I think this is due to np.nan
# being set as some column values when imported.
df.loc[mask, 'd0'] = df['d0'] + df['d1'] + df['d2']
df.iloc[mask, 3:] = np.nan  # assign the actual NaN value, not the string "NaN"
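
The underlying issue: element-wise string concatenation with + propagates NaN, so any row with a NaN in d1 or d2 turns the whole combined value (and therefore d0) into NaN. A minimal sketch of the behaviour:

import numpy as np
import pandas as pd

s1 = pd.Series(['a', 'x'])
s2 = pd.Series(['b', np.nan])
# concatenation works row-wise, but NaN swallows the result
print(s1 + s2)
# 0     ab
# 1    NaN
# dtype: object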

CodePudding user response:

The approach is somewhat similar to @mozway's answer; I will make it more detailed so that it is easier to follow.

1- Define the target messages and columns (just to make them easier to work with)

# the messages to filter
msgs = ["msg0", "msg2"]
# the columns to filter
columns = df.columns.drop(['time', 'msg'])
# the column to contain the result
total_col = ["d0"]

2- Mask the rows based on the value of the msg column

mask = df['msg'].isin(msgs)
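
With the example frame, mask is a boolean Series selecting rows 0, 2 and 3:

print(mask)
# 0     True
# 1    False
# 2     True
# 3     True
# Name: msg, dtype: bool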

3- Compute the combined values

# a- mask the dataframe down to the target rows and columns
# b- use ''.join() to combine all the column values
# c- apply along axis=1 to join across columns, not rows
new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)
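
For the example frame (with NaN in row 2's d2), the intermediate Series looks like this; dropna() skips the NaN instead of letting it poison the whole row:

print(new_total_col)
# 0    abc
# 2     ab
# 3    123
# dtype: object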

4- Set all target columns and rows to np.nan, then write the combined values into the "total" column

df.loc[mask, columns] = np.nan
df.loc[mask, total_col] = new_total_col

Result

 time    msg    d0  d1  d2
0   0   msg0    abc NaN NaN
1   1   msg1    x   x   x
2   2   msg0    ab  NaN NaN
3   3   msg2    123 NaN NaN
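
Putting the four steps together, a complete runnable version (using the example frame from the question; total_col is written as a plain string here rather than a one-element list):

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

msgs = ["msg0", "msg2"]                      # messages to combine
columns = df.columns.drop(['time', 'msg'])   # data columns d0..d2
total_col = "d0"                             # column that receives the result

mask = df['msg'].isin(msgs)

# join the non-NaN values of each masked row into one string
new_total_col = df.loc[mask, columns].apply(
    lambda x: ''.join(x.dropna().astype(str)), axis=1)

df.loc[mask, columns] = np.nan
df.loc[mask, total_col] = new_total_col
print(df)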

CodePudding user response:

You can use:

cols = ['d0', 'd1', 'd2']

# get the rows matching the msg condition
m = df['msg'].isin(['msg0', 'msg2'])

# get relevant columns
# concatenate the non-NaN values
# update as a DataFrame to assign NaN in the non-first columns
df.loc[m, cols] = (df
   .loc[m, cols]
   .agg(lambda r: ''.join(r.dropna()), axis=1)
   .rename(cols[0]).to_frame()
 )

print(df)

Output:

  time   msg   d0   d1   d2
0    0  msg0  abc  NaN  NaN
1    1  msg1    x    x    x
2    2  msg0   ab  NaN  NaN
3    3  msg2  123  NaN  NaN
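
The rename(cols[0]).to_frame() step is what makes this work as a single assignment: the aggregated Series becomes a one-column DataFrame named d0, and assigning a DataFrame to df.loc[m, cols] aligns by column name, so the missing d1 and d2 are filled with NaN. A standalone sketch of that alignment behaviour, using made-up placeholder data:

import pandas as pd

target = pd.DataFrame({'d0': ['?', '?'], 'd1': ['?', '?'], 'd2': ['?', '?']})
rhs = pd.Series(['abc', 'xyz'], name='d0').to_frame()

# only d0 exists on the right-hand side; d1 and d2 align to NaN
target.loc[:, ['d0', 'd1', 'd2']] = rhs
print(target)
#     d0   d1   d2
# 0  abc  NaN  NaN
# 1  xyz  NaN  NaN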