PROBLEM STATEMENT:
I'm trying to join multiple pandas DataFrame columns, based on row index, into a single column that already exists in the DataFrame. The issue seems to occur when a column's data is read in as np.nan.
EXAMPLE:
Original Data frame
time | msg | d0 | d1 | d2 |
---|---|---|---|---|
0 | msg0 | a | b | c |
1 | msg1 | x | x | x |
2 | msg0 | a | b | c |
3 | msg2 | 1 | 2 | 3 |
What I want, if I were to filter for msg0 and msg2
time | msg | d0 | d1 | d2 |
---|---|---|---|---|
0 | msg0 | abc | NaN | NaN |
1 | msg1 | x | x | x |
2 | msg0 | abc | NaN | NaN |
3 | msg2 | 123 | NaN | NaN |
MY ATTEMPT:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
mask = df.index[((df['msg'] == "msg0") |
                 (df['msg'] == "msg2"))].tolist()
# Is there a better way to combine all columns after a certain point?
# This works fine here but has issues when importing large data sets:
# 'd0' gets set to NaN too, I think because np.nan
# ends up in some column values when imported.
df.loc[mask, 'd0'] = df['d0'] + df['d1'] + df['d2']
df.iloc[mask, 3:] = np.nan
CodePudding user response:
The approach is somewhat similar to @mozway's answer; I will make it more detailed so it is easier to follow.
1- Define your target columns and messages (just to make it easier to deal with)
# the messages to filter
msgs = ["msg0", "msg2"]
# the columns to filter
columns = df.columns.drop(['time', 'msg'])
# the column to contain the result
total_col = ["d0"]
2- Mask the rows based on the (msgs) column value
mask = df['msg'].isin(msgs)
3- Compute the combined value for each masked row
# a- mask the dataframe to the target columns and rows.
# b- apply ''.join() to join all the column values
# c- to join across columns rather than rows, apply on axis=1
new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)
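To see what this step produces in isolation, here is a minimal sketch using just the data columns from the question's frame; apply with axis=1 walks row by row, and dropna() skips the np.nan before joining:

```python
import numpy as np
import pandas as pd

# Just the data columns from the question's example frame
d = pd.DataFrame({'d0': ['a', 'x', 'a', '1'],
                  'd1': ['b', 'x', 'b', '2'],
                  'd2': ['c', 'x', np.nan, '3']})
mask = pd.Series([True, False, True, True])  # the msg0/msg2 rows

# Join each masked row's non-NaN values into one string
new_total_col = d.loc[mask, ['d0', 'd1', 'd2']].apply(
    lambda x: ''.join(x.dropna().astype(str)), axis=1)
print(new_total_col.tolist())  # ['abc', 'ab', '123']
```

Note the 'ab' in row 2: the np.nan in d2 is dropped rather than turning the whole result into NaN, which is exactly what the string-concatenation attempt in the question could not do.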
4- Set all target rows and columns to np.nan, then write the combined values back into the "total" column
df.loc[mask, columns] = np.nan
df.loc[mask, total_col] = new_total_col
Result
time msg d0 d1 d2
0 0 msg0 abc NaN NaN
1 1 msg1 x x x
2 2 msg0 ab NaN NaN
3 3 msg2 123 NaN NaN
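For completeness, here are the four steps above assembled into one self-contained snippet (frame taken from the question; the only change is writing the result back with the column name directly rather than the one-element total_col list):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

msgs = ["msg0", "msg2"]                       # step 1: messages to filter
columns = df.columns.drop(['time', 'msg'])    # step 1: data columns
mask = df['msg'].isin(msgs)                   # step 2: row mask
new_total_col = df.loc[mask, columns].apply(  # step 3: join each masked row
    lambda x: ''.join(x.dropna().astype(str)), axis=1)
df.loc[mask, columns] = np.nan                # step 4: blank the data columns
df.loc[mask, 'd0'] = new_total_col            # step 4: write the joined strings
print(df)
```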
CodePudding user response:
You can use:
cols = ['d0', 'd1', 'd2']
# get the rows matching the msg condition
m = df['msg'].isin(['msg0', 'msg2'])
# get relevant columns
# concatenate the non-NaN value
# update as DataFrame to assign NaN in the non-first columns
df.loc[m, cols] = (df
.loc[m, cols]
.agg(lambda r: ''.join(r.dropna()), axis=1)
.rename(cols[0]).to_frame()
)
print(df)
Output:
time msg d0 d1 d2
0 0 msg0 abc NaN NaN
1 1 msg1 x x x
2 2 msg0 ab NaN NaN
3 3 msg2 123 NaN NaN
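Run end to end on the question's frame, this is a self-contained version of the snippet above (nothing changed except the imports and the frame construction):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

cols = ['d0', 'd1', 'd2']
m = df['msg'].isin(['msg0', 'msg2'])
# Assigning a one-column DataFrame aligns on column labels,
# so the non-matching columns (d1, d2) are filled with NaN.
df.loc[m, cols] = (df
                   .loc[m, cols]
                   .agg(lambda r: ''.join(r.dropna()), axis=1)
                   .rename(cols[0]).to_frame()
                   )
print(df)
```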