I have a very specific dataset it looks something like this:
record_id | event_id | instrument | repeat_inst |
---|---|---|---|
PI0005 | v03_abc_1 | NaN |
1 |
PI0005 | v03_abc_1 | i_sensor | NaN |
PI0005 | v03_abc_1 | NaN |
NaN |
PI0005 | v02_abc_33 | i_sensor | NaN |
PI0005 | v02_abc_33 | NaN |
NaN |
PI0006 | v02_abc_1 | i_sensor | 1 |
PI0006 | v02_abc_1 | NaN |
NaN |
How do I make it look like this:
record_id | event_id | instrument | repeat_inst |
---|---|---|---|
PI0005 | v03_abc_1 | i_sensor | 1 |
PI0005 | v02_abc_33 | i_sensor | NaN |
PI0006 | v02_abc_2 | i_sensor | 1 |
Where rows with the same record_id
and event_id
get merged together, where NaN
values are replaced with the other value, and if both values are NaN
, then NaN
can be kept (like in the forth and fifth row in the original dataframe).
Assume that only one of the related cells have a value and all others have NaN
.
This should apply to all columns of the data, there are thousands of columns and rows.
I tried using group by, but don't know how to continue.
CodePudding user response:
With R
library(dplyr)
df1 %>%
group_by(record_id, event_id) %>%
summarise(across(everything(), ~.x[!is.na(.x)][1]))