Merging data in R within the same dataframe-CodePudding

Hoping there's an easy solution here. Basically, I have a very long .csv that has data that was created by multiple users, which consists of an ID, and two variables corresponding to weight before and after a procedure. To avoid human error in data entry, and since the data for the IDs wasn't generated sequentially, the data is often entered like this on the .csv (example df created):

id = c(rep(1:4,2),5:8)
pre = c(rep(10,4),rep(NA,4),rep(100,4))
post = c(rep(NA,4),rep(10,4),rep(100,4))
df = cbind(id,pre,post)

print(df)
      id pre post
 [1,]  1  10   NA
 [2,]  2  10   NA
 [3,]  3  10   NA
 [4,]  4  10   NA
 [5,]  1  NA   10
 [6,]  2  NA   10
 [7,]  3  NA   10
 [8,]  4  NA   10
 [9,]  5 100  100
[10,]  6 100  100
[11,]  7 100  100
[12,]  8 100  100

When I need the data to merge the pre and post columns together via the ID column- only if they're not already joined by coincidence (like how IDs 8-12 happen to have been inputted together). So that it looks like:

id_ = c(1:8)
pre_ = c(rep(10,4),rep(100,4))
post_ = c(rep(10,4),rep(100,4))
df_final = cbind(id_,pre_,post_)

print(df_final)
     id_ pre_ post_
[1,]   1   10    10
[2,]   2   10    10
[3,]   3   10    10
[4,]   4   10    10
[5,]   5  100   100
[6,]   6  100   100
[7,]   7  100   100
[8,]   8  100   100

I know if two sets of data are in two different dfs, I can use join_left from dplyr to merge data...unsure what to do in this situation since they're all in one .csv and not every single ID is unjoined.

Thanks.

CodePudding user response：

Your question doesn't make clear, but it sounds from your description that every id should have a pre and post value. If that works, the following would work.

Per r2evans comment above, I assume you are using the tidyverse due to your tag of tidyverse.

library(tidyverse)

id = c(rep(1:4,2),5:8)
pre = c(rep(10,4),rep(NA,4),rep(100,4))
post = c(rep(NA,4),rep(10,4),rep(100,4))
df = cbind(id,pre,post)
df<- as.data.frame(df)


df %>% left_join(., df, by = "id") %>% select(id, pre.x, post.y) %>% drop_na()

  id pre.x post.y
1  1    10     10
2  2    10     10
3  3    10     10
4  4    10     10
5  5   100    100
6  6   100    100
7  7   100    100
8  8   100    100

CodePudding user response：

We may use a group by summarise assuming that there are equal length vectors

library(dplyr)
df %>%
    group_by(id) %>%
     summarise(across(c(pre, post), ~ .x[complete.cases(.x)]))
# A tibble: 8 × 3
     id   pre  post
  <int> <dbl> <dbl>
1     1    10    10
2     2    10    10
3     3    10    10
4     4    10    10
5     5   100   100
6     6   100   100
7     7   100   100
8     8   100   100

data

df <- data.frame(id, pre, post)

CodePudding user response：

Another alternative would be to fill one of your columns to overwrite NA's with the (presumably sole) found value, then just keep one of each complete row.

df %>%
  group_by(id) %>%
  fill(pre, .direction = "downup") %>%
  ungroup() %>%
  drop_na() 
  # distinct(id, .keep_all = TRUE)  # might help if an id has both complete and incomplete records