Home > other >  Merging data in R within the same dataframe
Merging data in R within the same dataframe

Time:01-26

Hoping there's an easy solution here. Basically, I have a very long .csv that has data that was created by multiple users, which consists of an ID, and two variables corresponding to weight before and after a procedure. To avoid human error in data entry, and since the data for the IDs wasn't generated sequentially, the data is often entered like this on the .csv (example df created):

id = c(rep(1:4,2),5:8)
pre = c(rep(10,4),rep(NA,4),rep(100,4))
post = c(rep(NA,4),rep(10,4),rep(100,4))
df = cbind(id,pre,post)

print(df)
      id pre post
 [1,]  1  10   NA
 [2,]  2  10   NA
 [3,]  3  10   NA
 [4,]  4  10   NA
 [5,]  1  NA   10
 [6,]  2  NA   10
 [7,]  3  NA   10
 [8,]  4  NA   10
 [9,]  5 100  100
[10,]  6 100  100
[11,]  7 100  100
[12,]  8 100  100

When I need the data to merge the pre and post columns together via the ID column- only if they're not already joined by coincidence (like how IDs 8-12 happen to have been inputted together). So that it looks like:

id_ = c(1:8)
pre_ = c(rep(10,4),rep(100,4))
post_ = c(rep(10,4),rep(100,4))
df_final = cbind(id_,pre_,post_)

print(df_final)
     id_ pre_ post_
[1,]   1   10    10
[2,]   2   10    10
[3,]   3   10    10
[4,]   4   10    10
[5,]   5  100   100
[6,]   6  100   100
[7,]   7  100   100
[8,]   8  100   100

I know if two sets of data are in two different dfs, I can use join_left from dplyr to merge data...unsure what to do in this situation since they're all in one .csv and not every single ID is unjoined.

Thanks.

CodePudding user response:

Your question doesn't make clear, but it sounds from your description that every id should have a pre and post value. If that works, the following would work.

Per r2evans comment above, I assume you are using the tidyverse due to your tag of tidyverse.

library(tidyverse)

id = c(rep(1:4,2),5:8)
pre = c(rep(10,4),rep(NA,4),rep(100,4))
post = c(rep(NA,4),rep(10,4),rep(100,4))
df = cbind(id,pre,post)
df<- as.data.frame(df)


df %>% left_join(., df, by = "id") %>% select(id, pre.x, post.y) %>% drop_na()

  id pre.x post.y
1  1    10     10
2  2    10     10
3  3    10     10
4  4    10     10
5  5   100    100
6  6   100    100
7  7   100    100
8  8   100    100

CodePudding user response:

We may use a group by summarise assuming that there are equal length vectors

library(dplyr)
df %>%
    group_by(id) %>%
     summarise(across(c(pre, post), ~ .x[complete.cases(.x)]))
# A tibble: 8 × 3
     id   pre  post
  <int> <dbl> <dbl>
1     1    10    10
2     2    10    10
3     3    10    10
4     4    10    10
5     5   100   100
6     6   100   100
7     7   100   100
8     8   100   100

data

df <- data.frame(id, pre, post)

CodePudding user response:

Another alternative would be to fill one of your columns to overwrite NA's with the (presumably sole) found value, then just keep one of each complete row.

df %>%
  group_by(id) %>%
  fill(pre, .direction = "downup") %>%
  ungroup() %>%
  drop_na() 
  # distinct(id, .keep_all = TRUE)  # might help if an id has both complete and incomplete records
  •  Tags:  
  • Related