Home > Enterprise >  Collapsing rows based on grouping variable and removing empty entries
Collapsing rows based on grouping variable and removing empty entries

Time:07-01

I have a data set that looks something like this

user_id <- c(12345,12345,12345,12345,12345,12356,12356,12356,12356,12356)
r1 <- c(1,NA,NA,NA,NA,1,NA,NA,NA,NA)
r2 <- c(NA,1,NA,NA,NA,NA,1,NA,NA,NA)
r3 <- c(NA,NA,1,NA,NA,NA,NA,1,NA,NA)
r4 <- c(NA,NA,NA,1,NA,NA,NA,NA,1,NA)
r5 <- c(NA,NA,NA,NA,1,NA,NA,NA,NA,1)

df <- data.frame(user_id,r1,r2,r3,r4,r5)

I would like to be able to remove the blank spaces (my data actually has NA's in these spaces, so that each user ID has only one row, with all of the ratings on the same row. I have tried using this post to figure things out but I would like to keep it as a data frame etc. The process from the original post hasn't been working for me so far. Ideally, I would like to stick to the tidyverse but at this point I am just trying to make it work.

CodePudding user response:

Pivoting it to long format is usually helpful here, as it allows us to work more easily with the grouping variable; in this case, removing any lines which have "" and then pivoting to a wide format again.

library(dplyr)
library(tidyr)

df %>% 
    pivot_longer(-user_id) %>% 
    filter(value != "") %>% 
    pivot_wider(names_from="name")

In the case that the values are NA change the filter1 command to filter(!value %in% NA),

# A tibble: 2 × 6
  user_id r1    r2    r3    r4    r5   
    <dbl> <chr> <chr> <chr> <chr> <chr>
1   12345 1     1     1     1     1    
2   12356 1     1     1     1     1   

CodePudding user response:

Another option using gather and spread where you first convert the empty cells to NA:

user_id <- c(12345,12345,12345,12345,12345,12356,12356,12356,12356,12356)
r1 <- c(1,'','','','',1,'','','','')
r2 <- c('',1,'','','','',1,'','','')
r3 <- c('','',1,'','','','',1,'','')
r4 <- c('','','',1,'','','','',1,'')
r5 <- c('','','','',1,'','','','',1)
df <- data.frame(user_id,r1,r2,r3,r4,r5)

library(dplyr)
library(tidyr)
df %>% 
  na_if("") %>%
  gather(key, value, -user_id) %>% 
  na.omit() %>% 
  spread(key, value)
#>   user_id r1 r2 r3 r4 r5
#> 1   12345  1  1  1  1  1
#> 2   12356  1  1  1  1  1

Created on 2022-06-30 by the reprex package (v2.0.1)

CodePudding user response:

Using base R, assuming your actual data has the same regularity:

df |> 
  stack() |>
  subset(values != "" & (ind != "user_id" | !duplicated(values))) |>
  unstack()


#   user_id r1 r2 r3 r4 r5
# 1   12345  1  1  1  1  1
# 2   12356  1  1  1  1  1
  • Related