R Data Frames: Working with rows containing multiple values in one or more columns-CodePudding

I am working with a data set that has multiple values in several columns for each row.

I need to be able to run basic stats on this data, such as summaries and crosstabs.

About the Values

Each value is seperated with a comma, none are numeric.
There can be zero to three values in the example. Real world is zero to many options (I think 8 is the max I have seen so far)
In some cases, at least one of the columns has repeats of the same value (e.g. "Filter,Gear,Filter")

Note: In cases where a row has multiples of the same value in a column, for analysis purposes they should be made unique (e.g. "Filter,Gear~~,Filter~~")

Mock Data Example

df.orig <- structure(list(id = 1:5, Part = structure(c(4L, 16L, 11L, 5L, 
    19L), levels = c("Gear", "Gear,Gear", "Rocker", "", 
    "Piston,Hose,Piston", "Piston,Cam,Lamp", 
    "Chain,Cam,Plug", "Compressor,Pan,Cam", "Belt", "Belt,Chain", 
    "Cap,AC,Cam", "Heater", "Blade,Heater", "Plug,Filter,Filter", 
    "Bolt,Piston", "Gauge,Gauge,Bolt", "Chain,Piston,Bar", 
    "Fluid,Pan,Fluid", "Fluid,Fuse,Bolt", "Lamp"), class = "factor"), 
    Achievements = structure(c(11L, 12L, 9L, 8L, 5L), levels = c("Award,Badge,Plaque", 
    "Award,Cetificate,Plaque", "Award,Plaque,Plaque", "Badge", 
    "Badge,Badge,Cetificate", "Badge,Pin,Award", "Cetificate", 
    "Cetificate,Badge,Plaque", "Pin,Plaque,Badge", "Pin,Plaque,Plaque", 
    "Plaque", "Plaque,Award,Cetificate", "Plaque,Badge,Plaque", 
    "Plaque,Pin,Award"), class = "factor")), row.names = c(NA, 5L), 
    class = "data.frame")

What is the best method to manipulate those columns for easy analysis in R?

Is it to split on the comma, so that it creates multiple columns (as per df example below). Or is it to keep the data together with the separator (as per df above) and use R tools to do the heavy lifting?

Split on Comma

df.split <- structure(list(id = 1:5, Part_1 = c("", "Gauge", "Cap", "Piston", 
    "Fluid"), Part_2 = c("", "Gauge", "AC", "Hose", "Fuse"), Part_3 = c("", 
    "Bolt", "Cam", "Piston", "Bolt"), Ach_1 = c("Plaque", "Plaque", 
    "Pin", "Cetificate", "Badge"), Ach_2 = c("", "Award", "Plaque", 
    "Badge", "Badge"), Ach_3 = c("", "Cetificate", "Badge", "Plaque", "Cetificate")), 
    class = "data.frame", 
    row.names = c(NA, -5L))

If it is split into new columns, in my mind I am foreseeing multiple subsets, or loops and uniqued arrays to cycle though each row of data, prior to analysis. However, there may be better R packages to handle what I need to do.

I am not sure:

What are the best practices here for structuring the df in R?
What R packages are best suited to manipulating the data before running stats tools on them?

Any assitance or direction to detailed documentation would be appreciated.

CodePudding user response：

I think the best way to organize this data frame is in long format. The tidyverse functions pivot_longer, mutate and unnest will help you here:

library(tidyverse)

df.new <- df.orig %>%
  pivot_longer(Part:Achievements) %>%
  mutate(value = strsplit(as.character(value), ",")) %>%
  unnest(value) %>%
  group_by(id, name) %>%
  summarize(value = unique(value), .groups = "drop") %>%
  as.data.frame()

This gives you

df.new
#>    id         name      value
#> 1   1 Achievements     Plaque
#> 2   2 Achievements     Plaque
#> 3   2 Achievements      Award
#> 4   2 Achievements Cetificate
#> 5   2         Part      Gauge
#> 6   2         Part       Bolt
#> 7   3 Achievements        Pin
#> 8   3 Achievements     Plaque
#> 9   3 Achievements      Badge
#> 10  3         Part        Cap
#> 11  3         Part         AC
#> 12  3         Part        Cam
#> 13  4 Achievements Cetificate
#> 14  4 Achievements      Badge
#> 15  4 Achievements     Plaque
#> 16  4         Part     Piston
#> 17  4         Part       Hose
#> 18  5 Achievements      Badge
#> 19  5 Achievements Cetificate
#> 20  5         Part      Fluid
#> 21  5         Part       Fuse
#> 22  5         Part       Bolt

and

table(df.new$name, df.new$value)
              
#>                AC Award Badge Bolt Cam Cap Cetificate Fluid Fuse Gauge Hose Pin Piston Plaque
#>   Achievements  0     1     3    0   0   0          3     0    0     0    0   1      0      4
#>   Part          1     0     0    2   1   1          0     1    1     1    1   0      1      0

^{Created on 2022-10-03 with reprex v2.0.2}

CodePudding user response：

Here is one option if you want it in wide format, where we turn each cell into a vector using str_split. Then, we can use map to keep only the unique values for each cell. Then, we can use unnest_wider to put it into the wide format.

library(tidyverse)

df.orig %>%
  mutate(across(c(Part, Achievements), ~ map(str_split(.x, ","), ~ unique(.x)))) %>%
  unnest_wider(c(Part, Achievements), names_sep = "_")

Output

     id Part_1   Part_2 Part_3 Achievements_1 Achievements_2 Achievements_3
  <int> <chr>    <chr>  <chr>  <chr>          <chr>          <chr>         
1     1          NA     NA     Plaque         NA             NA            
2     2 Gauge    Bolt   NA     Plaque         Award          Cetificate    
3     3 Cap      AC     Cam    Pin            Plaque         Badge         
4     4 Piston   Hose   NA     Cetificate     Badge          Plaque        
5     5 Fluid    Fuse   Bolt   Badge          Cetificate     NA

CodePudding user response：

Here's a base R solution that converts it to wide format. you could then apply rowwise functions across groups different variables.

df.orig[df.orig==""]<-NA # NA instead of whitespace is needed here
new_df <- data.frame(id=df.orig$id) # new empty dataset

for(v in c("Part","Achievements")){
  #separate commas
  com_spl <- strsplit(as.character(df.orig[,v]), ",") #split strings
  spl_df  <- do.call(rbind.data.frame, com_spl) #create df.orig of split strings list
  names(spl_df) <- paste0(v,1:ncol(spl_df)) #rename variables

  #remove duplicates
  dupes <- t(apply(spl_df,1,duplicated)) # boolean dataframe of duplicates
  dup_idx <- which(dupes==T,arr.ind=T)  #location of duplicates
  spl_df[dup_idx] <- NA #remove duplicates

  #combine
  new_df <- cbind(new_df,spl_df) #add new vars to df.orig
}

#then e.g. unlist(new_df[2,grep("Achievements",names(new_df))])