I am working with a data set that has multiple values in several columns for each row.
I need to be able to run basic stats on this data, such as summaries and crosstabs.
About the Values
- Each value is seperated with a comma, none are numeric.
- There can be zero to three values in the example. Real world is zero to many options (I think 8 is the max I have seen so far)
- In some cases, at least one of the columns has repeats of the same value (e.g. "Filter,Gear,Filter")
Note: In cases where a row has multiples of the same value in a column, for analysis purposes they should be made unique (e.g. "Filter,Gear,Filter")
Mock Data Example
df.orig <- structure(list(id = 1:5, Part = structure(c(4L, 16L, 11L, 5L,
19L), levels = c("Gear", "Gear,Gear", "Rocker", "",
"Piston,Hose,Piston", "Piston,Cam,Lamp",
"Chain,Cam,Plug", "Compressor,Pan,Cam", "Belt", "Belt,Chain",
"Cap,AC,Cam", "Heater", "Blade,Heater", "Plug,Filter,Filter",
"Bolt,Piston", "Gauge,Gauge,Bolt", "Chain,Piston,Bar",
"Fluid,Pan,Fluid", "Fluid,Fuse,Bolt", "Lamp"), class = "factor"),
Achievements = structure(c(11L, 12L, 9L, 8L, 5L), levels = c("Award,Badge,Plaque",
"Award,Cetificate,Plaque", "Award,Plaque,Plaque", "Badge",
"Badge,Badge,Cetificate", "Badge,Pin,Award", "Cetificate",
"Cetificate,Badge,Plaque", "Pin,Plaque,Badge", "Pin,Plaque,Plaque",
"Plaque", "Plaque,Award,Cetificate", "Plaque,Badge,Plaque",
"Plaque,Pin,Award"), class = "factor")), row.names = c(NA, 5L),
class = "data.frame")
What is the best method to manipulate those columns for easy analysis in R?
Is it to split on the comma, so that it creates multiple columns (as per df example below). Or is it to keep the data together with the separator (as per df above) and use R tools to do the heavy lifting?
Split on Comma
df.split <- structure(list(id = 1:5, Part_1 = c("", "Gauge", "Cap", "Piston",
"Fluid"), Part_2 = c("", "Gauge", "AC", "Hose", "Fuse"), Part_3 = c("",
"Bolt", "Cam", "Piston", "Bolt"), Ach_1 = c("Plaque", "Plaque",
"Pin", "Cetificate", "Badge"), Ach_2 = c("", "Award", "Plaque",
"Badge", "Badge"), Ach_3 = c("", "Cetificate", "Badge", "Plaque", "Cetificate")),
class = "data.frame",
row.names = c(NA, -5L))
If it is split into new columns, in my mind I am foreseeing multiple subsets, or loops and uniqued arrays to cycle though each row of data, prior to analysis. However, there may be better R packages to handle what I need to do.
I am not sure:
- What are the best practices here for structuring the df in R?
- What R packages are best suited to manipulating the data before running stats tools on them?
Any assitance or direction to detailed documentation would be appreciated.
CodePudding user response:
I think the best way to organize this data frame is in long format. The tidyverse functions pivot_longer
, mutate
and unnest
will help you here:
library(tidyverse)
df.new <- df.orig %>%
pivot_longer(Part:Achievements) %>%
mutate(value = strsplit(as.character(value), ",")) %>%
unnest(value) %>%
group_by(id, name) %>%
summarize(value = unique(value), .groups = "drop") %>%
as.data.frame()
This gives you
df.new
#> id name value
#> 1 1 Achievements Plaque
#> 2 2 Achievements Plaque
#> 3 2 Achievements Award
#> 4 2 Achievements Cetificate
#> 5 2 Part Gauge
#> 6 2 Part Bolt
#> 7 3 Achievements Pin
#> 8 3 Achievements Plaque
#> 9 3 Achievements Badge
#> 10 3 Part Cap
#> 11 3 Part AC
#> 12 3 Part Cam
#> 13 4 Achievements Cetificate
#> 14 4 Achievements Badge
#> 15 4 Achievements Plaque
#> 16 4 Part Piston
#> 17 4 Part Hose
#> 18 5 Achievements Badge
#> 19 5 Achievements Cetificate
#> 20 5 Part Fluid
#> 21 5 Part Fuse
#> 22 5 Part Bolt
and
table(df.new$name, df.new$value)
#> AC Award Badge Bolt Cam Cap Cetificate Fluid Fuse Gauge Hose Pin Piston Plaque
#> Achievements 0 1 3 0 0 0 3 0 0 0 0 1 0 4
#> Part 1 0 0 2 1 1 0 1 1 1 1 0 1 0
Created on 2022-10-03 with reprex v2.0.2
CodePudding user response:
Here is one option if you want it in wide format, where we turn each cell into a vector using str_split
. Then, we can use map
to keep only the unique values for each cell. Then, we can use unnest_wider
to put it into the wide format.
library(tidyverse)
df.orig %>%
mutate(across(c(Part, Achievements), ~ map(str_split(.x, ","), ~ unique(.x)))) %>%
unnest_wider(c(Part, Achievements), names_sep = "_")
Output
id Part_1 Part_2 Part_3 Achievements_1 Achievements_2 Achievements_3
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 NA NA Plaque NA NA
2 2 Gauge Bolt NA Plaque Award Cetificate
3 3 Cap AC Cam Pin Plaque Badge
4 4 Piston Hose NA Cetificate Badge Plaque
5 5 Fluid Fuse Bolt Badge Cetificate NA
CodePudding user response:
Here's a base R solution that converts it to wide format. you could then apply rowwise functions across groups different variables.
df.orig[df.orig==""]<-NA # NA instead of whitespace is needed here
new_df <- data.frame(id=df.orig$id) # new empty dataset
for(v in c("Part","Achievements")){
#separate commas
com_spl <- strsplit(as.character(df.orig[,v]), ",") #split strings
spl_df <- do.call(rbind.data.frame, com_spl) #create df.orig of split strings list
names(spl_df) <- paste0(v,1:ncol(spl_df)) #rename variables
#remove duplicates
dupes <- t(apply(spl_df,1,duplicated)) # boolean dataframe of duplicates
dup_idx <- which(dupes==T,arr.ind=T) #location of duplicates
spl_df[dup_idx] <- NA #remove duplicates
#combine
new_df <- cbind(new_df,spl_df) #add new vars to df.orig
}
#then e.g. unlist(new_df[2,grep("Achievements",names(new_df))])