Use of pivot_wider() to match dataframes information-CodePudding

I am trying to implement this procedure Create a data frame with values column names matching another data frame's name but for a slightly different type of dataset. I have three dataframes that are a subset of ~100 dataframes:

jack<-data.frame(12,33,4)
colnames(jack)[1]<-"sex"
colnames(jack)[2]<-"age"
colnames(jack)[3]<-"mass"

ethan<-data.frame(0,4,54)
colnames(ethan)[1]<-"sex"
colnames(ethan)[2]<-"age"
colnames(ethan)[3]<-"mass"

ellie<-data.frame(5,0,88)
colnames(ellie)[1]<-"sex"
colnames(ellie)[2]<-"age"
colnames(ellie)[3]<-"mass"

Those data come originally in a .csv and I read them in R with lapply. For this example's sake, I stored them in a list:

mylist<-list(jack,ethan,ellie)
names(mylist)<-c("jack","ethan","ellie")

And a list of names (note that not all names have a corresponding dataframe)

mydf<-as.data.frame(c("martha","jack","luis","ethan","ellie"))
colnames(mydf)[1]<-"name"

Following the above mentioned procedure, I first create an empty dataframe with all the information:

mydf_1 <- mydf %>% 
  mutate(sex = 0,
         age = 0,
         mass = 0) %>% 
  select(sex,age,mass,name)

Then I create, for each dataframe, a column with its name

mylist <- Map(cbind, mylist, name = names(mylist))

Now I cannot understand what to put in pivot_wider() to have the values from sex, age, and mass matching the corresponding column and a 0 if there is no match (or if the value is 0). This is what I tried so far:

mydf_final <- mylist %>% 
  bind_rows() %>% 
  bind_rows(mydf_1) %>% 
  pivot_wider(names_from = c("sex","age","mass"), values_from = c(sex,age,mass), values_fill = 0) %>% 
  select(-`0`)

The desired output is a single dataframe.

Unluckily I am not very familiar with the tidyverse... I am using R 4.0.3

CodePudding user response：

Here is an alternative approach, if the output is like demonstrated by jrcalabrese:

library(dplyr)
bind_rows(list(jack=jack, ethan=ethan, ellie=ellie), .id = 'name') %>% 
  full_join(mydf, by="name") %>% 
  replace(is.na(.), 0)

    name sex age mass
1   jack  12  33    4
2  ethan   0   4   54
3  ellie   5   0   88
4 martha   0   0    0
5   luis   0   0    0

CodePudding user response：

If I understand correctly, it sounds like you want to merge everyone's (i.e. Jack's, Ethan's) dataframes together into one dataframe, but use zeroes to indicate if someone's (e.g., Martha's) data is missing. Rather than putting everything in a list, we can bind the three dataframes directly, merge with the five names, and replace any NA with zeroes.

EDIT: I've edited the below output so the code is less dependent on individual dataframes and tries to use the list more.

library(tidyverse)

# load in a list of .csv files
jack<-data.frame(12,33,4)
colnames(jack)[1]<-"sex"
colnames(jack)[2]<-"age"
colnames(jack)[3]<-"mass"
ethan<-data.frame(0,4,54)
colnames(ethan)[1]<-"sex"
colnames(ethan)[2]<-"age"
colnames(ethan)[3]<-"mass"
ellie<-data.frame(5,0,88)
colnames(ellie)[1]<-"sex"
colnames(ellie)[2]<-"age"
colnames(ellie)[3]<-"mass"
mylist<-list(jack,ethan,ellie)
names(mylist)<-c("jack","ethan","ellie")
rm(jack, ethan, ellie)

# load in a reference list that contains empty dataframes
mydf<-as.data.frame(c("martha","jack","luis","ethan","ellie"))
colnames(mydf)[1]<-"name"

# give everyone in `mylist` a variable that indicates its name
goodnames <- names(mylist)
mylist <- mapply(cbind, mylist, "name" = goodnames, SIMPLIFY = FALSE)
rm(goodnames)

# bind all the good dfs in mylist
final <- mylist %>%
  bind_rows() %>%
  merge(., mydf, by = "name", all.y = TRUE) %>%
  replace(is.na(.), 0)

head(final)
#>     name sex age mass
#> 1  ellie   5   0   88
#> 2  ethan   0   4   54
#> 3   jack  12  33    4
#> 4   luis   0   0    0
#> 5 martha   0   0    0

CodePudding user response：

This can be done in base R - loop over the intersecting names from the 'mylist' and the names column and then do the assign

mydf[c("sex", "age", "mass")] <- 0
for(nm in intersect(mydf$name, names(mylist))) 
   mydf[mydf$name == nm, c("sex", "age", "mass")] <- mylist[[nm]]

-output

> mydf
    name sex age mass
1 martha   0   0    0
2   jack  12  33    4
3   luis   0   0    0
4  ethan   0   4   54
5  ellie   5   0   88