I have a list, dflist, that holds 4 datasets. Now I want to list all potential ID variables in each dataset. My criteria are: 1) the variable has more than 80% unique observations; 2) the variable has no more than 30% missing values.
To get these statistics, I first use the skim() function from the skimr package in R to get a tibble containing the summary information, then I use filter() to keep the variables that meet the two criteria above. Here is my code:
dfa <- dflist[[1]] %>%
  mutate_if(is.numeric, as.character) %>%
  skim() %>%
  as_tibble() %>%
  filter(character.n_unique >= nrow(dflist[[1]]) * 0.01) %>%
  filter(n_missing <= nrow(dflist[[1]]) * 0.30)
This code works fine and returns the expected variables for dataset 1. However, I have 4 datasets of different sizes, so I would like to put this into a loop. Here is my attempt:
First, I create a list, dfid, to hold the new results, since I do not want dflist to be modified. Then I changed the 1 in dflist[[1]] to i. But this code does not work; R reports:
Error in filter(., dflist[[i]][, character.n_unique] >= nrow(dflist[[1]]) * :
Caused by error in `[.data.frame`:
! undefined columns selected
Here is my code:
dfid <- list()
for (i in 1:4) {
  dfid[[i]] <- dflist[[i]] %>%
    mutate_if(is.numeric, as.character) %>%
    skim() %>%
    as_tibble() %>%
    filter(dflist[[i]][, character.n_unique] >= nrow(dflist[[i]]) * 0.01) %>%
    filter(dflist[[i]][, n_missing] <= nrow(dflist[[i]]) * 0.30)
}
So my questions are:
- How can I fix this error?
- Once dfid[[i]] holds the desired variables for each of the 4 datasets, what code should I add to combine the 4 results, remove duplicate variable names, and get a vector of variable names from the combined result?
Thanks a lot for your help in advance~~!
CodePudding user response:
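Column names must be quoted when indexing with [, unless the name is an object that holds the column name. In the loop, the bare name character.n_unique inside dflist[[i]][, ...] is most likely evaluated to the column of counts from the skim() summary and then used to index the original data, which is what triggers "undefined columns selected". It may be easier to loop with map/lapply: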
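library(purrr)
library(dplyr)
library(skimr)
dfid <- map(dflist, ~ .x %>%
  mutate(across(where(is.numeric), as.character)) %>%
  skim() %>%
  as_tibble() %>%
  # compare against nrow(.x), the row count of the original dataset;
  # n() here would count the rows of the skim summary (one per variable)
  filter(character.n_unique >= nrow(.x) * 0.01) %>%
  filter(n_missing <= nrow(.x) * 0.30))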
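As a cross-check that does not depend on the column names skimr produces, the two criteria can also be computed directly. The following is only a minimal sketch in base R: the helper name id_candidates, its arguments min_unique and max_missing, and the toy data frame are assumptions made up for illustration. The defaults follow the criteria stated in the question (80% and 30%); the posted code uses 0.01 for the uniqueness share, which can be passed as min_unique = 0.01.

```r
# Hypothetical helper: names of the columns that satisfy both criteria.
id_candidates <- function(df, min_unique = 0.80, max_missing = 0.30) {
  n <- nrow(df)
  keep <- vapply(df, function(col) {
    n_unique  <- length(unique(col[!is.na(col)]))  # distinct non-missing values
    n_missing <- sum(is.na(col))                   # count of missing values
    n_unique >= n * min_unique && n_missing <= n * max_missing
  }, logical(1))
  names(df)[keep]
}

# Toy data (made up), 10 rows
toy <- data.frame(
  id    = 1:10,                         # 10 unique values, none missing
  group = rep(c("a", "b"), 5),          # only 2 unique values
  code  = c(letters[1:6], rep(NA, 4)),  # 6 unique values, 4 missing
  stringsAsFactors = FALSE
)

id_candidates(toy)                     # only "id" passes both thresholds
id_candidates(toy, min_unique = 0.01)  # "id" and "group"; "code" fails on missingness
```

Applied to the whole list, lapply(dflist, id_candidates) would return one character vector of candidate names per dataset.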
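The same fix applies to the for loop: inside the pipe, filter() refers to the columns of the skim() summary by bare name, so the dflist[[i]][, ...] indexing is not needed: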
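dfid <- vector('list', length(dflist))
for (i in seq_along(dflist)) {
  tmp <- dflist[[i]]
  dfid[[i]] <- tmp %>%
    mutate_if(is.numeric, as.character) %>%
    skim() %>%
    as_tibble() %>%
    # nrow(tmp) is the row count of the original dataset;
    # n() would count the rows of the skim summary instead
    filter(character.n_unique >= nrow(tmp) * 0.01) %>%
    filter(n_missing <= nrow(tmp) * 0.30)
}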
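For the second question, one possible approach is to stack the filtered summaries and keep the distinct values of skim_variable, the column in which skim() stores the variable names. This is a sketch only: so that it runs on its own, the dfid below is a hypothetical stand-in with made-up variable names and counts that mimics the filtered summaries; in practice it would be the dfid built above.

```r
library(dplyr)

# Hypothetical stand-in for dfid: each element mimics a filtered skim() summary
dfid <- list(
  tibble(skim_variable      = c("id", "email"),
         n_missing          = c(0L, 1L),
         character.n_unique = c(10L, 9L)),
  tibble(skim_variable      = c("id", "order_no"),
         n_missing          = c(0L, 0L),
         character.n_unique = c(8L, 8L))
)

id_vars <- dfid %>%
  bind_rows(.id = "dataset") %>%  # stack the summaries; "dataset" records the list position
  distinct(skim_variable) %>%     # keep each variable name once
  pull(skim_variable)             # extract the names as a character vector

id_vars
```

With the stand-in data this gives "id", "email" and "order_no", in order of first appearance. This step would go after the loop rather than inside it, since it needs all of the summaries at once.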