Using loop to repeat the same function for different datasets

Time:11-30

I used a list to store 4 datasets. Now I want to list all potential ID variables in each dataset. My criteria are: 1) the variable has over 80% unique observations; 2) the variable has no more than 30% missing values.

To get these statistics, I first use the skim() function from the skimr package in R to get a tibble containing all the information, then I use filter() to keep the variables that meet the two criteria above. Here is my code:

 dfa<- dflist[[1]]%>%
      mutate_if(is.numeric,as.character)%>%
      skim()%>%
      as_tibble()%>%
      filter(character.n_unique >=nrow(dflist[[1]])*0.01)%>%
      filter(n_missing<=nrow(dflist[[1]])*0.30)

This code works fine and returns the expected variables for dataset 1. However, I have 4 datasets of different sizes, so I would like to put it into a loop. Here is my attempt: first, I create a list dfid to hold the new results, since I do not want dflist to be modified. Then I changed the 1 in dflist[[1]] from the previous code to i. But this code does not work; R reports: "Error in filter(., dflist[[i]][, character.n_unique] >= nrow(dflist[[1]]) * : Caused by error in [.data.frame: ! undefined columns selected".

Here is my code:

dfid<-list()
for (i in 1:4){
    dfid[[i]]<-dflist[[i]]%>%
            mutate_if(is.numeric,as.character)%>%
            skim()%>%
            as_tibble()%>%
            filter(dflist[[i]][,character.n_unique] >=nrow(dflist[[i]])*0.01)%>%
            filter(dflist[[i]][,n_missing]<=nrow(dflist[[i]])*0.30)
}

So my questions are:

  1. How can I fix this error so that the loop works?
  2. Once dfid[[i]] holds the desired variables from the 4 datasets, what code should I add to the loop to combine them (4 lists), remove duplicate variable names, and finally get a vector of variable names from the combined list or dataset?

Thanks a lot for your help in advance~~!

CodePudding user response:

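Column names must be quoted when indexing with [, unless the name is stored in an object. Inside filter() the summary columns can be referenced directly, so the [ is not needed at all. It may also be easier to loop with map/lapply: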

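library(purrr)
library(dplyr)
library(skimr)
# nrow(.x) is the row count of the original dataset; n() here would
# count the rows of the skim summary (one row per variable) instead
dfid <- map(dflist, ~ .x %>%
      mutate(across(where(is.numeric), as.character)) %>%
      skim() %>%
      as_tibble() %>%
      filter(character.n_unique >= nrow(.x) * 0.01) %>%
      filter(n_missing <= nrow(.x) * 0.30))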
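
As a side note on where the original error probably comes from: inside filter(), the unquoted name character.n_unique resolves to the values of that summary column, and those numbers are then used as column positions in dflist[[i]]. The base R sketch below reproduces the same message with a hypothetical toy data frame; n_unique is only a stand-in for the skim column.

```r
# Hypothetical toy data frame with two columns
df <- data.frame(id = 1:3, grp = c("a", "a", "b"))

# Stand-in for the values of a skim summary column such as
# character.n_unique (one value per variable)
n_unique <- c(3, 2)

# Using those values as column positions asks for column 3,
# which does not exist, so [.data.frame raises an error
msg <- tryCatch(df[, n_unique], error = function(e) conditionMessage(e))
print(msg)

# Indexing by a quoted column name works as expected
ids <- df[, "id"]
```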

We don't need the [ inside the pipe. The same logic written as a for loop:

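dfid <- vector('list', length(dflist))
for (i in seq_along(dflist)) {
    tmp <- dflist[[i]]
    # compare against nrow(tmp), the size of the original dataset;
    # n() would count the rows of the skim summary instead
    dfid[[i]] <- tmp %>%
            mutate_if(is.numeric, as.character) %>%
            skim() %>%
            as_tibble() %>%
            filter(character.n_unique >= nrow(tmp) * 0.01) %>%
            filter(n_missing <= nrow(tmp) * 0.30)
}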
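
For the second question (combining the four results and getting the distinct variable names), one possible approach is to stack the filtered skim tibbles and pull the unique values of skim_variable. This is a minimal sketch, not a tested solution for the asker's data: the toy dflist and the helper find_id_vars are hypothetical, and the thresholds default to the 80% and 30% stated in the question rather than the 0.01 used in the posted code.

```r
library(dplyr)
library(purrr)
library(skimr)

# Hypothetical toy data standing in for the asker's dflist
dflist <- list(
  data.frame(id = 1:10, grp = rep(c("a", "b"), 5),
             stringsAsFactors = FALSE),
  data.frame(key = letters[1:10], id = 1:10, flag = rep("x", 10),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: skim one data frame and keep candidate ID variables
find_id_vars <- function(df, min_unique = 0.80, max_missing = 0.30) {
  df %>%
    mutate(across(where(is.numeric), as.character)) %>%
    skim() %>%
    as_tibble() %>%
    filter(character.n_unique >= nrow(df) * min_unique,
           n_missing <= nrow(df) * max_missing)
}

dfid <- map(dflist, find_id_vars)

# Stack the per-dataset results, then keep each variable name once
id_vars <- dfid %>%
  bind_rows(.id = "dataset") %>%
  distinct(skim_variable) %>%
  pull(skim_variable)

print(id_vars)
```

If only the names are needed, unique(unlist(map(dfid, "skim_variable"))) should give the same vector without building the combined tibble.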