Home > Enterprise >  R - Creating/subset dataframes within loop or apply family
R - Creating/subset dataframes within loop or apply family

Time:11-03

I would like to subset a R dataframe if multiple variables (ending in same suffix) do NOT contain NAs.

employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary1 <- c(NA, 20400, 26800)
salary2 <- c(29045, NA, 78765)
date1 <- as.Date(c(NA,'2008-3-25','2007-3-14'))
date2 <- as.Date(c('2010-11-1',NA,'2007-3-14'))

employ.data <- data.frame(employee, salary1, salary2, date1, date2)

for (i in 1:2) {
  assign("employ.data", i, subset(employ.data, !is.na(employ.data$date[i]) & !is.na(employ.data$salary[i])))
}

Final result would hopefully produce two separate dataframes looking something like:

employ.data1:
| employee   | salary1 | salary2 | date1      | date2      |
| Peter Gynn | 20400   | NA      | 2008-03-25 | NA         |
| Jolie Hope | 26800   | 78765   | 2007-03-14 | 2007-03-14 |

employ.data2:
| employee   | salary1 | salary2 | date1      | date2      |
| John Doe   | NA      | 29045   | NA         | 2010-11-01 |
| Jolie Hope | 26800   | 78765   | 2007-03-14 | 2007-03-14 |

Thanks in advance!

CodePudding user response:

(Up front: it's generally better to deal with a list of frames instead of using assign to dynamically create objects.)

numfields <- grep("[0-9] $", colnames(employ.data), value = TRUE)
split(numfields, gsub(".*?([0-9] )$", "\\1", numfields))
# $`1`
# [1] "salary1" "date1"  
# $`2`
# [1] "salary2" "date2"  
out <- lapply(split(numfields, gsub(".*?([0-9] )$", "\\1", numfields)),
              function(flds) employ.data[ complete.cases(subset(employ.data, select = flds)), ])
out
# $`1`
#     employee salary1 salary2      date1      date2
# 2 Peter Gynn   20400      NA 2008-03-25       <NA>
# 3 Jolie Hope   26800   78765 2007-03-14 2007-03-14
# $`2`
#     employee salary1 salary2      date1      date2
# 1   John Doe      NA   29045       <NA> 2010-11-01
# 3 Jolie Hope   26800   78765 2007-03-14 2007-03-14

This will dynamically find all numbered fields, though admittedly it does not enforce "pairs".

CodePudding user response:

The non_na function returns the rows for which there are no missing values in the columns whose names end in i. It first defines those column names as cn and then uses complete.cases to define a logical vector which indicates which rows have no NA's in those columns and then returns those rows from employ.data.

Assume that the only numbers in columns are at the end of names and that the unique such numbers define the possible suffixes. The stems are those unique names after removing the suffix.
We assume that column 1 is to be excluded from this calculation. We use gsub/unique to define suffixes and stems. Alternately hard code suffixes <- as.character(1:2) and stems <- c("salary", "date") .

Finally run non_na on each suffix creating a list whose i component contains a data frame with the rows corresponding to i. L[[i]] will be the i data frame.

non_na <- function(i) {
  cn <- paste0(stems, i)
  employ.data[complete.cases(employ.data[cn]), ]
}

nms <- names(employ.data)[-1]
suffixes <- unique(gsub("\\D", "", nms)) # c("1", "2")
stems <- unique(gsub("\\d", "", nms)) # c("salary", "date")
L <- Map(non_na, suffixes); L

giving:

$`1`
    employee salary1 salary2      date1      date2
2 Peter Gynn   20400      NA 2008-03-25       <NA>
3 Jolie Hope   26800   78765 2007-03-14 2007-03-14

$`2`
    employee salary1 salary2      date1      date2
1   John Doe      NA   29045       <NA> 2010-11-01
3 Jolie Hope   26800   78765 2007-03-14 2007-03-14
  • Related