Deleting rows with conditions from entire dataframe-CodePudding

I have a data.frame with mix numeric scale values and other continuous values. However, the missing data represented by the following values -1 and 8. this is an example:

df = data.frame(Name = c('George','Andrea', 'Micheal','Maggie','Ravi','Xien','Jalpa'), 
                 Grade_score=c(4,6,2,9,5,7,8),
                 Mathematics1_score=c(45,78,44,89,66,49,72),
                 Science_score=c(-1,52,45,88,-1,90,47),
                 Science_scale=c(-1,5,5,8,3,0,7))

I want to delete any row that has any of the missing data from the entire data.frame. I can do it column by column but I have more than 25 columns.

df2<-subset(df, df$x1 !="-1")

CodePudding user response：

A one liner using base R:

df[apply(df[,-1],1, \(x) !any(x %in% c(-1,8))),]

Or, you can convert to NA and use na.omit()

library(dplyr)  
df %>% 
  mutate(across(-Name, ~if_else(.x %in% c(-1,8),NA_real_,.x))) %>% 
  na.omit()

Output:

      Name Grade_score Mathematics1_score Science_score Science_scale
    <char>       <num>              <num>         <num>         <num>
1:  Andrea           6                 78            52             5
2: Micheal           2                 44            45             5
3:    Xien           7                 49            90             0

CodePudding user response：

library(tidyverse)

df %>%  
  filter(if_all(everything(), ~ !(.x %in% c(-1, 8))))

# A tibble: 3 × 5
  Name    Grade_score Mathematics1_score Science_score Science_scale
  <chr>         <dbl>              <dbl>         <dbl>         <dbl>
1 Andrea            6                 78            52             5
2 Micheal           2                 44            45             5
3 Xien              7                 49            90             0

CodePudding user response：

base R

df[rowSums(sapply(df, `%in%`, c(-1, 8))) < 1,]
#      Name Grade_score Mathematics1_score Science_score Science_scale
# 2  Andrea           6                 78            52             5
# 3 Micheal           2                 44            45             5
# 6    Xien           7                 49            90             0

dplyr

(While TomHoel's answer shows this as well, edited after I started drafting this, I'll keep it due to my discussion of multiple-standards in the note.)

library(dplyr)
df %>%
  filter(if_all(everything(), ~ !. %in% c(-1, 8)))
#      Name Grade_score Mathematics1_score Science_score Science_scale
# 1  Andrea           6                 78            52             5
# 2 Micheal           2                 44            45             5
# 3    Xien           7                 49            90             0

Note: if the Grade_score column should not be filtered out for 8 values (since that seems to be on the same scale as the others), then we can split the logic:

df %>%
  filter(
    if_all(-Grade_score, ~ !. %in% c(-1, 8)),
    Grade_score != -1
  )
#      Name Grade_score Mathematics1_score Science_score Science_scale
# 1  Andrea           6                 78            52             5
# 2 Micheal           2                 44            45             5
# 3    Xien           7                 49            90             0
# 4   Jalpa           8                 72            47             7

If you have multiple such columns, then we can do

df %>%
  filter(
    if_all(-c(Grade_score, Science_scale), ~ !. %in% c(-1, 8)), 
    if_all(c(Grade_score, Science_scale), ~ . != -1)
  )
#      Name Grade_score Mathematics1_score Science_score Science_scale
# 1  Andrea           6                 78            52             5
# 2 Micheal           2                 44            45             5
# 3  Maggie           9                 89            88             8
# 4    Xien           7                 49            90             0
# 5   Jalpa           8                 72            47             7

and the base R variant, keeping the use of `%in%` for convenience only.

df[rowSums(cbind(
  sapply(df[,-c(1:2, 5)], `%in%`, c(-1, 8)),
  sapply(df[,c(1:2, 5)], `%in%`, -1))) < 1,]
#      Name Grade_score Mathematics1_score Science_score Science_scale
# 2  Andrea           6                 78            52             5
# 3 Micheal           2                 44            45             5
# 4  Maggie           9                 89            88             8
# 6    Xien           7                 49            90             0
# 7   Jalpa           8                 72            47             7

CodePudding user response：

Here is another option for base R. We can replace the values you want with NA for the full dataset and then just omit all the rows with NA.

na.omit(replace(df, df ==-1 | df == 8, NA))
#>      Name Grade_score Mathematics1_score Science_score Science_scale
#> 2  Andrea           6                 78            52             5
#> 3 Micheal           2                 44            45             5
#> 6    Xien           7                 49            90             0

CodePudding user response：

When using apply beware of conversions to character in frames with character and/or numeric, since it may introduce leading/trailing spaces (apply expects and converts to matrices which can only hold one data class/type).

Since

apply(df, 2, function(x) grep(8, x, value = T))
$Name
character(0)

$Grade_score
[1] "8"

$Mathematics1_score
[1] "78" "89"

$Science_score
[1] "88"

$Science_scale
[1] " 8"               # <-------- " 8" will make direct comparisons FALSE

either use trimws

df[!apply(df, 1, function(x) any(trimws(x) %in% c(-1, 8))), ]
     Name Grade_score Mathematics1_score Science_score Science_scale
2  Andrea           6                 78            52             5
3 Micheal           2                 44            45             5
6    Xien           7                 49            90             0

or use, as pointed out by the other answers, apply without character columns (may be difficult to find/see/know in some cases!) sapply/lapply or libraries like dplyr