Home > Mobile >  R: Fill the missing values if good inputs persist in data
R: Fill the missing values if good inputs persist in data

Time:01-22

Imagine a dataset:

df1 <- tibble::tribble(~Client, ~Client_ID, ~Year,  ~NPurchases,    ~Age,   ~Sex,   ~Hobby,
"Alex Smith",   88888,  2022,   NA, 30, "Male", "",
"Alex Smith",   88888,  2022,   45, 30, "Male", "Stamps",
"Alex Smith",   88888,  2021,   32, 29, "Male", "Stamps",
"Eugene Hugs",  788272, 2022,   2,  19, "Male", "Cats")

Or, the same dataset for visual representation:

enter image description here

The core columns are Client_ID and Year. For example, Alex Smith has two inputs for 2022. Could you, please, suggest nice methods to:

  1. Remove rows with more missing values
  2. Find duplicated rows with different missing data, look at them (to see if there is no error), and then remove the rows with fewer data.

Any nice suggestions will be much appreciated :)

For example, my instant idea to solve was:

unique(rbindlist(list(df1))[!is.na(NPurchases), .(Client, Client_ID, Year, Age, NPurchases, Sex, Hobby)])

Output: enter image description here

But I am sure there should be more universal and tidy methods :)

DESIRED OUTPUTS:

  1. For the first variant: enter image description here
  2. For the second variant: Similar solution that would provide the data on the rows it deleted.

There is a great solution by dplyr. Can you suggest other nice solutions?

CodePudding user response:

Please try filter_at function from dplyr

library(dplyr)

df2 <- df1 %>% filter_at(vars(everything()), ~!(.x %in% c(NA, ' ')))

Created on 2023-01-21 with reprex v2.0.2

# A tibble: 3 × 7
  Client      Client_ID  Year NPurchases   Age Sex   Hobby 
  <chr>           <dbl> <dbl>      <dbl> <dbl> <chr> <chr> 
1 Alex Smith      88888  2022         45    30 Male  Stamps
2 Alex Smith      88888  2021         32    29 Male  Stamps
3 Eugene Hugs    788272  2022          2    19 Male  Cats  

CodePudding user response:

Alternatively, you may try the below code which is without the filter_at

df2 <- df1 %>% mutate(sum=rowSums(across(everything(), ~(.x %in% c(NA, ' '))))) %>% 
filter(sum==0) %>% select(-sum)

Created on 2023-01-21 with reprex v2.0.2

# A tibble: 3 × 7
  Client      Client_ID  Year NPurchases   Age Sex   Hobby 
  <chr>           <dbl> <dbl>      <dbl> <dbl> <chr> <chr> 
1 Alex Smith      88888  2022         45    30 Male  Stamps
2 Alex Smith      88888  2021         32    29 Male  Stamps
3 Eugene Hugs    788272  2022          2    19 Male  Cats  

  • Related