R: Calculate percentage of missing Values (NA) per day for a Column in a data frame using panel data-CodePudding

I have a huge panel data set with daily data. I would like to remove all days for which I have missing data of more than 25% of the observations in the column "Size".

I created the following data to show how my real data looks like:

structure(list(Product = c("A", "A", "A", "A", "A", "A", "A", 
"A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", 
"C", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018", "02.09.2018", 
"03.09.2018", "04.09.2018", "05.09.2018", "11.11.2020", "12.11.2020", 
"13.11.2020", "14.11.2020", "01.09.2018", "02.09.2018", "03.09.2018", 
"04.09.2018", "05.09.2018", "11.11.2020", "12.11.2020", "13.11.2020", 
"14.11.2020", "01.09.2018", "02.09.2018", "03.09.2018", "04.09.2018", 
"05.09.2018", "11.11.2020", "12.11.2020", "13.11.2020", "14.11.2020"
), Size = c(10L, 9L, NA, 3L, 4L, 5L, 3L, NA, 6L, 7L, 4L, NA, 
4L, 6L, 6L, 4L, 6L, 7L, 3L, 4L, NA, 2L, 4L, NA, 7L, 7L, 5L)), class = "data.frame", row.names = c(NA, 
-27L))

I already tried the following but I got stock on how to continue with the code:

Data %>% summarize(group_by(Date), NoData=(is.na(Size))

Then I got the error that I cannot use group_by for an object of the class "Date". Further, I don't know how I can automatically remove the days where I have more than 25% of missing values in the column "Size".

Could anyone help me here with the code that works for my problem?

I appreciate your help.

CodePudding user response：

If you summarize(), you lose lots of information on the individual days. Furthermore, use group_by() before further dplyr verbs. You can calculate the percentage of NA by dividing the sum of NA by the sum of days. as_tibble() is only used to better show the number of rows, it would work without it too. I added a column "CountDate" so that you know how many times the same day appears in your dataframe.

Data %>% as_tibble() %>%  
  group_by(Date) %>% 
  mutate(CountDate = n(), PercNA = sum(is.na(Size))/n()*100)

# A tibble: 27 x 5
# Groups:   Date [9]
   Product Date        Size CountDate PercNA
   <chr>   <chr>      <int>     <int>  <dbl>
 1 A       01.09.2018    10         3    0  
 2 A       02.09.2018     9         3    0  
 3 A       03.09.2018    NA         3  100  
 4 A       04.09.2018     3         3    0  
 5 A       05.09.2018     4         3    0  
 6 A       11.11.2020     5         3   33.3
 7 A       12.11.2020     3         3    0  
 8 A       13.11.2020    NA         3   33.3
 9 A       14.11.2020     6         3    0  
10 B       01.09.2018     7         3    0  
# ... with 17 more rows

To remove the dates having >25% NA, just filter():

Data %>% as_tibble() %>%  
  group_by(Date) %>% 
  mutate(CountDate = n(), PercNA = sum(is.na(Size))/n()*100) %>%
  filter(PercNA <25) %>% 
  ungroup()

# A tibble: 18 x 5
   Product Date        Size CountDate PercNA
   <chr>   <chr>      <int>     <int>  <dbl>
 1 A       01.09.2018    10         3      0
 2 A       02.09.2018     9         3      0
 3 A       04.09.2018     3         3      0
 4 A       05.09.2018     4         3      0
 5 A       12.11.2020     3         3      0
 6 A       14.11.2020     6         3      0
 7 B       01.09.2018     7         3      0
 8 B       02.09.2018     4         3      0
 9 B       04.09.2018     4         3      0
10 B       05.09.2018     6         3      0
11 B       12.11.2020     4         3      0
12 B       14.11.2020     7         3      0
13 C       01.09.2018     3         3      0
14 C       02.09.2018     4         3      0
15 C       04.09.2018     2         3      0
16 C       05.09.2018     4         3      0
17 C       12.11.2020     7         3      0
18 C       14.11.2020     5         3      0

CodePudding user response：

You need to divide is.na(Size) on the number of observations in order to get the percentage. You can get the number of observations each day with dplyr::n().

The filtering can also be done in one single operation, without summarize() or mutate(), like this:

library(dplyr)

Data %>% 
  group_by(Date) %>% 
  filter(is.na(Size)/n() <= 0.25)
#> # A tibble: 22 × 3
#> # Groups:   Date [8]
#>    Product Date        Size
#>    <chr>   <chr>      <int>
#>  1 A       01.09.2018    10
#>  2 A       02.09.2018     9
#>  3 A       04.09.2018     3
#>  4 A       05.09.2018     4
#>  5 A       11.11.2020     5
#>  6 A       12.11.2020     3
#>  7 A       14.11.2020     6
#>  8 B       01.09.2018     7
#>  9 B       02.09.2018     4
#> 10 B       04.09.2018     4
#> # … with 12 more rows

^{Created on 2022-04-15 by the reprex package (v2.0.1)}