I have a huge panel data set with daily data. I would like to remove all days for which I have missing data of more than 25% of the observations in the column "Size".
I created the following data to show how my real data looks like:
structure(list(Product = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C",
"C", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018", "02.09.2018",
"03.09.2018", "04.09.2018", "05.09.2018", "11.11.2020", "12.11.2020",
"13.11.2020", "14.11.2020", "01.09.2018", "02.09.2018", "03.09.2018",
"04.09.2018", "05.09.2018", "11.11.2020", "12.11.2020", "13.11.2020",
"14.11.2020", "01.09.2018", "02.09.2018", "03.09.2018", "04.09.2018",
"05.09.2018", "11.11.2020", "12.11.2020", "13.11.2020", "14.11.2020"
), Size = c(10L, 9L, NA, 3L, 4L, 5L, 3L, NA, 6L, 7L, 4L, NA,
4L, 6L, 6L, 4L, 6L, 7L, 3L, 4L, NA, 2L, 4L, NA, 7L, 7L, 5L)), class = "data.frame", row.names = c(NA,
-27L))
I already tried the following but I got stock on how to continue with the code:
Data %>% summarize(group_by(Date), NoData=(is.na(Size))
Then I got the error that I cannot use group_by for an object of the class "Date". Further, I don't know how I can automatically remove the days where I have more than 25% of missing values in the column "Size".
Could anyone help me here with the code that works for my problem?
I appreciate your help.
CodePudding user response:
If you summarize()
, you lose lots of information on the individual days. Furthermore, use group_by()
before further dplyr verbs. You can calculate the percentage of NA by dividing the sum of NA by the sum of days. as_tibble()
is only used to better show the number of rows, it would work without it too. I added a column "CountDate" so that you know how many times the same day appears in your dataframe.
Data %>% as_tibble() %>%
group_by(Date) %>%
mutate(CountDate = n(), PercNA = sum(is.na(Size))/n()*100)
# A tibble: 27 x 5
# Groups: Date [9]
Product Date Size CountDate PercNA
<chr> <chr> <int> <int> <dbl>
1 A 01.09.2018 10 3 0
2 A 02.09.2018 9 3 0
3 A 03.09.2018 NA 3 100
4 A 04.09.2018 3 3 0
5 A 05.09.2018 4 3 0
6 A 11.11.2020 5 3 33.3
7 A 12.11.2020 3 3 0
8 A 13.11.2020 NA 3 33.3
9 A 14.11.2020 6 3 0
10 B 01.09.2018 7 3 0
# ... with 17 more rows
To remove the dates having >25% NA, just filter()
:
Data %>% as_tibble() %>%
group_by(Date) %>%
mutate(CountDate = n(), PercNA = sum(is.na(Size))/n()*100) %>%
filter(PercNA <25) %>%
ungroup()
# A tibble: 18 x 5
Product Date Size CountDate PercNA
<chr> <chr> <int> <int> <dbl>
1 A 01.09.2018 10 3 0
2 A 02.09.2018 9 3 0
3 A 04.09.2018 3 3 0
4 A 05.09.2018 4 3 0
5 A 12.11.2020 3 3 0
6 A 14.11.2020 6 3 0
7 B 01.09.2018 7 3 0
8 B 02.09.2018 4 3 0
9 B 04.09.2018 4 3 0
10 B 05.09.2018 6 3 0
11 B 12.11.2020 4 3 0
12 B 14.11.2020 7 3 0
13 C 01.09.2018 3 3 0
14 C 02.09.2018 4 3 0
15 C 04.09.2018 2 3 0
16 C 05.09.2018 4 3 0
17 C 12.11.2020 7 3 0
18 C 14.11.2020 5 3 0
CodePudding user response:
You need to divide is.na(Size)
on the number of observations in order to get the percentage. You can get the number of observations each day with dplyr::n()
.
The filtering can also be done in one single operation, without summarize()
or mutate()
, like this:
library(dplyr)
Data %>%
group_by(Date) %>%
filter(is.na(Size)/n() <= 0.25)
#> # A tibble: 22 × 3
#> # Groups: Date [8]
#> Product Date Size
#> <chr> <chr> <int>
#> 1 A 01.09.2018 10
#> 2 A 02.09.2018 9
#> 3 A 04.09.2018 3
#> 4 A 05.09.2018 4
#> 5 A 11.11.2020 5
#> 6 A 12.11.2020 3
#> 7 A 14.11.2020 6
#> 8 B 01.09.2018 7
#> 9 B 02.09.2018 4
#> 10 B 04.09.2018 4
#> # … with 12 more rows
Created on 2022-04-15 by the reprex package (v2.0.1)