Counting NA from R Dataframe in a for loop

I have a time-series data frame in R covering 2011 to 2018. How can I write a for loop that counts the number of NA values for each year separately and, if a given year has more than x% missing, drops that year or handles it some other way?

Please refer to the image to see what my data frame looks like.

https://i.stack.imgur.com/2fwDk.png

years_values <- 2011:2020

years = pretty(years_values,n=10)
count = 0
for (y in years){
   for (j in df$Flow == y) {
     if (is.na(df$Flow[j])) {
       count = count + 1
     }
   }
   if (count > 1) {
     bfi = BFI(df$Flow == y)
   } else {
     bfi = NA
   }
}

I am trying to use this code to loop over each year and count the NA values. If more than 1% of a year's values are NA, I do not want to compute the BFI; if fewer are missing, I do want to compute it. My BFI function already works well. The problem I have is formulating this loop.

CodePudding user response:

Since you have not included any reproducible data, let us take a simple example that captures the essence of your own data. We have a column called Year and one called Flow that contains some missing values:

df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

df
#>    Year Flow
#> 1  2011    1
#> 2  2011    2
#> 3  2011   NA
#> 4  2011   NA
#> 5  2012    5
#> 6  2012    6
#> 7  2012   NA
#> 8  2012    8
#> 9  2013    9
#> 10 2013   10
#> 11 2013   11
#> 12 2013   12
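
If your real data has a date column rather than a ready-made Year column, you would first need to derive the year. Here is a minimal sketch, assuming a column of class Date; the names ts_df and Date are made up for illustration, since the original data is only shown as an image:

```r
# Hypothetical daily series; the data frame and the column name `Date`
# are assumptions, not taken from the question's data.
ts_df <- data.frame(
  Date = seq(as.Date("2011-12-30"), as.Date("2012-01-02"), by = "day"),
  Flow = c(1.2, NA, 0.9, 1.1)
)

# Extract the calendar year of each date as an integer.
ts_df$Year <- as.integer(format(ts_df$Date, "%Y"))

ts_df$Year
```

After this step the data has the same Year/Flow layout as the example above.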

Now suppose we want to count the number of missing values in each year. We can use table and is.na, like this:

tab <- table(df$Year, is.na(df$Flow))

tab
#>       
#>        FALSE TRUE
#>   2011     2    2
#>   2012     3    1
#>   2013     4    0

The TRUE column holds the absolute counts of missing values for each year. We can convert these into proportions by dividing the second column by the row sums of this table:

props <- tab[,2] / rowSums(tab)

props
#> 2011 2012 2013 
#> 0.50 0.25 0.00
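
The same proportions can probably be obtained in one step, because the mean of a logical vector is the share of TRUE values. This is only a sketch of an alternative, and props_alt is a name I made up. It has the side benefit of still working when no value is missing at all, in which case the table would have no TRUE column to index:

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# is.na() returns TRUE/FALSE; averaging it within each year gives the
# proportion of missing values for that year.
props_alt <- tapply(is.na(df$Flow), df$Year, mean)

props_alt
```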

Now, suppose we want to find and remove the years where more than 33% of cases are missing. We can just filter the values of props that are greater than 0.33 and get the associated year (or years):

years_to_drop <- names(props)[props > 0.33]

years_to_drop
#> [1] "2011"

Now we can use this to remove the years with more than 33% missing values from our original data frame by doing:

df[!df$Year %in% years_to_drop,]
#>    Year Flow
#> 5  2012    5
#> 6  2012    6
#> 7  2012   NA
#> 8  2012    8
#> 9  2013    9
#> 10 2013   10
#> 11 2013   11
#> 12 2013   12

Created on 2022-11-14 with reprex v2.0.2
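
If you do want an explicit loop that computes one value per year, something along these lines should work. It is a sketch under assumptions: bfi_stub is a placeholder I invented to stand in for your own BFI function (here it just averages the non-missing flows), and threshold is the maximum share of NA you are willing to accept (0.01 for your 1% rule; 0.33 here to match the example).

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# Placeholder for the asker's BFI(); NOT the real baseflow index.
bfi_stub <- function(flow) mean(flow, na.rm = TRUE)

threshold <- 0.33  # maximum tolerated share of NA per year

years <- sort(unique(df$Year))

# One result slot per year, pre-filled with NA and named by year.
bfi <- setNames(rep(NA_real_, length(years)), years)

for (y in years) {
  flow_y <- df$Flow[df$Year == y]   # flows belonging to this year only
  na_share <- mean(is.na(flow_y))   # proportion of missing values
  if (na_share <= threshold) {
    bfi[as.character(y)] <- bfi_stub(flow_y)
  }
  # Years above the threshold keep their initial NA.
}

bfi
```

Note that the NA share is recomputed for every year, so there is no counter that has to be reset between iterations.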

CodePudding user response:

As Allan Cameron suggests, there's no need to use a loop, and R is usually more efficient with vectorized operations anyway.

I would suggest a solution based on ave (using the synthetic data from the previous answer):

df$NA_fraction <- ave(df$Flow, df$Year, FUN = \(values) mean(is.na(values)))

df
   Year Flow NA_fraction
1  2011    1        0.50
2  2011    2        0.50
3  2011   NA        0.50
4  2011   NA        0.50
5  2012    5        0.25
6  2012    6        0.25
7  2012   NA        0.25
8  2012    8        0.25
9  2013    9        0.00
10 2013   10        0.00
11 2013   11        0.00
12 2013   12        0.00

You can then pick whatever threshold you like and filter by it:

> df[df$NA_fraction < 0.3,]
   Year Flow NA_fraction
5  2012    5        0.25
6  2012    6        0.25
7  2012   NA        0.25
8  2012    8        0.25
9  2013    9        0.00
10 2013   10        0.00
11 2013   11        0.00
12 2013   12        0.00
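
If you would rather keep every row (for example, to keep the time series regular) and only blank out the years that fail the check, the same NA_fraction column can be used as a mask. This is just a sketch: Flow_clean is a column name I made up, and the 0.3 cutoff is the same arbitrary threshold as above.

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# Per-row share of missing values within that row's year.
df$NA_fraction <- ave(df$Flow, df$Year,
                      FUN = function(values) mean(is.na(values)))

# Keep Flow where the year passes the threshold, otherwise set it to NA.
df$Flow_clean <- ifelse(df$NA_fraction < 0.3, df$Flow, NA)

df[, c("Year", "Flow", "Flow_clean")]
```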