Calculate the average number of consecutive NaN between numeric values in a data frame column in R

Time:10-06

This is what my data looks like:

 dput(head(dataf,40))
structure(list(Time = c(20000224, 20000225, 20000226, 20000227, 
20000228, 20000229, 20000301, 20000302, 20000303, 20000304, 20000305, 
20000306, 20000307, 20000308, 20000309, 20000310, 20000311, 20000312, 
20000313, 20000314, 20000315, 20000316, 20000317, 20000318, 20000319, 
20000320, 20000321, 20000322, 20000323, 20000324, 20000325, 20000326, 
20000327, 20000328, 20000329, 20000330, 20000331, 20000401, 20000402, 
20000403), NDVI = c(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 
NaN, NaN, NaN, NaN, 0.562256741, 0.560792828, 0.559272321, 0.554727235, 
NaN, NaN, NaN, NaN, NaN, NaN, NaN, 0.506596306, 0.504768168, 
0.536223787, 0.5476394, 0.520311992, 0.52862363, NaN, NaN, NaN, 
NaN, NaN, NaN, NaN, NaN, NaN, NaN)), row.names = c(NA, -40L), class = c("tbl_df", 
"tbl", "data.frame"))

I want to know the mean number of consecutive NaN values that occur between two numeric values in a column (NDVI) of a data frame.

See the example below:

Date NDVI
20000224 NaN
20000225 NaN
20000226 NaN
20000227 NaN
20000228 NaN
20000229 NaN
20000301 NaN
20000302 NaN
20000303 NaN
20000304 NaN
20000305 NaN
20000306 NaN
20000307 NaN
20000308 0.562256741
20000309 0.560792828
20000310 0.559272321
20000311 0.554727235
20000312 NaN
20000313 NaN
20000314 NaN
20000315 NaN
20000316 NaN
20000317 NaN
20000318 NaN
20000319 0.506596306
20000320 0.504768168
20000321 0.536223787
20000322 0.5476394
20000323 0.520311992
20000324 0.52862363

For this example, the mean number of consecutive NaN would be:

including the initial NaNs: 10 (the mean of the runs of 13 and 7)

not including the initial NaNs: 7

Note that I chose this data frame on purpose because it starts with NaN. I am looking for two solutions: one that includes the NaN at the beginning when computing the mean number of NaN between values, and another that only starts counting NaN after the first numeric value.

Any help will be much appreciated.

CodePudding user response:

library(dplyr)

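dataf %>% 
  # start a new group at every non-missing value, so each NaN run
  # is grouped with the numeric value that precedes it
  group_by(cumsum(!is.na(NDVI))) %>% 
  # running count of NaN within each group
  mutate(cnt = cumsum(is.na(NDVI))) %>% 
  # drop the numeric rows (they have a count of 0)
  filter(cnt != 0) %>% 
  # the largest count in a group is the length of its NaN run
  summarise(consecutive_NaNs = max(cnt)) %>% 
  select(consecutive_NaNs)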

Returns:

  consecutive_NaNs
             <int>
1               13
2                7
3               10
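To see why the grouping works, it may help to trace the intermediate values on a small vector. This is only an illustrative sketch in base R: the vector `x` and the names `grp`, `cnt` and `run_max` are made up for the example and are not part of the original data or answer.

```r
# Toy vector (hypothetical, not the question's data)
x <- c(NaN, NaN, 0.5, 0.6, NaN, NaN, NaN, 0.7)

# Group id increases at every non-missing value,
# so each NaN run shares a group with the value before it
grp <- cumsum(!is.na(x))
# grp is 0 0 1 2 2 2 2 3

# Running count of NaN within each group (base R stand-in for group_by + mutate)
cnt <- ave(as.integer(is.na(x)), grp, FUN = cumsum)
# cnt is 1 2 0 0 1 2 3 0

# Largest count per group; groups without NaN have a maximum of 0
run_max <- tapply(cnt, grp, max)
run_lengths <- unname(run_max[run_max > 0])
print(run_lengths)
# prints 2 and 3, the lengths of the two NaN runs
```

The leading run ends up in group 0 because no numeric value has been seen yet, which is why the dplyr pipeline also reports the initial run.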

If you only want the mean:

dataf %>% 
  group_by(cumsum(!is.na(NDVI))) %>% 
  mutate(cnt = cumsum(is.na(NDVI))) %>% 
  filter(cnt != 0) %>% 
  summarise(consecutive_NaNs = max(cnt)) %>% 
  select(consecutive_NaNs) %>% 
  summarise(mean_run_length = mean(consecutive_NaNs))

Returns:

   mean_run_length
            <dbl>
1              10

P.S.

If you use the first solution, which returns all run lengths, you can decide for yourself whether or not to include the initial run before taking the mean.
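One possible way to make that choice explicit is base R's `rle()`, which is a different technique from the dplyr pipeline above. The sketch below rebuilds the NDVI column from the question's `dput()` output; the variable names (`ndvi`, `runs`, `nan_runs`, `interior`) are only illustrative.

```r
# NDVI values as given in the question's dput() output
ndvi <- c(rep(NaN, 13),
          0.562256741, 0.560792828, 0.559272321, 0.554727235,
          rep(NaN, 7),
          0.506596306, 0.504768168, 0.536223787, 0.5476394,
          0.520311992, 0.52862363,
          rep(NaN, 10))

# Run-length encode the missing / non-missing flags
runs <- rle(is.na(ndvi))

# Lengths of all NaN runs, including the leading one: 13, 7, 10
nan_runs <- runs$lengths[runs$values]

# Same, but without the leading run when the series starts with NaN: 7, 10
nan_runs_no_lead <- if (runs$values[1]) nan_runs[-1] else nan_runs

# Only runs strictly between two numeric values (no leading or trailing run): 7
n <- length(runs$values)
interior <- runs$values & (seq_len(n) > 1) & (seq_len(n) < n)

print(mean(nan_runs))                # 10
print(mean(nan_runs_no_lead))        # 8.5
print(mean(runs$lengths[interior]))  # 7
```

Note that the 40-row sample also ends with a run of 10 NaN, so "between two numeric values" in the strict sense leaves only the run of 7.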
