How to "skip" rows if a function's conditions are not met or the function fails

I'm new to R. Lately I often work with large time series data sets with measurements every 10 minutes or every hour over several years. The data often contains errors, but it is not a problem if the results have occasional gaps. The problem is that R throws an error whenever some condition is not met, and the code then stops. I never know how to tell R that it should simply skip a case when something doesn't apply. I know it's probably a stupid question, but it's super tedious to go through so many data sets and take out all the problem cases by hand.

For example, in this case I want to analyse the (Pearson) correlation between air temperature and water temperature per day. On 01.01.2005 there is unfortunately only one measurement, which stops everything because R cannot compute a correlation from a single pair of values (of course). How can I make R skip cases like this?

My dataset looks like this

Tag	Wassertemperatur	Lufttemperatur
2004-12-12	Value	Value
2004-12-12	Value	Value
2004-12-12	Value	Value
2004-12-12	Value	Value
2004-12-12	Value	Value
2004-12-12	Value	Value
2005-01-01	Value	Value
2005-01-02	Value	Value
2005-01-02	Value	Value
2005-01-02	Value	Value

To do the correlation per day I tried this:

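library(dplyr)  # %>%, group_by(), summarise(), mutate()
library(purrr)  # map()

t_h_korr <- t_h %>% 
  group_by(Tag) %>% 
  summarise(korr_test = list(cor.test(Wassertemperatur,Lufttemperatur))) %>% 
  # note: "estimate" is Pearson's r, not R squared, despite the column name
  mutate(rsquared = unlist(map(korr_test, "estimate")),
         pval = unlist(map(korr_test, "p.value")))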

And I get this error:

Error in `summarise()`:
! Problem while computing `korr_test = list(cor.test(Wassertemperatur,
  Lufttemperatur))`.
i The error occurred in group 372: Tag = 2005-01-01.
Caused by error in `cor.test.default()`:
! not enough finite observations
Run `rlang::last_error()` to see where the error occurred.

I know maybe the correlation itself does not make any sense - I want to find that out - but I always have to stop because there is a mistake in the data set. I hope I have explained my problem well.

CodePudding user response：

You can write a custom function that returns NA when a given date does not have enough observations for cor.test().

Reproducible example:

library(dplyr)  # provides %>%, group_by(), summarise(), mutate(), select()

set.seed(123)

t_h_korr <- data.frame(Tag = as.Date(c('2000-01-01',
                                       '2000-01-01',
                                       '2000-01-01',
                                       '2000-01-01',
                                       '2000-01-02',
                                       '2000-01-03',
                                       '2000-01-03',
                                       '2000-01-03',
                                       '2000-01-03')),
                       Wassertemperatur = rnorm(9, mean = 5,
                                                sd = 1),
                       Lufttemperatur = rnorm(9, mean = 10,
                                              sd = 1))

Custom function:

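corr2 <- function(x, y){
  result <- list()
  # Pearson cor.test() needs at least 3 complete pairs; with fewer it stops
  # with "not enough finite observations", so only run it when there are > 2
  if(sum(complete.cases(x, y)) > 2){
    correlation <- cor.test(x, y)
    result[[1]] <- correlation$estimate
    result[[2]] <- correlation$p.value
  }else{
    # too few observations for this date: return NA instead of failing
    result[[1]] <- NA
    result[[2]] <- NA
  }
  return(result)
}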

Usage example:

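# corr2() returns a two-element list, so summarise() produces two rows per day.
# The grouped mutate() spreads the two elements into columns, and unique()
# collapses the duplicated rows back to one row per day.
t_h_korr %>% group_by(Tag) %>%
  summarise(korr_test = corr2(Wassertemperatur, Lufttemperatur)) %>% 
  mutate(estimate = korr_test[[1]], pvalue = korr_test[[2]]) %>% 
  select(-korr_test) %>% 
  unique()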

## # A tibble: 3 × 3
## # Groups:   Tag [3]
## Tag        estimate pvalue
## <date>        <dbl>  <dbl>
## 1 2000-01-01   0.123   0.877
## 2 2000-01-02  NA      NA    
## 3 2000-01-03   0.0960  0.904
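
A more general variant of the same idea, useful when the failure condition is not known in advance, is to wrap the call in base R's tryCatch() and return NA on any error. The sketch below is only an illustration under a few assumptions: safe_cor is a hypothetical helper name (not part of any package), the sample data mirrors the reproducible example above, and the per-day loop uses base R split() and lapply() rather than dplyr.

```r
# Hypothetical helper: returns NA for both fields whenever cor.test() fails.
safe_cor <- function(x, y) {
  tryCatch({
    ct <- cor.test(x, y)
    data.frame(estimate = unname(ct$estimate), pvalue = ct$p.value)
  }, error = function(e) {
    # Any error (e.g. "not enough finite observations") ends up here
    data.frame(estimate = NA_real_, pvalue = NA_real_)
  })
}

# Sample data: the middle day has a single measurement
set.seed(123)
d <- data.frame(
  Tag = as.Date(c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-01",
                  "2000-01-02",
                  "2000-01-03", "2000-01-03", "2000-01-03", "2000-01-03")),
  Wassertemperatur = rnorm(9, mean = 5, sd = 1),
  Lufttemperatur = rnorm(9, mean = 10, sd = 1)
)

# Split by day, run the safe wrapper on each piece, and bind the rows together
per_day <- lapply(split(d, d$Tag), function(g) {
  safe_cor(g$Wassertemperatur, g$Lufttemperatur)
})
result <- do.call(rbind, per_day)
result <- cbind(Tag = as.Date(names(per_day)), result)
rownames(result) <- NULL
print(result)
```

Catching every error is convenient but can also hide real problems in the data, so an explicit check like the one in corr2 may be preferable when the failure condition is known.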

CodePudding user response：

Welcome to Stack Overflow and R!

Here's my answer using the dplyr package:

# Load packages
library(dplyr)

# Create sample dataframe
set.seed(123)
df <- data.frame(
  date = as.Date(c("2004/12/12", "2004/12/12", "2004/12/12", "2004/12/12",
                   "2004/12/12", "2004/12/12", "2005/01/01", "2005/01/02", 
                   "2005/01/02", "2005/01/02")),
  air_temp = sample(x = seq(from = 20, to = 30, by = 0.01), size = 10),
  water_temp = sample(x = seq(from = 15, to = 20, by = 0.01), size = 10))

# If a group has more than 2 observations, compute the correlation estimate and p-value.
# If not, write NA. Note: "rsquared" holds Pearson's r (the estimate), not R squared.
df %>% 
  group_by(date) %>% 
  summarise(rsquared = ifelse(n() > 2, cor.test(air_temp, water_temp)$estimate[[1]], NA),
            pval = ifelse(n() > 2, cor.test(air_temp, water_temp)$p.value, NA))

# Output
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
# A tibble: 3 × 3
# Groups:   date [3]
  date       rsquared   pval
  <date>        <dbl>  <dbl>
1 2004-12-12   -0.441  0.381
2 2005-01-01   NA     NA    
3 2005-01-02    0.976  0.140
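
The approach above calls cor.test() twice per group, once for each output column. A possible variant that runs the test only once per day is sketched below. It assumes dplyr 1.0.0 or later, where summarise() unpacks an unnamed data frame result into columns. The helper name cor_once and the extra column n are hypothetical additions for illustration.

```r
library(dplyr)

# Hypothetical helper: one cor.test() call per group, NA when there are
# fewer than 3 complete pairs (the minimum for a Pearson cor.test()).
cor_once <- function(x, y) {
  ok <- complete.cases(x, y)
  if (sum(ok) < 3) {
    return(data.frame(r = NA_real_, pval = NA_real_, n = sum(ok)))
  }
  ct <- cor.test(x[ok], y[ok])
  data.frame(r = unname(ct$estimate), pval = ct$p.value, n = sum(ok))
}

# Same sample layout as above: 6, 1 and 3 measurements per day
set.seed(123)
df <- data.frame(
  date = as.Date(c("2004/12/12", "2004/12/12", "2004/12/12", "2004/12/12",
                   "2004/12/12", "2004/12/12", "2005/01/01", "2005/01/02",
                   "2005/01/02", "2005/01/02")),
  air_temp = sample(x = seq(from = 20, to = 30, by = 0.01), size = 10),
  water_temp = sample(x = seq(from = 15, to = 20, by = 0.01), size = 10))

# The unnamed data frame returned by cor_once() becomes the columns r, pval, n
out <- df %>%
  group_by(date) %>%
  summarise(cor_once(air_temp, water_temp))
print(out)
```

Keeping n in the result makes it easy to see afterwards which days were skipped and why. If such days should be dropped entirely instead of kept as NA, adding filter(n() > 2) after group_by() would remove them before the correlation is computed.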

I hope this helps!