I'm not able to get an even amount of data to run the cor.test

I'm supposed to run cor.test on the relationship between dep_delay and distance after removing the outliers. The expected result is shown first, followed by the code I initially used to remove the outliers.

t = -14.451, df = 326677, cor =  -0.02527463

Here's what I've done and the errors:

delay_thresh = quantile(flights$dep_delay, p=c(0.003, 0.997), na.rm=T)
dist_thresh = quantile(flights$distance,p=c(0.003, 0.997), na.rm=T)
                  
Q1a <- which(flights$dep_delay>delay_thresh | flights$dep_delay<delay_thresh)
Q1b <- which(flights$distance>dist_thresh | flights$distance<dist_thresh)
Q1b <- na.omit(Q1b) 
Q1a <- na.omit(Q1a)

Here's what I've tried:

cor.test(flights$dep_delay ~ flights$distance)
Error in cor.test.formula(flights$dep_delay ~ flights$distance) : 
  'formula' missing or invalid

cor.test(formula = dep_delay~distance, flights)
Error in cor.test.default(formula = dep_delay ~ distance, flights) : 
  'x' must be a numeric vector

cor.test(delay_thresh, dist_thresh)
Error in cor.test.default(delay_thresh, dist_thresh) : 
  not enough finite observations

At one point I tried using indices and got this:

>       indices = union(
        which(flights$dep_delay>delay_thresh[1] & flights$dep_delay<delay_thresh[2]),
        which(flights$distance>dist_thresh[1] & flights$distance<dist_thresh[2]))
> Q2 <- cor.test(flights$dep_delay[indices], flights$distance[indices])
> Q2

Pearson's product-moment correlation

data:  flights$dep_delay[indices] and flights$distance[indices]
t = -13.897, df = 328459, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02765802 -0.02082235
sample estimates:
        cor 
-0.02424047

So it gave results, but essentially on the original dataset rather than the filtered one, so it is not the desired outcome. So then I tried:

Q2 <- cor.test(delay_thresh, dist_thresh, method = 'pearson') 
Error in cor.test.default(delay_thresh, dist_thresh, method = "pearson") : 
  not enough finite observations

So then I went back to the top and did away with the indices:

Q2a <- cor.test(Q1a, Q1b, method="pearson")
Error in cor.test.default(Q1a, Q1b, method = "pearson") : 
  not enough finite observations

Q2a <- cor.test(Q1a, Q1b)
Error in cor.test.default(Q1a, Q1b) : 
  'x' and 'y' must have the same length

Q2a <- cor.test((Q1a ~ Q1b), drop.unused.levels = TRUE)
Error in cor.test.formula((Q1a ~ Q1b), drop.unused.levels = TRUE) : 
  'formula' missing or invalid

Q2 <- cor.test(delay_thresh, dist_thresh, use="pairwise.complete") 
Error in cor.test.default(delay_thresh, dist_thresh, use = "pairwise.complete") : 
  not enough finite observations

Any help is greatly appreciated. Like I said, the indices work, but not with the desired results, so I'm pretty sure there's something simple I'm overlooking. I've been researching this for a couple of days now and still can't pinpoint it. I can't upload the dataset because it isn't local, but it is nycflights13, found here: https://nycflights13.tidyverse.org/

CodePudding user response:

If I understand right, you're just trying to run cor.test on the nycflights13 dataset without the outliers you've specified. I think the biggest thing is that you need to exclude the outliers in both variables together as pairs (dep_delay and distance) instead of excluding them in each variable alone:

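library(data.table)
library(nycflights13)
# delay_thresh and dist_thresh are the quantile thresholds computed in the question.
# Keep only rows where BOTH dep_delay and distance fall strictly inside their thresholds.
nycflights13_DT <- as.data.table(flights)
nycflights13_clean <- nycflights13_DT[dep_delay > delay_thresh[[1]] &
                                      dep_delay < delay_thresh[[2]] &
                                      distance > dist_thresh[[1]] &
                                      distance < dist_thresh[[2]]]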

Then they'll have the same length and you can run the cor.test without errors:

cor.test(nycflights13_clean$dep_delay, nycflights13_clean$distance)

Pearson's product-moment correlation

data:  nycflights13_clean$dep_delay and nycflights13_clean$distance
t = -13.647, df = 316421, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02773612 -0.02077163
sample estimates:
        cor 
-0.02425417
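
If you'd rather stay in base R, here is a minimal sketch of the same paired filtering. It runs on a small synthetic data frame (toy is a made-up stand-in, not the real flights data), so the numbers it produces have nothing to do with the output above. It also shows two things that may explain some of the errors in the question: union() keeps a row that passes either filter, whereas intersect() (or a single logical mask joined with &) keeps only rows that pass both, and the formula interface of cor.test expects both variables on the right-hand side of the ~ together with a data argument.

```r
# Synthetic stand-in for flights (assumption: NOT the real nycflights13 data).
set.seed(1)
toy <- data.frame(
  dep_delay = c(rnorm(995, mean = 10, sd = 30), NA, 900, 1000, -300, NA),
  distance  = c(runif(995, min = 100, max = 2500), 5000, NA, 4900, 20, 1000)
)

# Same quantile thresholds as in the question, computed on the toy data.
delay_thresh <- quantile(toy$dep_delay, probs = c(0.003, 0.997), na.rm = TRUE)
dist_thresh  <- quantile(toy$distance,  probs = c(0.003, 0.997), na.rm = TRUE)

# One logical mask: a row is kept only if BOTH values are present and
# strictly inside their thresholds. The !is.na() terms turn NA into FALSE.
keep <- with(toy,
  !is.na(dep_delay) & !is.na(distance) &
  dep_delay > delay_thresh[[1]] & dep_delay < delay_thresh[[2]] &
  distance  > dist_thresh[[1]]  & distance  < dist_thresh[[2]]
)
toy_clean <- toy[keep, ]

# Index-based version, to compare union() with intersect().
in_delay <- which(toy$dep_delay > delay_thresh[[1]] & toy$dep_delay < delay_thresh[[2]])
in_dist  <- which(toy$distance  > dist_thresh[[1]]  & toy$distance  < dist_thresh[[2]])
n_either <- length(union(in_delay, in_dist))      # rows passing EITHER filter
n_both   <- length(intersect(in_delay, in_dist))  # rows passing BOTH filters

# Formula interface: both variables on the right-hand side, plus data =.
res_formula <- cor.test(~ dep_delay + distance, data = toy_clean)

# Vector interface: two equal-length numeric vectors from the same rows.
res_vectors <- cor.test(toy_clean$dep_delay, toy_clean$distance)
```

Because both vectors come from the same filtered rows, they are guaranteed to have the same length, and the degrees of freedom reported by cor.test equal the number of retained rows minus 2.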