How to select rows where values are present in only 2 columns from a data frame-CodePudding

I have a long data frame with a lot of NA's.

I would like to find the ratio of the value of Column A and Column D. How do I only select the rows where there's data present in both Column A and Column D?

As an example, in this image I want to only compare the values of rows 2,4,6 for Columns A and D.

I can use na.omit on each row, but I'm not sure how to ask R to find the rows where data is present in both columns.

CodePudding user response：

Like so:

library(tidyverse)

df <- data.frame(col1=c(1, NA, 7), col2=c(NA, 2, 5))

#   col1 col2
# 1    1   NA
# 2   NA    2
# 3    7    5



df %>% 
  filter(!is.na(col1) & !is.na(col2))
# 
#   col1 col2
# 1    7    5

CodePudding user response：

Here are some options using base R.

Using is.na():

df[!is.na(df$A) & !is.na(df$D), ]
#     A   B     C   D
# 1 432  15 54654 432
# 2 321  NA  4534 324
# 3 123 432  6543  75

Using complete.cases():

df[complete.cases(df$A, df$D), ]
#     A   B     C   D
# 1 432  15 54654 432
# 2 321  NA  4534 324
# 3 123 432  6543  75

Using subset():

subset(df, complete.cases(A, D))
#     A   B     C   D
# 1 432  15 54654 432
# 2 321  NA  4534 324
# 3 123 432  6543  75

Here are a couple of other options using the tidyverse.

Using drop_na() from tidyr:

library(dplyr)
library(tidyr)

df %>% 
  drop_na(A, D)
#     A   B     C   D
# 1 432  15 54654 432
# 2 321  NA  4534 324
# 3 123 432  6543  75

Using complete.cases() with filter() from dplyr (this one is actually my preference for readability):

library(dplyr)

df %>% 
  filter(complete.cases(A, D))
#     A   B     C   D
# 1 432  15 54654 432
# 2 321  NA  4534 324
# 3 123 432  6543  75

Data:

df <- data.frame(A = c(23, 432, NA, 321, NA, 123),
                 B = c(NA, 15, NA, NA, 32, 432),
                 C = c(NA, 54654, NA, 4534, NA, 6543),
                 D = c(NA, 432, 654, 324, 643, 75))

CodePudding user response：

To only get the values you can simply:

(df$A / df$D)[!is.na(df$A / df$D)]
[1] 1.0000000 0.9907407 1.6400000

This puts the values as an additional column:

cbind( df, df$A / df$D )
    A   B     C   D df$A/df$D
1  23  NA    NA  NA        NA
2 432  15 54654 432 1.0000000
3  NA  NA    NA 654        NA
4 321  NA  4534 324 0.9907407
5  NA  32    NA 643        NA
6 123 432  6543  75 1.6400000