Checking if columns in dataframe are "paired"

Time:11-20

I have a very long data frame (~10,000 rows), in which two of the columns look something like this.

  A     B
  1   5.5
  1   5.5
  2   201
  9    18
  9    18
  2   201
  9    18
...   ...

Just from skimming the data, the two columns seem to be "paired" together, but is there a way to check this explicitly?

CodePudding user response:

If you run this, you will see how many unique values of B there are for each value of A:

tapply(dat$B, dat$A, function(x) length(unique(x)))

So if the max of this vector is 1, then no value of A has more than one corresponding value of B.
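As a rough sketch of how that check could be turned into a single TRUE/FALSE answer, the snippet below rebuilds the sample data from the question under the name `dat` and tests both directions. The variable names (`n_b_per_a`, `a_determines_b`, and so on) are made up for illustration, and checking the reverse direction (B to A) is an assumption about what "paired" should mean here.

```r
# Sample data mirroring the question; `dat` is the name used in the answer above
dat <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

# Number of distinct B values for each value of A
n_b_per_a <- tapply(dat$B, dat$A, function(x) length(unique(x)))

# TRUE when every value of A maps to exactly one value of B
a_determines_b <- all(n_b_per_a == 1)

# The same check in the other direction (assumption: "paired" means one-to-one)
n_a_per_b <- tapply(dat$A, dat$B, function(x) length(unique(x)))
b_determines_a <- all(n_a_per_b == 1)

# TRUE only if the columns are paired in both directions
a_determines_b && b_determines_a
```

If only the A to B direction matters, the second half can be dropped.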

CodePudding user response:

You want to know if value x in column A always means value y in column B? Let's group by A and count the distinct values in B:

library(dplyr)  # provides %>%, group_by(), distinct(), summarize()

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())

# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        1
2     2        1
3     9        1

If we now alter the data frame so that the pairing no longer holds:

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())

# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        2
2     2        1
3     9        1

Observe the increased count for group 1. Since you have around 10,000 rows, you won't want to inspect this output by eye; what remains is to check whether any group has n_unique > 1, for instance by adding filter(n_unique > 1) to the pipeline.
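One way that final check might look is sketched below, reusing the altered data frame from above. It swaps the `distinct()` plus `n()` steps for dplyr's `n_distinct()`, which should give the same counts; the name `violations` is made up for illustration.

```r
library(dplyr)

# Altered data from above: A = 1 appears with both 5.5 and 5.4
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

# Keep only the values of A that occur with more than one distinct B
violations <- df %>%
  group_by(A) %>%
  summarize(n_unique = n_distinct(B)) %>%
  filter(n_unique > 1)

violations

# TRUE when the columns are paired (no violating groups remain)
nrow(violations) == 0
```

An empty `violations` table means every value of A has a single corresponding B; otherwise its rows show which values of A break the pairing.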
