Home > Software design >  how to find the columns with the most similar values
how to find the columns with the most similar values

Time:12-02

Just starting to use R and am feeling a bit confused. Suppose I have three columns

data = data.frame(id=c(101, 102, 103),column1=c(2, 4, 9), 
                  column2=c(3, 4, 2), column3=c(5, 15, 7))

How can I create a new column (e.g., colmean) that is the mean of the two columns closest in value? I thought about doing a bunch of ifelse statements, but that seemed unnecessarily messy.

In this case, for instance, colmean=c(2.5, 4, 8).

CodePudding user response:

Here is a version with a loop:

data = data.frame(id=c(101, 102, 103),column1=c(2, 4, 9), 
  column2=c(3, 4, 2), column3=c(5, 15, 7))


data$colmean <- NaN # set up empty column for results
for(i in seq(nrow(data))){
  data.i <- data[i,-1] # get ith row
  d <- as.matrix(dist(c(data.i))) # get distances between values
  diag(d) <- NaN # replace diagonal of distance matrix with NaN
  hit <- which.min(d) # identify value of lowest distance
  pos <- c(row(d)[hit], col(d)[hit]) # get the position (i.e. the values that are closest)
  data$colmean[i] <- mean(unlist(data.i[pos])) # calculate mean
}

data
#    id column1 column2 column3 colmean
# 1 101       2       3       5     2.5
# 2 102       4       4      15     4.0
# 3 103       9       2       7     8.0

CodePudding user response:

Borrowing the function findClosest() created here by @Cole, we can do the following,

findClosest <- function(x, n) {
     x <- sort(x)
     x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
 }


colMeans(apply(data[-1], 1, function(i)findClosest(i, 2)))
#[1] 2.5 4.0 8.0

CodePudding user response:

Here's a self-contained solution, based on the tidyverse, that is independent of the number of columns to be compared.

library(tidyverse)

data %>%
  # Add the means of smallest pairwise differences to the input data
  bind_cols(
    data %>% 
      # Make the data tidy (and hence independent of the number of "column"s)
      pivot_longer(starts_with("column")) %>% 
      # For each id/row (replace with rowwise() if appropriate)
      group_by(id) %>% 
      group_map(
        function(.x, .y) {
          # Form a tibble of all pairwise ciombinations of values
          as_tibble(t(combn(.x$value, 2))) %>% 
            # Calculate pairwise differences
            mutate(difference = abs(V1 - V2)) %>% 
            # Find the smallest pairwise difference
            arrange(difference) %>% 
            head(1) %>% 
            # Calculate the mean of this pair
            pivot_longer(starts_with("V")) %>% 
            summarise(colmean=mean(value))
        }
      ) %>% 
      # Convert list of values to column
      bind_rows()
  )

   id column1 column2 column3 colmean
1 101       2       3       5     2.5
2 102       4       4      15     4.0
3 103       9       2       7     8.0
  • Related