Find rows whose column values are closest to a specific row in a data.frame

Imagine we have one row in the data below as our reference (row # 116).

How can I find other rows in this data whose column values are the same as, or closest to, the column values of this reference row? (If a column is numeric, let's say anything within +/- 3 is an acceptable match.)

For example, if the column value for variable prof in the reference row is beginner, we want to find another row whose value for prof is also beginner.

Or if the column value for variable study_length in the reference row is 5, we want to find another row whose value for study_length is within 5 +/- 3, and so on.

Is it possible to set up a function to do this in R?

data <- read.csv("https://raw.githubusercontent.com/hkil/m/master/wcf.csv")[-c(2:6,12,17)]

reference <- data[116,]

############################# YOUR POSSIBLE ANSWER:

# first argument is named x: a default of `data = data` would refer to itself and make foo() fail
foo <- function(x = data, reference_row = 116, tolerance_for_numerics = 3) {

# your solution


}

# Example of use:

foo()

CodePudding user response：

Here is a solution.

foo <- function(x = data, reference_row = 116, tolerance_for_numerics = 3) {
  # which columns are numeric
  i <- sapply(x, is.numeric)
  reference <- x[reference_row, ]
  # numeric columns must be within reference +/- tolerance
  num <- mapply(\(y, ref, tol) {
    y >= ref - tol & y <= ref + tol
  }, x[i], reference[i], MoreArgs = list(tol = tolerance_for_numerics))
  # other columns must match exactly (?)
  other <- mapply(\(y, ref) {
    y == ref
  }, x[!i], reference[!i])
  # keep rows where every column test is TRUE
  which(rowSums(cbind(other, num)) == ncol(x))
}

data <- read.csv("https://raw.githubusercontent.com/hkil/m/master/wcf.csv")[-c(2:6,12,17)]

# Example of use:
foo()
#> [1] 112 114 116

Created on 2022-08-13 by the reprex package (v2.0.1)
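
To check the matching logic without downloading the CSV, here is a minimal sketch on a made-up data frame. The names `toy`, `n_items` and `match_rows` are hypothetical and the values are invented for illustration; they are not taken from wcf.csv. It folds the numeric and non-numeric tests into a single `mapply()` call, which is a small variation on the function above rather than the same code.

```r
# Made-up data for illustration only; these values are not from wcf.csv.
toy <- data.frame(
  prof         = c("beginner", "beginner", "advanced", "beginner", "beginner"),
  study_length = c(5, 7, 5, 9, NA),
  n_items      = c(10, 12, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact match for non-numeric columns,
# |value - reference| <= tol for numeric columns.
match_rows <- function(x, reference_row, tol = 3) {
  ref <- x[reference_row, ]
  # one logical column per variable, one row per observation
  ok <- mapply(function(y, r) {
    if (is.numeric(y)) abs(y - r) <= tol else y == r
  }, x, ref)
  # a row matches only if every column test is TRUE;
  # rows with an NA in any test give NA here and are dropped by which()
  which(rowSums(ok) == ncol(x))
}

match_rows(toy, reference_row = 1)
#> [1] 1 2
```

Row 4 is excluded because its study_length of 9 is more than 3 away from 5, and row 5 is excluded because its study_length is missing. If rows with missing values should still count as matches, the comparison would need explicit NA handling.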

CodePudding user response：

Could this be an option for you? I am not sure:

library(dplyr)

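foo <- function(data = data, reference_row = 116, tolerance_for_numerics = 3) {
  
  # all conditions sit in one filter() call, so reference_row still
  # indexes the original, unfiltered rows
  data %>% 
    filter(study == study[reference_row],
           study_length >= study_length[reference_row] - tolerance_for_numerics,
           study_length <= study_length[reference_row] + tolerance_for_numerics) 
}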

foo(data, 116, 3)

         study     prof age_grp wcf_scope wcf_type   err_type lang_setting    res_setting study_length        des_type
1 Nemati et al beginner    teen   focused   direct verb_tense           FL lang institute            5 true_experiment
2 Nemati et al beginner    teen   focused location verb_tense           FL lang institute            5 true_experiment
3 Nemati et al beginner    teen   focused   direct verb_tense           FL lang institute            5 true_experiment
4 Nemati et al beginner    teen   focused location verb_tense           FL lang institute            5 true_experiment
5 Nemati et al beginner    teen   focused   direct verb_tense           FL lang institute            5 true_experiment
6 Nemati et al beginner    teen   focused location verb_tense           FL lang institute            5 true_experiment
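
The function above only restricts on study and study_length. A hedged sketch of how the same idea might be extended to every column while keeping the original row numbers is shown below. The names `match_reference`, `row_id` and `toy` are made up for illustration, and the sketch assumes the data has no columns called `col`, `target` or `tol`, since those would clash with the local variables inside `filter()`.

```r
library(dplyr)

# Made-up data for illustration only; these values are not from wcf.csv.
toy <- data.frame(
  study        = c("A", "A", "A", "B"),
  prof         = c("beginner", "beginner", "advanced", "beginner"),
  study_length = c(5, 8, 5, 5)
)

# Hypothetical helper: one filter() per column, always comparing against a
# reference row that was extracted before any filtering took place.
match_reference <- function(df, reference_row = 1, tol = 3) {
  ref <- df[reference_row, ]
  out <- mutate(df, row_id = row_number())  # remember original positions
  for (col in names(df)) {
    target <- ref[[col]]
    out <- if (is.numeric(target)) {
      # numeric column: keep rows within target +/- tol
      filter(out, abs(.data[[col]] - target) <= tol)
    } else {
      # any other column: keep exact matches only
      filter(out, .data[[col]] == target)
    }
  }
  out
}

match_reference(toy, reference_row = 1)
```

On this toy data the call keeps rows 1 and 2: row 3 differs on prof and row 4 differs on study. As with any `filter()` condition, rows where a comparison evaluates to NA are dropped.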