r average of distance by Id-CodePudding

I have a dataset with two groups of subjects, Group A, Group B like this.

 Id  Group  Age
 1   A      17
 2   A      14
 3   A      10
 4   A      17
 5   A      12
 6   A      6
 7   A      18
 8   A      7

 9   B      18
 9   B      13
10   B      6
10   B      12
11   B      16
11   B      17
12   B      11
12   B      18

The subjects in Group A are unique. One row per subject. The subjects in Group B are not unique. There are two or in some cases 3 rows of observations per subject in Group B, example ID 9, 10, 10 etc.

What I am trying to do is a) estimate the average distance of subjects in GroupB to everyone in Group A. Using Age to estimate the distance.

b) estimate the distance of subjects in GroupB to the mode of subjects in Group A. Using Age to estimate the mode in Group A and Age in Group B to estimate the distance from the mode.

Expecting a dataset like this.

  ID    Group   Age   AvDistance   DistanceToMedian
   1    A       17    NA           NA
   2    A       14    NA           NA
   3    A       10    NA           NA
   4    A       17    NA           NA
   5    A       12    NA           NA
   6    A       6     NA           NA
   7    A       18    NA           NA
   8    A       7     NA           NA

   9    B      18    6             2.11
   9    B      13    3.875         2.88
  10    B      6     ...            ...
  10    B      12    ...            ...
  11    B      16    ...            ...
  11    B      17    ...            ...
  12    B      11    ...            ...
  12    B      18    ...            ...

I can do this manually, any suggestions on how to make this more efficient is much appreciated. Thanks.

# Estimate Average Distance of Id in Group B to all subjects in Group A
(sqrt((17 - 18)^2)  sqrt((14-18)^2)  sqrt((10-18)^2)   sqrt((17-18)^2)   sqrt((12-18)^2)   sqrt((6-18)^2)   sqrt((18-18)^2)   sqrt((7-18)^2))/8 = 6

(sqrt((17 - 13)^2)  sqrt((14-13)^2)  sqrt((10 - 13)^2)   sqrt((17-13)^2)   sqrt((12-13)^2)   sqrt((6-13)^2)   sqrt((18-13)^2)   sqrt((7-13)^2))/8 = 3.875

estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

# Estimate Mode for Age in Group A
x <- c(17, 14, 10, 17, 12, 6, 18, 7)
estimate_mode(x)

m1 <- estimate_mode(x)

# Estimate Mode of 
sqrt((18 - m1)^2) = 2.11
sqrt((13 - m1)^2) =2.88

CodePudding user response：

This will be easier with a unique row ID, so I'll create one:

library(dplyr)
library(tibble)
df = df %>%
  mutate(rownum = paste0("row", row_number()))

ages = setNames(df$Age, df$rownum)

## make a distance matrix
dist = outer(ages[df$Group == "B"], ages[df$Group == "A"], FUN = \(x, y) abs(x - y))

## calculate average distances
av_dist = data.frame(AvDist = rowMeans(dist)) %>% rownames_to_column("rownum")

## calculate median age for A
med_a = median(ages[df$Group == "A"])

## add back to original data
df %>%
  left_join(av_dist, by = "rownum") %>%
  mutate(DistanceToMedian = ifelse(Group == "B", abs(Age - med_a), NA))
#    Id Group Age rownum AvDist DistanceToMedian
# 1   1     A  17   row1     NA               NA
# 2   2     A  14   row2     NA               NA
# 3   3     A  10   row3     NA               NA
# 4   4     A  17   row4     NA               NA
# 5   5     A  12   row5     NA               NA
# 6   6     A   6   row6     NA               NA
# 7   7     A  18   row7     NA               NA
# 8   8     A   7   row8     NA               NA
# 9   9     B  18   row9  5.375                5
# 10  9     B  13  row10  3.875                0
# 11 10     B   6  row11  6.625                7
# 12 10     B  12  row12  3.875                1
# 13 11     B  16  row13  4.375                3
# 14 11     B  17  row14  4.625                4
# 15 12     B  11  row15  4.125                2
# 16 12     B  18  row16  5.375                5

I used median, not mode, because I was looking at your column names, but you can easily swap in your mode instead.

Using this sample data:

df = read.table(text = 'Id  Group  Age
 1   A      17
 2   A      14
 3   A      10
 4   A      17
 5   A      12
 6   A      6
 7   A      18
 8   A      7
 9   B      18
 9   B      13
10   B      6
10   B      12
11   B      16
11   B      17
12   B      11
12   B      18', header = T)