Home > OS >  How to mean center a column with varying duplicates?
How to mean center a column with varying duplicates?

Time:02-21

Suppose we have a dataframe df with a structure like a the following:

df <- read.table(header = TRUE, text = "
subj   age   trial    
 A       9      1       
 A       9      2      
 A       9      3       
 B      10      1
 B      10      2
 C       8      1       
 C       8      2
 C       8      3
 C       8      4")      

How to elegantly remove the mean age of all subjects from each subject so that the resulting dataframe would look like below?

subj   age   trial    
 A       0      1       
 A       0      2      
 A       0      3       
 B       1      1
 B       1      2
 C      -1      1       
 C      -1      2
 C      -1      3
 C      -1      4   

P.S. - Notice that each subject (first column) has multiple rows, and thus the associated age (second column) is repeated. The number of repetitions depends on the number of trials (third column) and is different across subjects. Thus, the average average across all subjects should be computed without those duplicates.

CodePudding user response:

Use the mutate function as below. However, note that the mean of your age variable is 8.777778. So you won't get exactly output as you expect.

library(tidyverse)

df = read.table(
  header = TRUE,text="
subj   age   trial    
 A       9      1       
 A       9      2      
 A       9      3       
 B      10      1
 B      10      2
 C       8      1       
 C       8      2
 C       8      3
 C       8      4    
") %>% as_tibble()

df %>% mutate(age = age - mean(age))

output

# A tibble: 9 x 3
  subj     age trial
  <chr>  <dbl> <int>
1 A      0.222     1
2 A      0.222     2
3 A      0.222     3
4 B      1.22      1
5 B      1.22      2
6 C     -0.778     1
7 C     -0.778     2
8 C     -0.778     3
9 C     -0.778     4

Even if you use rounding

df %>% mutate(age = age - round(mean(age)))

output

 A tibble: 9 x 3
  subj    age trial
  <chr> <dbl> <int>
1 A         0     1
2 A         0     2
3 A         0     3
4 B         1     1
5 B         1     2
6 C        -1     1
7 C        -1     2
8 C        -1     3
9 C        -1     4

If you absolutely must get the result as shown in the question, you must subtract 1 from the results <0.

df %>% mutate(age = age - round(mean(age)),
              age = ifelse(age<0, age-1, age))

output

# A tibble: 9 x 3
  subj    age trial
  <chr> <dbl> <int>
1 A         0     1
2 A         0     2
3 A         0     3
4 B         1     1
5 B         1     2
6 C        -2     1
7 C        -2     2
8 C        -2     3
9 C        -2     4

Update 1

If you don't want to use information from the trial variable, you can do something like this:

mage = df %>% 
  distinct(subj, age) %>% 
  pull(age) %>% mean()

df %>% mutate(age = age - mage)

or like this:

mage = df %>% 
  group_by(subj) %>% 
  summarise(age = min(age)) %>% 
  pull(age) %>% mean()

df %>% mutate(age = age - mage)

output

# A tibble: 9 x 3
  subj    age trial
  <chr> <dbl> <int>
1 A         0     1
2 A         0     2
3 A         0     3
4 B         1     1
5 B         1     2
6 C        -1     1
7 C        -1     2
8 C        -1     3
9 C        -1     4

CodePudding user response:

library(dplyr)
df %>%
  mutate(mean_age = mean(age[trial == 1], na.rm = TRUE)) %>%
  mutate(age = age - mean_age)
# A tibble: 9 × 4
  subj    age trial mean_age
  <chr> <dbl> <int>    <dbl>
1 A         0     1        9
2 A         0     2        9
3 A         0     3        9
4 B         1     1        9
5 B         1     2        9
6 C        -1     1        9
7 C        -1     2        9
8 C        -1     3        9
9 C        -1     4        9

EDIT: With a random trial per subj:

df %>%
  group_by(subj) %>%
  mutate(trial_sample = sample(trial, 1)) %>%
  ungroup() %>%
  mutate(mean_age = mean(age[trial_sample], na.rm = TRUE)) %>%
  mutate(age = age - mean_age)

Data (thanks goes to @Marek Fiolka):

df = read.table(
  header = TRUE,text="
subj   age   trial    
 A       9      1       
 A       9      2      
 A       9      3       
 B      10      1
 B      10      2
 C       8      1       
 C       8      2
 C       8      3
 C       8      4    
") %>% as_tibble()
  • Related