Performing row-wise normalization of data in R

There is a simple operation I would like to do which is proving not to be so simple. I have a time series data set, and I would like to perform row-wise normalization: for each observation, (x - mean(row)) / stdev(row).

This was one attempt, but to no avail. I've also replaced NA values with 0, so that doesn't seem to be the issue.

norm <- for (i in 1:nrow(clusterdatairaq2)){
  for(j in 2:ncol(clusterdatairaq2)) {       
    clusterdatairaq2[i,j] <- (clusterdatairaq2[i,j] - mean(clusterdatairaq2[i,]))/ sd(clusterdatairaq2[i,])
  }
}

Thanks in advance for any help!!

Here is the data and what it looks like

CodePudding user response：

set.seed(42)
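# name the columns A..E so the data frame matches the printed output below
mtx <- matrix(sample(99, size=6*5, replace=TRUE), nrow=6, dimnames=list(NULL, LETTERS[1:5]))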
df <- cbind(data.frame(id = letters[1:6]), mtx)
df
#   id  A  B  C  D  E
# 1  a 49 47 26 95 58
# 2  b 65 24  3  5 97
# 3  c 25 71 41 84 42
# 4  d 74 89 89 34 24
# 5  e 18 37 27 92 30
# 6  f 49 20 36  3 43
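# normalize each row of the numeric columns (everything except id);
# apply() over rows returns one column per input row, so t() flips it back
out <- t(apply(df[,-1], 1, function(X) (X-mean(X)) / sd(X)))
colnames(out) <- paste0(colnames(df[,-1]), "_norm")
df <- cbind(df, out)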
df
#   id  A  B  C  D  E     A_norm     B_norm     C_norm     D_norm     E_norm
# 1  a 49 47 26 95 58 -0.2376354 -0.3168472 -1.1485711  1.5842361  0.1188177
# 2  b 65 24  3  5 97  0.6393668 -0.3611690 -0.8736386 -0.8248320  1.4202728
# 3  c 25 71 41 84 42 -1.1427812  0.7618541 -0.4802994  1.3001207 -0.4388942
# 4  d 74 89 89 34 24  0.3878036  0.8725581  0.8725581 -0.9048751 -1.2280448
# 5  e 18 37 27 92 30 -0.7749098 -0.1291516 -0.4690243  1.7401483 -0.3670625
# 6  f 49 20 36  3 43  1.0067737 -0.5462283  0.3106004 -1.4566088  0.6854630
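
As for why the loop in the question fails: `mean()` on a one-row slice of a data frame returns NA with a warning rather than a number, the slice `clusterdatairaq2[i,]` still includes the first column even though the inner loop starts at column 2, each cell is overwritten before the rest of its row has been read, and `norm <- for (...)` always assigns NULL. Below is a minimal sketch of a corrected loop. It assumes the first column is an identifier and the rest are numeric; `dat` and its values are made up for illustration and are not the question's data.

```r
# Hypothetical stand-in for the question's data: an id column plus numeric columns.
dat <- data.frame(id = c("a", "b", "c"),
                  x1 = c(1, 10, 5),
                  x2 = c(2, 20, 5),
                  x3 = c(6, 60, 8))

num_cols <- 2:ncol(dat)   # skip the non-numeric id column
normed <- dat             # write into a copy, always read from the original

for (i in seq_len(nrow(dat))) {
  # unlist() turns the one-row data frame slice into a plain numeric vector
  row_vals <- unlist(dat[i, num_cols], use.names = FALSE)
  # compute the row statistics once, before anything is overwritten
  row_mean <- mean(row_vals)
  row_sd   <- sd(row_vals)
  normed[i, num_cols] <- (row_vals - row_mean) / row_sd
}

normed
```

The `apply()` version above does the same thing without the explicit loop and is usually the more idiomatic choice.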

CodePudding user response：

Assuming we have a data frame like this:

library(dplyr)
df = tibble(
    Destination = c("Belgium", "Bulgaria", "Czechia"),
    `Jan 2008` = sample(1:1000, size=3),
    `Feb 2008` = sample(1:1000, size=3),
    `Mar 2008` = sample(1:1000, size=3)
)
df

# A tibble: 3 × 4
  Destination `Jan 2008` `Feb 2008` `Mar 2008`
  <chr>            <int>      <int>      <int>
1 Belgium            811        299         31
2 Bulgaria           454        922        421
3 Czechia            638        709        940

The tidyverse way to do this (which I think is better than base R here) is to pivot to long form and scale within each Destination:

library(dplyr)
library(tidyr)

scaled = df %>%
    pivot_longer(`Jan 2008`:`Mar 2008`) %>%
    group_by(Destination) %>%
    mutate(value = as.numeric(scale(value))) %>%
    ungroup()
scaled

  Destination name      value
  <chr>       <chr>     <dbl>
1 Belgium     Jan 2008  1.09 
2 Belgium     Feb 2008 -0.205
3 Belgium     Mar 2008 -0.881
4 Bulgaria    Jan 2008 -0.517
5 Bulgaria    Feb 2008  1.15 
6 Bulgaria    Mar 2008 -0.635
7 Czechia     Jan 2008 -0.787
8 Czechia     Feb 2008 -0.338
9 Czechia     Mar 2008  1.13

Now, you could pivot it back to the original form, but there's not much point, because analysis will be much easier in long form:

scaled %>% pivot_wider(names_from=name, values_from=value)

# A tibble: 3 × 4
  Destination `Jan 2008` `Feb 2008` `Mar 2008`
  <chr>            <dbl>      <dbl>      <dbl>
1 Belgium          1.09      -0.205     -0.881
2 Bulgaria        -0.517      1.15      -0.635
3 Czechia         -0.787     -0.338      1.13
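
As one illustration of what long form makes easy, a per-destination summary is a single `group_by()` plus `summarise()`. The sketch below uses it to check that every original row ended up with mean 0 and standard deviation 1. The numbers are copied from the printed tibble above instead of being re-sampled, so that the result is reproducible; `check` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Fixed values taken from the printed example above (no random sampling).
df <- tibble(
  Destination = c("Belgium", "Bulgaria", "Czechia"),
  `Jan 2008` = c(811, 454, 638),
  `Feb 2008` = c(299, 922, 709),
  `Mar 2008` = c(31, 421, 940)
)

# Same pipeline as above: long form, then scale within each Destination.
scaled <- df %>%
  pivot_longer(-Destination) %>%
  group_by(Destination) %>%
  mutate(value = as.numeric(scale(value))) %>%
  ungroup()

# Per-destination summary computed directly on the long data.
check <- scaled %>%
  group_by(Destination) %>%
  summarise(mean = mean(value), sd = sd(value))

check
```

Up to floating-point error, `mean` should come out as 0 and `sd` as 1 for every destination.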

CodePudding user response：

I used the mtcars dataset as an example:

library(tidyverse)

mtcars %>%  # the dataset
  select(disp) %>%  # disp is the column that we want to normalize, just as an example
  mutate(disp2 = (disp - mean(disp)) / sd(disp))  # disp2 is the name of the new, normalized column
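
Note that this standardizes a column, not a row. The manual formula is what base R's `scale()` computes with its default arguments, and since `scale()` works column by column, transposing before and after gives the row-wise version. A quick sketch to check both claims, using a few mtcars columns purely as sample numbers (the variable names are illustrative):

```r
# Column-wise: manual formula versus scale()
disp <- mtcars$disp
manual    <- (disp - mean(disp)) / sd(disp)
via_scale <- as.numeric(scale(disp))
all.equal(manual, via_scale)

# Row-wise: scale() standardizes columns, so transpose, scale, transpose back
m <- as.matrix(mtcars[1:5, c("mpg", "disp", "hp")])
by_apply <- t(apply(m, 1, function(x) (x - mean(x)) / sd(x)))
by_scale <- t(scale(t(m)))
all.equal(by_apply, by_scale, check.attributes = FALSE)
```

Both comparisons should report that the results agree.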

CodePudding user response：

A dplyr solution, re-using @Migwell's toy example (please provide a reproducible example in your question):

library(dplyr)
library(data.table)  # needed for data.table() below; it is not part of dplyr
df = data.table(
  Destination = c("Belgium", "Bulgaria", "Czechia"),
  `Jan 2008` = sample(1:1000, size=3),
  `Feb 2008` = sample(1:1000, size=3),
  `Mar 2008` = sample(1:1000, size=3))
> df
   Destination Jan 2008 Feb 2008 Mar 2008
1:     Belgium      443      114      628
2:    Bulgaria      755      801      493
3:     Czechia      123      512      517

You can use:

df2 <- df %>%
  select(`Jan 2008`:`Mar 2008`) %>%
  mutate(normJan2008 = (`Jan 2008` - rowMeans(., na.rm=TRUE)) / apply(., 1, sd, na.rm=TRUE))
> df2
   Jan 2008 Feb 2008 Mar 2008 normJan2008
1:      443      114      628   0.1843742
2:      755      801      493   0.4333577
3:      123      512      517  -1.1546299

And do this for every variable you need to normalize.
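
Rather than repeating the `mutate()` call once per column, all columns can be handled in one step. Here is a base R sketch, assuming the same layout (one label column, the rest numeric) and using the values printed above; `month_cols` and the `norm...` column names are illustrative choices.

```r
# Values copied from the printed example above.
df <- data.frame(
  Destination = c("Belgium", "Bulgaria", "Czechia"),
  `Jan 2008` = c(443, 755, 123),
  `Feb 2008` = c(114, 801, 512),
  `Mar 2008` = c(628, 493, 517),
  check.names = FALSE   # keep the spaces in the column names
)

month_cols <- setdiff(names(df), "Destination")
m <- as.matrix(df[month_cols])

row_mean <- rowMeans(m, na.rm = TRUE)
row_sd   <- apply(m, 1, sd, na.rm = TRUE)

# A length-nrow vector is recycled down each column of the matrix,
# so every row is centered and scaled by its own mean and sd.
normed <- (m - row_mean) / row_sd
colnames(normed) <- paste0("norm", gsub(" ", "", month_cols))

df2 <- cbind(df, normed)
df2
```

The `normJan2008` column should match the values shown above.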