How to create a new df with each "cell" mutated by a formula?-CodePudding

I have a dataset from which I am trying to get the relative abundances. I would like to be able to create a new df in which each "cell" has been transformed as such- new value=(original value)/(sum of column) My dataset has several hundred columns and rows.

df <- structure(list(`raw$MTmatch` = c("Abies", "Abies lasiocarpa OR Abies grandis", "Acer", "Acer OR Acer platanoides", "Agrostis mertensii OR Agrostis exarata OR Poa annua OR Agrostis scabra OR Agrostis stolonifera" ), comb_S026401.R1 = c(0L, 4713L, 0L, 0L, 0L), comb_S026404.R1 = c(0L, 485L, 0L, 0L, 0L), comb_S026406.R1 = c(0L, 5626L, 0L, 0L, 0L), comb_S026409.R1 = c(0L, 2020L, 0L, 0L, 0L), comb_S026412.R1 = c(0L, 331L, 0L, 0L, 0L), comb_S026413.R1 = c(0L, 1394L, 0L, 0L, 0L), comb_S026414.R1 = c(0L, 21L, 0L, 0L, 0L), comb_S026415.R1 = c(0L, 0L, 0L, 10L, 0L), comb_S026416.R1 = c(0L, 271L, 0L, 0L, 0L ), comb_S026419.R1 = c(0L, 0L, 0L, 0L, 0L), comb_S026421.R1 = c(0L, 4689L, 0L, 47L, 0L), comb_S026422.R1 = c(0L, 95L, 0L, 0L, 0L), comb_S026423.R1 = c(0L, 360L, 0L, 0L, 0L), comb_S026427.R1 = c(0L, 666L, 0L, 0L, 0L), comb_S026428.R1 = c(0L, 0L, 0L, 0L, 0L ), comb_S026429.R1 = c(0L, 21L, 0L, 0L, 0L), comb_S026430.R1 = c(0L, 322L, 0L, 0L, 0L), comb_S026431.R1 = c(0L, 1545L, 0L, 9L, 0L), comb_S026432.R1 = c(0L, 2878L, 0L, 0L, 0L), comb_S026433.R1 = c(0L, 908L, 0L, 0L, 0L), comb_S026434.R1 = c(0L, 0L, 0L, 0L, 0L ), comb_S026435.R1 = c(0L, 1961L, 83L, 0L, 0L), comb_S026438.R1 = c(0L, 1944L, 0L, 0L, 0L), comb_S026440.R1 = c(0L, 1955L, 0L, 0L, 0L), comb_S026444.R1 = c(0L, 3372L, 0L, 0L, 0L), comb_S026447.R1 = c(0L, 0L, 0L, 0L, 0L), comb_S026450.R1 = c(0L, 0L, 564L, 0L, 0L ), comb_S026451.R1 = c(0L, 0L, 0L, 0L, 0L), comb_S026453.R1 = c(0L, 2721L, 0L, 0L, 0L), comb_S026456.R1 = c(8L, 10357L, 0L, 0L, 0L), comb_S026457.R1 = c(0L, 89L, 0L, 0L, 0L), comb_S026458.R1 = c(0L, 36L, 0L, 0L, 0L), comb_S026461.R1 = c(0L, 650L, 0L, 0L, 0L ), comb_S026462.R1 = c(0L, 3645L, 0L, 0L, 0L), comb_S026463.R1 = c(0L, 22L, 0L, 0L, 0L), comb_S026464.R1 = c(0L, 0L, 764L, 0L, 0L ), comb_S026466.R1 = c(0L, 0L, 0L, 0L, 0L), comb_S026467.R1 = c(0L, 0L, 0L, 0L, 0L), comb_S026469.R1 = c(0L, 797L, 0L, 0L, 0L ), comb_S026470.R1 = c(0L, 30L, 0L, 0L, 0L), comb_S026471.R1 = c(0L, 6519L, 0L, 0L, 0L), comb_S026473.R1 = c(0L, 26L, 0L, 0L, 0L), comb_S026474.R1 = c(0L, 0L, 2035L, 11L, 0L), comb_S026476.R1 = c(0L, 1181L, 0L, 0L, 0L), comb_S026477.R1 = c(0L, 53L, 0L, 0L, 10L), comb_S026483.R1 = c(0L, 0L, 0L, 0L, 0L), comb_S026484.R1 = c(0L, 170L, 0L, 0L, 0L), comb_S026485.R1 = c(0L, 37L, 0L, 0L, 0L ), comb_S026488.R1 = c(0L, 73L, 0L, 0L, 0L), comb_S026489.R1 = c(0L, 25L, 0L, 17L, 0L), comb_S026490.R1 = c(0L, 19L, 0L, 0L, 0L ), comb_S026493.R1 = c(0L, 3353L, 0L, 0L, 0L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

is a small sample of my data.

I tried using various formulations of colSums. I tried flipping the rows and columns and turning it into a tibble

 rra<-t(df) rra2<-as_tibble(rra)

and then trying to get the sum of the rows using

rra3<-rra2%>% rowwise() 
%>% mutate(re = sum(c_across(.,))) %>% ungroup()

but this does not work because the top row is the species names (character vector). I do not want to eliminate this row. and even when I eventually figure out how to get the sum across rows, I still can't figure out how to use that sum to then calculate the relative abundance across rows.

CodePudding user response：

Using prop.table:

data.frame(df[, 1], prop.table(as.matrix(df[, -1]), margin = 2))

CodePudding user response：

This is really a simple task, I think it is better to go with base R. In this case, because of the one no-numeric variable, I go for some simple loops. Code below:

ind <- sapply(df, is.numeric)    

col.s <- numeric(NCOL(df))

for (i in seq(from=1, length.out=NCOL(df))) {
  col.s[i] <- if(!ind[i]) as.numeric(NA) else sum(df[, i])
}

for (i in seq(from=1, length.out=NCOL(df))) {
  if(ind[i]) df[, i] <- df[, i] / col.s[i] 
}

CodePudding user response：

It's old-fashioned, but I always think the best way to approach these problems is to tidy the data, that is, to reshape it so it is in long format with one data point per row.

Then, do whatever you want, and then reshape back.

 df %>%
    pivot_longer(-`raw$MTmatch`, names_to = 'column', values_to = 'value') %>%
    group_by(column) %>%
    mutate(value = value / (sum(value))) %>%
    pivot_wider(names_from = 'column', values_from = value)