Calculating frequency of values by column in R

Does anyone know how to replace the value of a cell with the frequency with which that value occurs in its column? I'm trying to turn a data frame full of breed labels and gene factors into a frequency chart (with an eye to later seeing whether animals that have common alleles for one gene tend to have common alleles for other genes, too). As an example, my initial data frame looks like this:

Breed    Gene A     Gene B    Gene C
Collie      3          5         8
Collie      5          7         2
Lab         3          3         1
Pug         3          7         8
Pug         3          7         9
Pug         4          4         9

And I'd like the result to look like this:

Breed    Gene A     Gene B    Gene C
2           4          1         2
2           1          3         1
1           4          1         1
3           4          3         2
3           4          3         2
3           1          1         2

I can see how to do this using a for loop (create a new data frame, loop over each column, loop over each row, and change each value to a counter that goes up by one whenever it encounters an equal value), but is there a simpler and more efficient apply or dplyr approach? The data set is large and I'm going to have to re-run this often, so I'm concerned nested for loops will be too slow.

CodePudding user response:

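Here's a base R option:

# Replace each element of x with the number of times its value occurs in x
replace_value_by_count <- function(x) ave(x, x, FUN = length)
# Assigning to df[] overwrites every column but keeps the data frame structure
df[] <- lapply(df, replace_value_by_count)
df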

#  Breed GeneA GeneB GeneC
#1     2     4     1     2
#2     2     1     3     1
#3     1     4     1     1
#4     3     4     3     2
#5     3     4     3     2
#6     3     1     1     2
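
To see what `ave()` is doing, it can help to run it on single columns first. This is only a minimal sketch using the `GeneA` and `Breed` values from the data below; the names `gene_a`, `breed`, and `count_by_value` are made up for this illustration. One detail that may matter on your real data: `ave(x, x, FUN = length)` writes the counts back into a copy of `x`, so a character column such as `Breed` comes back as character. Grouping the row positions instead (the `count_by_value` variant, which is my addition and not part of the answer above) should return integers whatever the input type.

```r
gene_a <- c(3L, 5L, 3L, 3L, 3L, 4L)
breed  <- c("Collie", "Collie", "Lab", "Pug", "Pug", "Pug")

# Each element is replaced by the size of the group of equal values it belongs to
counts_a <- ave(gene_a, gene_a, FUN = length)
counts_a
# 4 1 4 4 4 1

# On a character vector the counts are stored as character strings
breed_chr <- ave(breed, breed, FUN = length)
class(breed_chr)
# "character"

# Illustrative variant: count over the row positions so the result is integer
count_by_value <- function(x) ave(seq_along(x), x, FUN = length)
breed_int <- count_by_value(breed)
breed_int
# 2 2 1 3 3 3
```

If the counts feed into later arithmetic, the integer variant avoids a separate `as.integer()` step on the character columns.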

Since you have tagged dplyr, the same function also works with dplyr:

library(dplyr)
# Apply the counting function to every column
df <- df %>% mutate(across(everything(), replace_value_by_count))

data

df <- structure(list(Breed = c("Collie", "Collie", "Lab", "Pug", "Pug", 
"Pug"), GeneA = c(3L, 5L, 3L, 3L, 3L, 4L), GeneB = c(5L, 7L, 
3L, 7L, 7L, 4L), GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)), 
class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

We may use base R:

# Tabulate each column, then look up every value's count by name
df[] <- lapply(df, function(x) as.integer(table(x)[as.character(x)]))

-output

> df
  Breed GeneA GeneB GeneC
1     2     4     1     2
2     2     1     3     1
3     1     4     1     1
4     3     4     3     2
5     3     4     3     2
6     3     1     1     2
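
The `table()` lookup is easier to follow one step at a time on a single column. A small sketch using the `GeneC` values from the data below; `gene_c`, `tab`, and `counts_c` are names chosen just for this example. Wrapping the lookup in `as.integer()` drops the table attributes so the result is a plain integer vector. Note that `table()` ignores `NA` by default, so missing values would need separate handling.

```r
gene_c <- c(8L, 2L, 1L, 8L, 9L, 9L)

# table() counts each distinct value; its names are the values as strings
tab <- table(gene_c)
names(tab)
# "1" "2" "8" "9"

# Indexing by name fetches the count for every element, in the original order
counts_c <- as.integer(tab[as.character(gene_c)])
counts_c
# 2 1 1 2 2 2
```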

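Or using tidyverse:

library(dplyr)
# Wrap each column in a one-column tibble, add a per-value count, and keep only the counts
df %>%
    mutate(across(everything(), ~ tibble(col1 = .x) %>%
             add_count(col1) %>%
             pull(n)))

-output
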
  Breed GeneA GeneB GeneC
1     2     4     1     2
2     2     1     3     1
3     1     4     1     1
4     3     4     3     2
5     3     4     3     2
6     3     1     1     2
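
Since the question mentions a large data set, it may be worth timing the options on something closer to the real size before settling on one. The following is only a rough sketch on simulated data: the vector length, the number of distinct values, and the seed are all assumptions, and only the two base R approaches are compared. No timings are shown here because they depend on the machine and on the data.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable
x <- sample(1:20, 1e5, replace = TRUE)  # simulated allele codes (assumed size)

# system.time() evaluates the expression, so the assignments are kept
time_ave   <- system.time(by_ave   <- ave(x, x, FUN = length))
time_table <- system.time(by_table <- as.integer(table(x)[as.character(x)]))

# Both approaches should produce the same counts
identical(by_ave, by_table)

# Elapsed seconds for each approach
time_ave["elapsed"]
time_table["elapsed"]
```

Repeating the timing a few times, or on a column copied from the real data, should give a steadier picture than a single run.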

data

df <- structure(list(Breed = c("Collie", "Collie", "Lab", "Pug", "Pug", 
"Pug"), GeneA = c(3L, 5L, 3L, 3L, 3L, 4L), GeneB = c(5L, 7L, 
3L, 7L, 7L, 4L), GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)), 
class = "data.frame", row.names = c(NA, -6L))