Home > other >  Creating loop to count the number of unique values in column based on values in another column
Creating loop to count the number of unique values in column based on values in another column

Time:09-15

So, for example, I have the following dataframe, data:

col1 col2
1 5
1 5
1 3
2 10
2 11
3 11

Now, I want to make a new column, col3, which gives me the number of unique values in col2 for every grouping in col1.

So far, I have the following code:

length(unique(data$col2[data$col1 == 1]))

Which would here return the number 2.

However, I'm having a hard time making a loop that goes through all the values in col1 to create the new column, col3.

CodePudding user response:

You want the counts for every row, so using a for loop you would do

data$col3 <- NA_real_
for (i in seq_len(nrow(data))) {
  data$col3[i] <- length(unique(data$col2[data$col1 == data$col1[i]]))
}
data
#   col1 col2 col3
# 1    1    5    2
# 2    1    5    2
# 3    1    3    2
# 4    2   10    2
# 5    2   11    2
# 6    3   11    1

However, using for loops in R is mostly inefficient, and in this case we can use the grouping function ave which comes with R.

data <- transform(data, col3=ave(col2, col1, FUN=\(x) length(unique(x))))
data
#   col1 col2 col3
# 1    1    5    2
# 2    1    5    2
# 3    1    3    2
# 4    2   10    2
# 5    2   11    2
# 6    3   11    1

Data:

data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L, 
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA, 
-6L))

CodePudding user response:

We can use n_distinct after grouping

library(dplyr)
data <- data %>%
   group_by(col1) %>%
   mutate(col3 = n_distinct(col2)) %>%
   ungroup

-output

data
# A tibble: 6 × 3
   col1  col2  col3
  <int> <int> <int>
1     1     5     2
2     1     5     2
3     1     3     2
4     2    10     2
5     2    11     2
6     3    11     1

Or with data.table

library(data.table)
setDT(data)[, col3 := uniqueN(col2), col1]

data

data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L, 
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA, 
-6L))
  • Related