Count unique and ambiguous elements in each row compared to the whole dataset


I have a dataset with thousands of rows and almost a hundred columns. Each row only contains unique elements; however, these elements may also be found in other rows.

Basically, I want to create two new columns in my data frame: one to store how many Unique elements and another to store how many Ambiguous elements there are in a given row, compared to the whole dataset.

Note there are NAs in the dataframe that should not be considered when counting unique and ambiguous elements.

df <- data.frame(
  col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
  col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'), 
  col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'), 
  col4 = c(NA, NA, NA, 'Ce', 'Im')
)

In the data frame created above, Ab is not unique, so row 1 has 2 unique elements and 1 ambiguous element when compared to the whole dataset.

In my expected output, Unique in row 1 would be equal to 2, and Ambiguous = 1. In row five, it would be 4 and 0, respectively.

I've searched for possible solutions, but most only deal with unique or repeated elements in a particular row, or across multiple rows for a particular column. Anyway, any help would be greatly appreciated.

CodePudding user response:

Another method, which avoids some recalculation by tabulating all values once up front.

# Tabulate all values once so we don't recount them for every row
# (table() drops NAs by default), then keep the values that appear
# more than once anywhere in the data.
freqs <- table(as.matrix(df))
dupes <- names(freqs[freqs > 1])

# Check the values for (non-)duplication.
is_dupe <- rowSums(apply(df, 2, "%in%", dupes))
not_dupe <- rowSums(apply(df, 2, function(x) {!(x %in% dupes | is.na(x))}))

# Add the columns after we calculated the counts to avoid including them.
df$ambiguous <- is_dupe
df$unique <- not_dupe
df

#   col1 col2 col3 col4 ambiguous unique
# 1   Ab   Ac  Acc <NA>         1      2
# 2   Cd   Ce <NA> <NA>         1      1
# 3   Ef   Eg   Ab <NA>         1      2
# 4   Gh   Gi  Gef   Ce         1      3
# 5   Ij   Ik   Il   Im         0      4
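The same counting can also be done in one pass over the matrix, without apply(). A minimal sketch of that variant, assuming df is still the original four-column data frame (before the count columns are added); the variable names here are illustrative:

# One-pass variant of the same idea, assuming `df` holds only the
# original four value columns.
vals  <- as.matrix(df)
dupes <- names(which(table(vals) > 1))   # table() drops NAs by default

is_dup  <- matrix(vals %in% dupes, nrow = nrow(vals))   # TRUE where the value repeats in the data
is_uniq <- !is_dup & !is.na(vals)                       # non-NA values seen only once overall

data.frame(df, ambiguous = rowSums(is_dup), unique = rowSums(is_uniq))

This gives the same counts as the apply() version above; it just skips the column-wise apply calls.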

CodePudding user response:

How about something like this:

df <- data.frame(
  col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
  col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'), 
  col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'), 
  col4 = c(NA, NA, NA, 'Ce', 'Im')
)

uvals <- avals <- rep(NA, nrow(df))

for (i in 1:nrow(df)) {
  # All values that occur in the other rows (NAs dropped).
  other_vals <- na.omit(c(unique(as.matrix(df[-i, ]))))
  # For each non-NA value in row i, is it also present in another row?
  tmp <- na.omit(as.matrix(df)[i, ]) %in% other_vals
  uvals[i] <- sum(!tmp)  # values found only in this row
  avals[i] <- sum(tmp)   # values shared with other rows
}

library(dplyr)  # needed for %>% and mutate()

df <- df %>% 
  mutate(unique = uvals, 
         ambiguous = avals)

df
#   col1 col2 col3 col4 unique ambiguous
# 1   Ab   Ac  Acc <NA>      2         1
# 2   Cd   Ce <NA> <NA>      1         1
# 3   Ef   Eg   Ab <NA>      2         1
# 4   Gh   Gi  Gef   Ce      3         1
# 5   Ij   Ik   Il   Im      4         0
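For reference, the same per-row comparison can be written with sapply() instead of an explicit loop. A sketch, again assuming df contains only the original four value columns:

# Per-row comparison via sapply(), assuming `df` has no count columns yet.
mat <- as.matrix(df)
counts <- t(sapply(seq_len(nrow(mat)), function(i) {
  other_vals <- na.omit(unique(c(mat[-i, ])))   # values present in the other rows
  row_vals   <- na.omit(mat[i, ])               # non-NA values in this row
  c(unique = sum(!row_vals %in% other_vals),
    ambiguous = sum(row_vals %in% other_vals))
}))
cbind(df, counts)

The logic is unchanged; the loop body just moves into an anonymous function, and the resulting 2-column matrix is bound onto the data frame.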