I have a dataset with thousands of rows and almost a hundred columns. Each row only contains unique elements, however, these elements may also be found in other rows.
Basically, I want to create two new columns in my data frame, one to store how many Unique and another to store how many Ambiguous elements there are in a given row but compared to the whole dataset.
Note there are NAs in the dataframe that should not be considered when counting unique and ambiguous elements.
df <- data.frame(
col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
col4 = c(NA, NA, NA, 'Ce', 'Im')
)
In the dataframe created above, Ab is not unique, so in row 1 there are 2 unique and 1 ambiguous elements when compared to the whole dataset.
In my expected output, Unique in row 1 would be equal to 2, and Ambiguous = 1. In row five, it would be 4 and 0, respectively.
I've searched for possible solutions, but most only deals with unique or repeated elements in a particular row, or across multiple rows for a particular column. Anyway, any help would be greatly appreciated.
CodePudding user response:
Another method avoiding some recalculations.
# First we get the duplicates to avoid recounting every time.
freqs <- table(as.matrix(df))
dupes <- names(freqs[freqs > 1])
# Check the values for (non-)duplication.
is_dupe <- rowSums(apply(df, 2, "%in%", dupes))
not_dupe <- rowSums(apply(df, 2, function(x) {!(x %in% dupes | is.na(x))}))
# Add the columns after we calculated the counts to avoid including them.
df$ambiguous <- is_dupe
df$unique <- not_dupe
df
# col1 col2 col3 col4 ambiguous unique
# 1 Ab Ac Acc <NA> 1 2
# 2 Cd Ce <NA> <NA> 1 1
# 3 Ef Eg Ab <NA> 1 2
# 4 Gh Gi Gef Ce 1 3
# 5 Ij Ik Il Im 0 4
CodePudding user response:
How about something like this:
df <- data.frame(
col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
col4 = c(NA, NA, NA, 'Ce', 'Im')
)
uvals <- avals <- rep(NA, nrow(df))
for(i in 1:nrow(df)){
other_vals <- na.omit(c(unique(as.matrix(df[-i,]))))
tmp <- na.omit(as.matrix(df)[i,]) %in% other_vals
uvals[i] <- sum(tmp == 0, na.rm=TRUE)
avals[i] <- sum(tmp == 1, na.rm=TRUE)
}
df <- df %>%
mutate(unique = uvals,
ambiguous = avals)
df
# col1 col2 col3 col4 unique ambiguous
# 1 Ab Ac Acc <NA> 2 1
# 2 Cd Ce <NA> <NA> 1 1
# 3 Ef Eg Ab <NA> 2 1
# 4 Gh Gi Gef Ce 3 1
# 5 Ij Ik Il Im 4 0