How can I find out which items are present in 95% of columns?-CodePudding

I have a data table with 20 unique columns with items in a table.

I want to find the type of items that are present in 95-100% of the columns. So, I want to find items that are present in 19/20 items, but it doesn't matter what combination of columns that is. As long as the item is in 19/20 items, then I want it listed.

I am so at a loss here. I have little experience but am proficient in navigating python and R.

I wanted to try sort & list in python, but it only tells me items that are present in 100% of columns.

If I use dput(df) on my data frame, I get the following (I only listed 2/19 columns)

#> dput(df)
structure(list(RSTM = c("hslU", "hslV", "coaA", "pckA", "divL", 
"tsaE", "group_1759", "addA", "group_2373", "fpgS", "accD", "coaE", 
"dnaQ", "secB", "group_1964", "gyrB_2", "maf", "hemE", "group_1532", 
"rho", "mnmE", "mnmG", "rsmG", "parA", "parB", "group_1348", 
"group_2377", "leuS", "group_2378", "group_2379", "ftsK", "lolA", 
"group_1017", "rnhA", "thrB", "trpS", "murJ_1", "group_1289", 
"xseA", "group_2420", "rimP", "nusA", "infB", "rbfA", "rpsO", 
...
-1136L), class = "data.frame")

CodePudding user response：

Assuming you are using a pandas data frame to hold your data, you can get a set of the unique values, and then iterate over them to determine their occupation in each column.

import pandas as pd

df = pd.read_csv('path/to/file.csv')

def in_col_fraction(df, value):
    in_cols = [(value in df[c]) for c in df]
    return sum(in_cols) / len(in_cols)


# this holds the column presence of all items in your data
presence_dict = {
    value: in_col_fraction(df, value)
    for value in df.stack().unique()
}

# this is the list of just the items that have 95% occupancy or higher
over_95_list = [v for v in df.stack().unique() if (in_col_fraction(df, v) >= 0.95)]

CodePudding user response：

Here’s an R solution. This uses an example dataframe with four columns, and finds values appearing in at least 75% of columns.

all_vals <- unique(unlist(dat))

pct_w_val <- sapply(
  all_vals, 
  \(val) mean(sapply(dat, \(col) val %in% col))
)

all_vals[pct_w_val >= .75]
# [1] "x" "e" "p" "v" "s" "q" "o" "w" "n"

Example data:

set.seed(13)

dat <- data.frame(
  a = sample(letters, 20, replace = TRUE),
  b = sample(letters, 20, replace = TRUE),
  c = sample(letters, 20, replace = TRUE),
  d = sample(letters, 20, replace = TRUE)
)

dat

   a b c d
1  x x q a
2  c o l h
3  e q x i
4  j q j a
5  m k v v
6  f d w w
7  p x g a
8  v w o w
9  d n b p
10 x u q o
11 s c n s
12 q c y o
13 j e y f
14 o d r h
15 w b s n
16 w a e k
17 e s r i
18 l x n t
19 q s m q
20 v v p y