I have a data table with 20 unique columns with items in a table.
I want to find the type of items that are present in 95-100% of the columns. So, I want to find items that are present in 19/20 items, but it doesn't matter what combination of columns that is. As long as the item is in 19/20 items, then I want it listed.
I am so at a loss here. I have little experience but am proficient in navigating python and R.
I wanted to try sort & list in python, but it only tells me items that are present in 100% of columns.
If I use dput(df) on my data frame, I get the following (I only listed 2/19 columns)
#> dput(df)
structure(list(RSTM = c("hslU", "hslV", "coaA", "pckA", "divL",
"tsaE", "group_1759", "addA", "group_2373", "fpgS", "accD", "coaE",
"dnaQ", "secB", "group_1964", "gyrB_2", "maf", "hemE", "group_1532",
"rho", "mnmE", "mnmG", "rsmG", "parA", "parB", "group_1348",
"group_2377", "leuS", "group_2378", "group_2379", "ftsK", "lolA",
"group_1017", "rnhA", "thrB", "trpS", "murJ_1", "group_1289",
"xseA", "group_2420", "rimP", "nusA", "infB", "rbfA", "rpsO",
...
-1136L), class = "data.frame")
CodePudding user response:
Assuming you are using a pandas data frame to hold your data, you can get a set of the unique values, and then iterate over them to determine their occupation in each column.
import pandas as pd
df = pd.read_csv('path/to/file.csv')
def in_col_fraction(df, value):
in_cols = [(value in df[c]) for c in df]
return sum(in_cols) / len(in_cols)
# this holds the column presence of all items in your data
presence_dict = {
value: in_col_fraction(df, value)
for value in df.stack().unique()
}
# this is the list of just the items that have 95% occupancy or higher
over_95_list = [v for v in df.stack().unique() if (in_col_fraction(df, v) >= 0.95)]
CodePudding user response:
Here’s a R solution. This uses an example dataframe with four columns, and find values appearing in at least 75% of columns.
all_vals <- unique(unlist(dat))
pct_w_val <- sapply(
all_vals,
\(val) mean(sapply(dat, \(col) val %in% col))
)
all_vals[pct_w_val >= .75]
# [1] "x" "e" "p" "v" "s" "q" "o" "w" "n"
Example data:
set.seed(13)
dat <- data.frame(
a = sample(letters, 20, replace = TRUE),
b = sample(letters, 20, replace = TRUE),
c = sample(letters, 20, replace = TRUE),
d = sample(letters, 20, replace = TRUE)
)
dat
a b c d
1 x x q a
2 c o l h
3 e q x i
4 j q j a
5 m k v v
6 f d w w
7 p x g a
8 v w o w
9 d n b p
10 x u q o
11 s c n s
12 q c y o
13 j e y f
14 o d r h
15 w b s n
16 w a e k
17 e s r i
18 l x n t
19 q s m q
20 v v p y