Home > Software design >  Pandas DataFrame finding unique values in subsets
Pandas DataFrame finding unique values in subsets

Time:11-10

Newbie question re pandas.

I cannot seem to wrap my head around pandas groupby & other methods to find and locate unique values in subsets. Let alone doing this efficiently and elegantly.

Here is a sample DataFrame (very simplified):

df = pd.DataFrame([
[1, 1, True, False, True, True],
[1, 2, True, False, True, True],
[1, 3, True, False, False, True],
[1, 4, True, True, True, True],
[2, 1, False, False, False, True],
[2, 2, True, True, True, True],
[2, 3, True, True, False, True],
[2, 4, True, True, True, True],
[3, 1, True, True, True, True],
[3, 2, True, True, True, True],
[3, 3, True, True, False, True],
[3, 4, True, True, True, True],
[4, 1, True, True, True, True],
[4, 2, True, True, True, True],
[4, 3, True, True, True, True],
[4, 4, True, True, True, True],
])

What I would like to find falls in two categories:

1/ a list of index and column tuples where I can find True values that only occur once in a column for the same values of a given filter, e.g. grouped by the first or the second columns.

2/ a list of index and column tuples where I can find True values that only occur once in a row.

df['count'] = df.apply(lambda x: x[x == True].count(), axis = 1)
>>> df
    0  1      2      3      4     5  count
0   1  1   True  False   True  True      5
1   1  2   True  False   True  True      4
2   1  3   True  False  False  True      3
3   1  4   True   True   True  True      5
4   2  1  False  False  False  True      2
5   2  2   True   True   True  True      4
6   2  3   True   True  False  True      3
7   2  4   True   True   True  True      4
8   3  1   True   True   True  True      5
9   3  2   True   True   True  True      4
10  3  3   True   True  False  True      3
11  3  4   True   True   True  True      4
12  4  1   True   True   True  True      5
13  4  2   True   True   True  True      4
14  4  3   True   True   True  True      4
15  4  4   True   True   True  True      4

First this counts the ones, not just the True values, but I have not found a way to select and do the count only on columns 2-5.
Next, I really want to find out the index and column for those unique True values in the row.

The result I hope to get, combined between points 1 and 2, is:

[(3, 3), (14, 4), (4, 5)]

In no particular order. Any pointer to sample code that achieves a similar goal would be greatly appreciated.

I know how to do this procedurally, it is rather straight forward, but I would like to find a pandas way of doing things that does not involve for loops, as a way to learn to use pandas better.

CodePudding user response:

Use:

#filter columns between 2:6
#df1 = df.loc[:, 2:6]

#or select only boolean columns
df1 = df.select_dtypes(bool)

#filter rows with only one True per rows
s = df1[df1.sum(axis=1).eq(1)]

#filter rows by one True per groups per columns
f = lambda x: x.loc[:, x.sum(axis=0).eq(1)]
df2 = df1.groupby(df[0]).apply(f)
df3 = df1.groupby(df[1]).apply(f)

#join together and reshape
s = pd.concat([s, df2, df3]).stack()

#get tuples by filtering Trues
tup = s.index[s].tolist()
print (tup)
[(4, 5), (3, 3), (14, 4)]
  • Related