Newbie question re pandas.
I cannot seem to wrap my head around pandas groupby & other methods to find and locate unique values in subsets. Let alone doing this efficiently and elegantly.
Here is a sample DataFrame (very simplified):
df = pd.DataFrame([
[1, 1, True, False, True, True],
[1, 2, True, False, True, True],
[1, 3, True, False, False, True],
[1, 4, True, True, True, True],
[2, 1, False, False, False, True],
[2, 2, True, True, True, True],
[2, 3, True, True, False, True],
[2, 4, True, True, True, True],
[3, 1, True, True, True, True],
[3, 2, True, True, True, True],
[3, 3, True, True, False, True],
[3, 4, True, True, True, True],
[4, 1, True, True, True, True],
[4, 2, True, True, True, True],
[4, 3, True, True, True, True],
[4, 4, True, True, True, True],
])
What I would like to find falls in two categories:
1/ a list of index and column tuples where I can find True values that only occur once in a column for the same values of a given filter, e.g. grouped by the first or the second columns.
2/ a list of index and column tuples where I can find True values that only occur once in a row.
df['count'] = df.apply(lambda x: x[x == True].count(), axis = 1)
>>> df
0 1 2 3 4 5 count
0 1 1 True False True True 5
1 1 2 True False True True 4
2 1 3 True False False True 3
3 1 4 True True True True 5
4 2 1 False False False True 2
5 2 2 True True True True 4
6 2 3 True True False True 3
7 2 4 True True True True 4
8 3 1 True True True True 5
9 3 2 True True True True 4
10 3 3 True True False True 3
11 3 4 True True True True 4
12 4 1 True True True True 5
13 4 2 True True True True 4
14 4 3 True True True True 4
15 4 4 True True True True 4
First this counts the ones, not just the True values, but I have not found a way to select and do the count only on columns 2-5.
Next, I really want to find out the index and column for those unique True values in the row.
The result I hope to get, combined between points 1 and 2, is:
[(3, 3), (14, 4), (4, 5)]
In no particular order. Any pointer to sample code that achieves a similar goal would be greatly appreciated.
I know how to do this procedurally, it is rather straight forward, but I would like to find a pandas way of doing things that does not involve for loops, as a way to learn to use pandas better.
CodePudding user response:
Use:
#filter columns between 2:6
#df1 = df.loc[:, 2:6]
#or select only boolean columns
df1 = df.select_dtypes(bool)
#filter rows with only one True per rows
s = df1[df1.sum(axis=1).eq(1)]
#filter rows by one True per groups per columns
f = lambda x: x.loc[:, x.sum(axis=0).eq(1)]
df2 = df1.groupby(df[0]).apply(f)
df3 = df1.groupby(df[1]).apply(f)
#join together and reshape
s = pd.concat([s, df2, df3]).stack()
#get tuples by filtering Trues
tup = s.index[s].tolist()
print (tup)
[(4, 5), (3, 3), (14, 4)]