Quick question about pandas.
How am I able to find the amount of times this list shows up in this DataFrame?
['A', 'A', 'A', 'A']
Line | Generation | SNP_1 | SNP_2 | SNP_3 | SNP_3 |
---|---|---|---|---|---|
B | 2021 | A | A | A | A |
A | 2022 | A | C | C | G |
A | 2022 | A | C | T | C |
A | 2022 | A | C | C | G |
The idea is: for each row, from column at position [2:] and onwards, does the list match the row, in that same order.
While grouping would be nice it's not neccessary, i.e. groupby(['Line', 'Generation']) and count how many times the list occurs yielding results: "1" for group: Line B/Generation 2021, "0" for Line A/Generation 2022.
CodePudding user response:
grp = df.set_index(["Line", "Generation"]).apply(tuple, axis=1).reset_index(drop=True)
print(df.groupby(["Line", "Generation", grp]).size())
# Line Generation
# A 2022 (A, C, C, G) 2
# (A, C, T, C) 1
# B 2021 (A, A, A, A) 1
# dtype: int64
CodePudding user response:
One option is to apply
a lambda that compares row with list:
out = df.filter(like='SNP').apply(lambda x: x.tolist()==['A', 'A', 'A', 'A'], axis=1).sum()
Another is to use numpy and use its vectorized operations:
out = (df.filter(like='SNP').to_numpy() == np.array(['A', 'A', 'A', 'A'])[:, None]).all(axis=1).sum()
Output:
1 # there is only one row that matches the list in that order