Is there a way to group this and compare?-CodePudding

Quick question about pandas.

How am I able to find the amount of times this list shows up in this DataFrame?

['A', 'A', 'A', 'A']

Line	Generation	SNP_1	SNP_2	SNP_3	SNP_3
B	2021	A	A	A	A
A	2022	A	C	C	G
A	2022	A	C	T	C
A	2022	A	C	C	G

The idea is: for each row, from column at position [2:] and onwards, does the list match the row, in that same order.

While grouping would be nice it's not neccessary, i.e. groupby(['Line', 'Generation']) and count how many times the list occurs yielding results: "1" for group: Line B/Generation 2021, "0" for Line A/Generation 2022.

CodePudding user response：

grp = df.set_index(["Line", "Generation"]).apply(tuple, axis=1).reset_index(drop=True)
print(df.groupby(["Line", "Generation", grp]).size())
# Line  Generation              
# A     2022        (A, C, C, G)    2
#                   (A, C, T, C)    1
# B     2021        (A, A, A, A)    1
# dtype: int64

CodePudding user response：

One option is to apply a lambda that compares row with list:

out = df.filter(like='SNP').apply(lambda x: x.tolist()==['A', 'A', 'A', 'A'], axis=1).sum()

Another is to use numpy and use its vectorized operations:

out = (df.filter(like='SNP').to_numpy() == np.array(['A', 'A', 'A', 'A'])[:, None]).all(axis=1).sum()

Output:

1  # there is only one row that matches the list in that order