Hello I have a dataframe such as
COL1 COL2
A seq1
B seq1
A seq2
A seq3
A seq4
A seq5
B seq9
B seq4
And I would like to only keep row where A and B within COL1
have the same COL2
value.
I should then get;
COL1 COL2
A seq1
B seq1
A seq4
B seq4
Does someone have an idea please ?
CodePudding user response:
You can group by COL2
(unintuitive, isn't it?), and get the number of unique items of COL1
for each group:
filtered_df = df[df.groupby('COL2')['COL1'].transform('nunique') == 2]
Output:
>>> filtered_df
COL1 COL2
0 A seq1
1 B seq1
4 A seq4
7 B seq4
CodePudding user response:
If you COL1
has only 2 values, you could also filter df
for values in "COL1", merge
on COL2
; then melt
:
out = (pd.melt(df[df['COL1']=='A'].merge(df[df['COL1']=='B'], on='COL2'),
id_vars=['COL2'], value_name='COL1')
.drop(columns='variable')[['COL1', 'COL2']])
Output:
COL1 COL2
0 A seq1
1 A seq4
2 B seq1
3 B seq4
CodePudding user response:
You can also use pd.crosstab
:
seq = pd.crosstab(df['COL1'], df['COL2']).all().loc[lambda x: x].index
out = df.loc[df['COL2'].isin(seq)]
print(out)
# Output
COL1 COL2
0 A seq1
1 B seq1
4 A seq4
7 B seq4
Intermediate result:
>>> pd.crosstab(df['COL1'], df['COL2'])
COL2 seq1 seq2 seq3 seq4 seq5 seq9
COL1
A 1 1 1 1 1 0
B 1 0 0 1 0 1