I have the following df:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Q0_0': ["A vs. Z", "A vs. Bc", "B vs. Z", "B vs Bc", "Bc vs. A", "Bc vs. B", "Z vs. A", "Z vs. B", "C vs. A", "Bc vs. A"],
    'Q1_1': [np.random.randint(1, 100) for i in range(10)],
    'Q1_2': np.random.random(10),
    'Q1_3': np.random.randint(2, size=10),
    'Q2_1': [np.random.randint(1, 100) for i in range(10)],
    'Q2_2': np.random.random(10),
    'Q2_3': np.random.randint(2, size=10),
    'Q3_1': [np.random.randint(1, 100) for i in range(10)],
    'Q3_2': np.random.random(10),
    'Q3_3': np.random.randint(2, size=10),
    'Q4_1': [np.random.randint(1, 100) for i in range(10)],
    'Q4_2': np.random.random(10),
    'Q4_3': np.random.randint(2, size=10)
})
It displays as follows:
Q0_0 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q4_1 Q4_2 Q4_3
0 A vs. Z 76 0.475198 0 31 0.785794 0 93 0.713219 0 31 0.549401 0
1 A vs. Bc 36 0.441907 0 28 0.008276 1 79 0.132327 0 61 0.657476 1
2 B vs. Z 68 0.474950 0 49 0.401341 1 1 0.409924 0 13 0.471476 0
3 B vs Bc 74 0.462356 0 42 0.762348 0 16 0.337623 1 76 0.548017 1
4 Bc vs. A 63 0.738769 1 34 0.340055 1 74 0.488053 1 84 0.663768 1
5 Bc vs. B 18 0.384001 1 75 0.188500 1 72 0.464784 1 32 0.355016 1
6 Z vs. A 34 0.700306 1 92 0.348228 1 99 0.347391 0 13 0.810568 0
7 Z vs. B 84 0.262367 0 11 0.217050 0 77 0.144048 0 44 0.262738 0
8 C vs. A 90 0.846719 1 53 0.603059 1 53 0.212426 1 86 0.515018 1
9 Bc vs. A 11 0.492974 0 76 0.351270 0 5 0.297710 1 40 0.185969 1
I want a rule allowing me to consider Z vs. A as a duplicate of A vs. Z, and so on for each b vs. a as a duplicate of a vs. b, in column Q0_0. Then I want to remove the rows considered duplicates.
Expected output is:
Q0_0 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q4_1 Q4_2 Q4_3
0 A vs. Z 76 0.475198 0 31 0.785794 0 93 0.713219 0 31 0.549401 0
1 A vs. Bc 36 0.441907 0 28 0.008276 1 79 0.132327 0 61 0.657476 1
2 B vs. Z 68 0.474950 0 49 0.401341 1 1 0.409924 0 13 0.471476 0
3 B vs Bc 74 0.462356 0 42 0.762348 0 16 0.337623 1 76 0.548017 1
8 C vs. A 90 0.846719 1 53 0.603059 1 53 0.212426 1 86 0.515018 1
Is there a way to do that in my pandas DataFrame?
Any help from your side will be highly appreciated, thanks.
CodePudding user response:
You can use str.extract (or str.split) to get the left/right parts around "vs.", then convert each pair to a frozenset and use duplicated for boolean indexing:
# extract the two names around "vs." and build an order-insensitive frozenset per row
s = df['Q0_0'].str.extract(r'(\w+)\s*vs\.?\s*(\w+)').agg(frozenset, axis=1)
# or
# s = df['Q0_0'].str.split(r'\s*vs\.?\s*', expand=True).agg(frozenset, axis=1)

# keep only the first occurrence of each pair
out = df[~s.duplicated()]
Output:
Q0_0 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q4_1 Q4_2 Q4_3
0 A vs. Z 88 0.664299 0 99 0.102871 0 55 0.905342 0 55 0.789227 1
1 A vs. Bc 71 0.577607 0 99 0.784006 1 39 0.698947 0 82 0.055739 1
2 B vs. Z 81 0.248065 1 9 0.216285 0 13 0.128918 0 49 0.571096 0
3 B vs Bc 95 0.991130 1 80 0.346051 1 54 0.197197 1 30 0.928300 0
8 C vs. A 97 0.440715 0 88 0.986333 1 75 0.161888 0 42 0.831142 0
Intermediates:
s
0 (Z, A)
1 (Bc, A)
2 (Z, B)
3 (Bc, B)
4 (A, Bc)
5 (B, Bc)
6 (Z, A)
7 (Z, B)
8 (C, A)
9 (A, Bc)
dtype: object
~s.duplicated()
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 True
9 False
dtype: bool
CodePudding user response:
I would sort all symbols in the string alphabetically, applying a function that does something like ''.join(sorted(s)) to each value, and then just drop_duplicates.
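A minimal sketch of that idea (not part of the original answer; the temporary key column name is arbitrary): sorting the characters of each Q0_0 value gives an order-insensitive key that drop_duplicates can operate on. Note that, unlike the frozenset approach above, this relies on consistent punctuation and spacing, so "B vs Bc" and "Bc vs. B" would not be matched because of the missing dot, and that row would survive on the sample data.

# build a temporary order-insensitive key by sorting the characters of each matchup,
# drop rows whose key was already seen, then remove the helper column
out = (
    df.assign(key=df['Q0_0'].map(lambda x: ''.join(sorted(x))))
      .drop_duplicates(subset='key')
      .drop(columns='key')
)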