I am trying to compare the data in two data frames for finding rows that are either missing or having different values under the header "A-Score". However, my script gives results that is a bit unexpected. How to fix this?
import pandas as pd
print(df1)
Ensembl_ID length score identity p_value R-Site \
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 110
1 ENSG00000000457 208 42.98 92.857 4.390000e-34 133
2 ENSG00000000457 208 42.98 92.857 4.390000e-34 149
3 ENSG00000000457 208 42.98 92.857 4.390000e-34 164
4 ENSG00000000460 349 56.10 100.000 0.000000e 00 90
5 ENSG00000000460 349 56.10 100.000 0.000000e 00 168
6 ENSG00000000460 349 56.10 100.000 0.000000e 00 187
7 ENSG00000000460 349 56.10 100.000 0.000000e 00 297
8 ENSG00000000460 349 56.10 100.000 0.000000e 00 317
9 ENSG00000000460 349 56.10 100.000 0.000000e 00 336
10 ENSG00000004399 656 130.45 100.000 0.000000e 00 134
11 ENSG00000004399 656 130.45 100.000 0.000000e 00 151
12 ENSG00000004399 656 130.45 100.000 0.000000e 00 153
13 ENSG00000004399 656 130.45 100.000 0.000000e 00 204
14 ENSG00000004399 656 130.45 100.000 0.000000e 00 290
15 ENSG00000004399 656 130.45 100.000 0.000000e 00 298
16 ENSG00000004399 656 130.45 100.000 0.000000e 00 342
17 ENSG00000004399 656 130.45 100.000 0.000000e 00 362
18 ENSG00000004399 656 130.45 100.000 0.000000e 00 431
19 ENSG00000004399 656 130.45 100.000 0.000000e 00 434
20 ENSG00000004399 656 130.45 100.000 0.000000e 00 514
21 ENSG00000004399 656 130.45 100.000 0.000000e 00 516
22 ENSG00000004399 656 130.45 100.000 0.000000e 00 556
23 ENSG00000004399 656 130.45 100.000 0.000000e 00 576
R-PercentPosition R-Score
0 52.884615 0.147
1 63.942308 0.040
2 71.634615 0.105
3 78.846154 0.063
4 25.787966 0.711
5 48.137536 0.094
6 53.581662 0.252
7 85.100287 0.726
8 90.830946 0.024
9 96.275072 0.001
10 20.426829 0.015
11 23.018293 0.017
12 23.323171 0.528
13 31.097561 0.044
14 44.207317 0.008
15 45.426829 0.111
16 52.134146 0.382
17 55.182927 0.042
18 65.701220 0.002
19 66.158537 0.001
20 78.353659 0.014
21 78.658537 0.872
22 84.756098 0.243
23 87.347561 0.115
print(df2)
Ensembl_ID length score identity p_value A-Site \
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 133
1 ENSG00000000457 208 42.98 92.857 4.390000e-34 149
2 ENSG00000000457 208 42.98 92.857 4.390000e-34 164
3 ENSG00000000460 349 56.61 100.000 0.000000e 00 90
4 ENSG00000000460 349 56.61 100.000 0.000000e 00 168
5 ENSG00000000460 349 56.61 100.000 0.000000e 00 187
6 ENSG00000000460 349 56.61 100.000 0.000000e 00 297
7 ENSG00000000460 349 56.61 100.000 0.000000e 00 317
8 ENSG00000000460 349 56.61 100.000 0.000000e 00 336
9 ENSG00000004399 656 131.30 100.000 0.000000e 00 134
10 ENSG00000004399 656 131.30 100.000 0.000000e 00 151
11 ENSG00000004399 656 131.30 100.000 0.000000e 00 153
12 ENSG00000004399 656 131.30 100.000 0.000000e 00 204
13 ENSG00000004399 656 131.30 100.000 0.000000e 00 290
14 ENSG00000004399 656 131.30 100.000 0.000000e 00 298
15 ENSG00000004399 656 131.30 100.000 0.000000e 00 342
16 ENSG00000004399 656 131.30 100.000 0.000000e 00 362
17 ENSG00000004399 656 131.30 100.000 0.000000e 00 431
18 ENSG00000004399 656 131.30 100.000 0.000000e 00 434
19 ENSG00000004399 656 131.30 100.000 0.000000e 00 514
20 ENSG00000004399 656 131.30 100.000 0.000000e 00 516
21 ENSG00000004399 656 131.30 100.000 0.000000e 00 556
22 ENSG00000004399 656 131.30 100.000 0.000000e 00 573
A-PercentPosition A-Score
0 63.942308 0.040
1 71.634615 0.105
2 78.846154 0.063
3 25.787966 0.711
4 48.137536 0.094
5 53.581662 0.252
6 85.100287 0.726
7 90.830946 0.024
8 96.275072 0.001
9 20.426829 0.251
10 23.018293 0.148
11 23.323171 0.021
12 31.097561 0.099
13 44.207317 0.070
14 45.426829 0.065
15 52.134146 0.115
16 55.182927 0.024
17 65.701220 0.425
18 66.158537 0.413
19 78.353659 0.469
20 78.658537 0.519
21 84.756098 0.506
22 87.347561 0.169
df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])
print(df1)
Ensembl_ID length score identity p_value R-Site \
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 110
1 ENSG00000000457 208 42.98 92.857 4.390000e-34 133
2 ENSG00000000457 208 42.98 92.857 4.390000e-34 149
3 ENSG00000000457 208 42.98 92.857 4.390000e-34 164
4 ENSG00000000460 349 56.10 100.000 0.000000e 00 90
5 ENSG00000000460 349 56.10 100.000 0.000000e 00 168
6 ENSG00000000460 349 56.10 100.000 0.000000e 00 187
7 ENSG00000000460 349 56.10 100.000 0.000000e 00 297
8 ENSG00000000460 349 56.10 100.000 0.000000e 00 317
9 ENSG00000000460 349 56.10 100.000 0.000000e 00 336
10 ENSG00000004399 656 130.45 100.000 0.000000e 00 134
11 ENSG00000004399 656 130.45 100.000 0.000000e 00 151
12 ENSG00000004399 656 130.45 100.000 0.000000e 00 153
13 ENSG00000004399 656 130.45 100.000 0.000000e 00 204
14 ENSG00000004399 656 130.45 100.000 0.000000e 00 290
15 ENSG00000004399 656 130.45 100.000 0.000000e 00 298
16 ENSG00000004399 656 130.45 100.000 0.000000e 00 342
17 ENSG00000004399 656 130.45 100.000 0.000000e 00 362
18 ENSG00000004399 656 130.45 100.000 0.000000e 00 431
19 ENSG00000004399 656 130.45 100.000 0.000000e 00 434
20 ENSG00000004399 656 130.45 100.000 0.000000e 00 514
21 ENSG00000004399 656 130.45 100.000 0.000000e 00 516
22 ENSG00000004399 656 130.45 100.000 0.000000e 00 556
23 ENSG00000004399 656 130.45 100.000 0.000000e 00 573
R-PercentPosition R-Score compare_Scores
0 52.884615 0.147 False
1 63.942308 0.040 True
2 71.634615 0.105 True
3 78.846154 0.063 True
4 25.787966 0.711 True
5 48.137536 0.094 True
6 53.581662 0.252 True
7 85.100287 0.726 True
8 90.830946 0.024 True
9 96.275072 0.001 True
10 20.426829 0.015 False
11 23.018293 0.017 False
12 23.323171 0.528 False
13 31.097561 0.044 False
14 44.207317 0.008 False
15 45.426829 0.111 False
16 52.134146 0.382 False
17 55.182927 0.042 False
18 65.701220 0.002 False
19 66.158537 0.001 True
20 78.353659 0.014 False
21 78.658537 0.872 False
22 84.756098 0.243 False
23 87.347561 0.115 True
In the result, As expected Row 0 shows "False", as the R-Site value 110 is absent in df2.
But the R-Score values in the rows 19 and 23 are not the same between df1 and df2. Yet, the result shows "True".
Is there a better way to find the differences between df1 and df2 BASED on values in the "R-Score" column?
CodePudding user response:
I do not think you are not doing what you think you are doing.
By issuing df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])
, you are looking at each value in the column R-Score
in df1 (in your example you are talking about the R-Site
value 110, but you are working with R-Score
, not R-Site
), and check if this value is present in the column A-Score
in df2 (not necessarily at the same row index). So for line 19, R-score is 0.001, and it is present in the column A-score of df2, at line 8, so the answer is True.
If what you want to do is to have False when line x in df1['R-Score'] is different from the same line x in df2['A-Score'], and True otherwise, then you could do something like df1['compare_Scores'] = df1['R-Score'] == df2['A-Score']
.
Please note that for this to work you need df1 and df2 indexes to be aligned, which is not the case in your example (df1 has 24 rows indexed from 0 to 23 while df2 has 23 rows indexed from 0 to 22).
CodePudding user response:
The problem with your logic is "isin" will verify if column value found in any one of the index then it returns True. In your sample data 19th index of df2 value 0.001 is present in 8th index of your df1.
df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])
If you wanted to go for index wise comparison below logic works for you.
df1['compare_Scores'] = df1.R-Score == df2.A-Score