Home > Mobile >  Unexpected results when comparing dataframes' row-wise values in a column
Unexpected results when comparing dataframes' row-wise values in a column

Time:07-08

I am trying to compare the data in two data frames for finding rows that are either missing or having different values under the header "A-Score". However, my script gives results that is a bit unexpected. How to fix this?

import pandas as pd
print(df1)

        Ensembl_ID  length   score  identity       p_value  R-Site  \
0   ENSG00000000457     208   42.98    92.857  4.390000e-34     110   
1   ENSG00000000457     208   42.98    92.857  4.390000e-34     133   
2   ENSG00000000457     208   42.98    92.857  4.390000e-34     149   
3   ENSG00000000457     208   42.98    92.857  4.390000e-34     164   
4   ENSG00000000460     349   56.10   100.000  0.000000e 00      90   
5   ENSG00000000460     349   56.10   100.000  0.000000e 00     168   
6   ENSG00000000460     349   56.10   100.000  0.000000e 00     187   
7   ENSG00000000460     349   56.10   100.000  0.000000e 00     297   
8   ENSG00000000460     349   56.10   100.000  0.000000e 00     317   
9   ENSG00000000460     349   56.10   100.000  0.000000e 00     336   
10  ENSG00000004399     656  130.45   100.000  0.000000e 00     134   
11  ENSG00000004399     656  130.45   100.000  0.000000e 00     151   
12  ENSG00000004399     656  130.45   100.000  0.000000e 00     153   
13  ENSG00000004399     656  130.45   100.000  0.000000e 00     204   
14  ENSG00000004399     656  130.45   100.000  0.000000e 00     290   
15  ENSG00000004399     656  130.45   100.000  0.000000e 00     298   
16  ENSG00000004399     656  130.45   100.000  0.000000e 00     342   
17  ENSG00000004399     656  130.45   100.000  0.000000e 00     362   
18  ENSG00000004399     656  130.45   100.000  0.000000e 00     431   
19  ENSG00000004399     656  130.45   100.000  0.000000e 00     434   
20  ENSG00000004399     656  130.45   100.000  0.000000e 00     514   
21  ENSG00000004399     656  130.45   100.000  0.000000e 00     516   
22  ENSG00000004399     656  130.45   100.000  0.000000e 00     556   
23  ENSG00000004399     656  130.45   100.000  0.000000e 00     576

R-PercentPosition  R-Score  
0           52.884615    0.147  
1           63.942308    0.040  
2           71.634615    0.105  
3           78.846154    0.063  
4           25.787966    0.711  
5           48.137536    0.094  
6           53.581662    0.252  
7           85.100287    0.726  
8           90.830946    0.024  
9           96.275072    0.001  
10          20.426829    0.015  
11          23.018293    0.017  
12          23.323171    0.528  
13          31.097561    0.044  
14          44.207317    0.008  
15          45.426829    0.111  
16          52.134146    0.382  
17          55.182927    0.042  
18          65.701220    0.002  
19          66.158537    0.001  
20          78.353659    0.014  
21          78.658537    0.872  
22          84.756098    0.243  
23          87.347561    0.115 

print(df2)


         Ensembl_ID  length   score  identity       p_value  A-Site  \
0   ENSG00000000457     208   42.98    92.857  4.390000e-34     133   
1   ENSG00000000457     208   42.98    92.857  4.390000e-34     149   
2   ENSG00000000457     208   42.98    92.857  4.390000e-34     164   
3   ENSG00000000460     349   56.61   100.000  0.000000e 00      90   
4   ENSG00000000460     349   56.61   100.000  0.000000e 00     168   
5   ENSG00000000460     349   56.61   100.000  0.000000e 00     187   
6   ENSG00000000460     349   56.61   100.000  0.000000e 00     297   
7   ENSG00000000460     349   56.61   100.000  0.000000e 00     317   
8   ENSG00000000460     349   56.61   100.000  0.000000e 00     336   
9   ENSG00000004399     656  131.30   100.000  0.000000e 00     134   
10  ENSG00000004399     656  131.30   100.000  0.000000e 00     151   
11  ENSG00000004399     656  131.30   100.000  0.000000e 00     153   
12  ENSG00000004399     656  131.30   100.000  0.000000e 00     204   
13  ENSG00000004399     656  131.30   100.000  0.000000e 00     290   
14  ENSG00000004399     656  131.30   100.000  0.000000e 00     298   
15  ENSG00000004399     656  131.30   100.000  0.000000e 00     342   
16  ENSG00000004399     656  131.30   100.000  0.000000e 00     362   
17  ENSG00000004399     656  131.30   100.000  0.000000e 00     431   
18  ENSG00000004399     656  131.30   100.000  0.000000e 00     434   
19  ENSG00000004399     656  131.30   100.000  0.000000e 00     514   
20  ENSG00000004399     656  131.30   100.000  0.000000e 00     516   
21  ENSG00000004399     656  131.30   100.000  0.000000e 00     556   
22  ENSG00000004399     656  131.30   100.000  0.000000e 00     573 

    A-PercentPosition  A-Score  
0           63.942308    0.040  
1           71.634615    0.105  
2           78.846154    0.063  
3           25.787966    0.711  
4           48.137536    0.094  
5           53.581662    0.252  
6           85.100287    0.726  
7           90.830946    0.024  
8           96.275072    0.001  
9           20.426829    0.251  
10          23.018293    0.148  
11          23.323171    0.021  
12          31.097561    0.099  
13          44.207317    0.070  
14          45.426829    0.065  
15          52.134146    0.115  
16          55.182927    0.024  
17          65.701220    0.425  
18          66.158537    0.413  
19          78.353659    0.469  
20          78.658537    0.519  
21          84.756098    0.506  
22          87.347561    0.169 

df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])
print(df1)
         Ensembl_ID  length   score  identity       p_value  R-Site  \
0   ENSG00000000457     208   42.98    92.857  4.390000e-34     110   
1   ENSG00000000457     208   42.98    92.857  4.390000e-34     133   
2   ENSG00000000457     208   42.98    92.857  4.390000e-34     149   
3   ENSG00000000457     208   42.98    92.857  4.390000e-34     164   
4   ENSG00000000460     349   56.10   100.000  0.000000e 00      90   
5   ENSG00000000460     349   56.10   100.000  0.000000e 00     168   
6   ENSG00000000460     349   56.10   100.000  0.000000e 00     187   
7   ENSG00000000460     349   56.10   100.000  0.000000e 00     297   
8   ENSG00000000460     349   56.10   100.000  0.000000e 00     317   
9   ENSG00000000460     349   56.10   100.000  0.000000e 00     336   
10  ENSG00000004399     656  130.45   100.000  0.000000e 00     134   
11  ENSG00000004399     656  130.45   100.000  0.000000e 00     151   
12  ENSG00000004399     656  130.45   100.000  0.000000e 00     153   
13  ENSG00000004399     656  130.45   100.000  0.000000e 00     204   
14  ENSG00000004399     656  130.45   100.000  0.000000e 00     290   
15  ENSG00000004399     656  130.45   100.000  0.000000e 00     298   
16  ENSG00000004399     656  130.45   100.000  0.000000e 00     342   
17  ENSG00000004399     656  130.45   100.000  0.000000e 00     362   
18  ENSG00000004399     656  130.45   100.000  0.000000e 00     431   
19  ENSG00000004399     656  130.45   100.000  0.000000e 00     434   
20  ENSG00000004399     656  130.45   100.000  0.000000e 00     514   
21  ENSG00000004399     656  130.45   100.000  0.000000e 00     516   
22  ENSG00000004399     656  130.45   100.000  0.000000e 00     556   
23  ENSG00000004399     656  130.45   100.000  0.000000e 00     573 

   R-PercentPosition  R-Score  compare_Scores  
0           52.884615    0.147           False  
1           63.942308    0.040            True  
2           71.634615    0.105            True  
3           78.846154    0.063            True  
4           25.787966    0.711            True  
5           48.137536    0.094            True  
6           53.581662    0.252            True  
7           85.100287    0.726            True  
8           90.830946    0.024            True  
9           96.275072    0.001            True  
10          20.426829    0.015           False  
11          23.018293    0.017           False  
12          23.323171    0.528           False  
13          31.097561    0.044           False  
14          44.207317    0.008           False  
15          45.426829    0.111           False  
16          52.134146    0.382           False  
17          55.182927    0.042           False  
18          65.701220    0.002           False  
19          66.158537    0.001            True  
20          78.353659    0.014           False  
21          78.658537    0.872           False  
22          84.756098    0.243           False  
23          87.347561    0.115            True

In the result, As expected Row 0 shows "False", as the R-Site value 110 is absent in df2.

But the R-Score values in the rows 19 and 23 are not the same between df1 and df2. Yet, the result shows "True".

Is there a better way to find the differences between df1 and df2 BASED on values in the "R-Score" column?

CodePudding user response:

I do not think you are not doing what you think you are doing.

By issuing df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score']), you are looking at each value in the column R-Score in df1 (in your example you are talking about the R-Site value 110, but you are working with R-Score, not R-Site), and check if this value is present in the column A-Score in df2 (not necessarily at the same row index). So for line 19, R-score is 0.001, and it is present in the column A-score of df2, at line 8, so the answer is True.

If what you want to do is to have False when line x in df1['R-Score'] is different from the same line x in df2['A-Score'], and True otherwise, then you could do something like df1['compare_Scores'] = df1['R-Score'] == df2['A-Score'].

Please note that for this to work you need df1 and df2 indexes to be aligned, which is not the case in your example (df1 has 24 rows indexed from 0 to 23 while df2 has 23 rows indexed from 0 to 22).

CodePudding user response:

The problem with your logic is "isin" will verify if column value found in any one of the index then it returns True. In your sample data 19th index of df2 value 0.001 is present in 8th index of your df1.

df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])

If you wanted to go for index wise comparison below logic works for you.

df1['compare_Scores'] = df1.R-Score == df2.A-Score
  • Related