How to get count difference between two dataframes-CodePudding

I need to compare two dataframes and get the count difference from each columns

DF1:

 --------------- ------ ------ ------- ---------- 
| City          | Temp | Zone | Score | Activity |
 --------------- ------ ------ ------- ---------- 
| Atlanta       | 10   | 1    | 100   | 400      |
 --------------- ------ ------ ------- ---------- 
| Chicago       | 100  | 2    | 200   | 500      |
 --------------- ------ ------ ------- ---------- 
| Boston        | 100  | 3    | 300   | 600      |
 --------------- ------ ------ ------- ---------- 
| San Francisco | 1000 | 4    | 400   | 700      |
 --------------- ------ ------ ------- ----------

DF2:

 --------------- ------ ------ ------- ---------- 
| City          | Temp | Zone | Score | Activity |
 --------------- ------ ------ ------- ---------- 
| Atlanta       | 10   | 1    | 150   | 400      |
 --------------- ------ ------ ------- ---------- 
| Chicago       | 100  | 2    | 200   | 450      |
 --------------- ------ ------ ------- ---------- 
| Boston        | 100  | 3    | 300   | 650      |
 --------------- ------ ------ ------- ---------- 
| San Francisco | 1200 | 4    | 400   | 750      |
 --------------- ------ ------ ------- ----------

Expected output is:

 --------------- ------ ------ ------- ---------- 
| City          | Temp | Zone | Score | Activity |
 --------------- ------ ------ ------- ---------- 
|  0            | 1    | 0    | 0     | 3        |
 --------------- ------ ------ ------- ----------

I used subtract or exceptAll to get the row which is in df1 and not in df2, but unable to find the count difference. Please note that in realtime, these two dataframes have millions of records. Any help would be much appreciated.

CodePudding user response：

Try as follows:

Setup

import pandas as pd

data1 = {'City': {0: 'Atlanta', 1: 'Chicago', 2: 'Boston', 3: 'San Francisco'}, 
         'Temp': {0: 10, 1: 100, 2: 100, 3: 1000}, 
         'Zone': {0: 1, 1: 2, 2: 3, 3: 4}, 
         'Score': {0: 100, 1: 200, 2: 300, 3: 400}, 
         'Activity': {0: 400, 1: 500, 2: 600, 3: 700}}
df1 = pd.DataFrame(data1)

data2 = {'City': {0: 'Atlanta', 1: 'Chicago', 2: 'Boston', 3: 'San Francisco'}, 
         'Temp': {0: 10, 1: 100, 2: 100, 3: 1200}, 
         'Zone': {0: 1, 1: 2, 2: 3, 3: 4}, 
         'Score': {0: 150, 1: 200, 2: 300, 3: 400}, 
         'Activity': {0: 400, 1: 450, 2: 650, 3: 750}}
df2 = pd.DataFrame(data2)

Code

res = df1.compare(df2, keep_shape=True, align_axis=0).count().div(2).to_frame().T

print(res)

   City  Temp  Zone  Score  Activity
0   0.0   1.0   0.0    1.0       3.0

Or:

res = df1.compare(df2, keep_shape=True, align_axis=0)\
    .loc[(slice(None), slice('self')), :].count().to_frame().T

print(res)

   City  Temp  Zone  Score  Activity
0     0     1     0      1         3

Explanation

Use df.compare with keep_shape=True (keeps cols with no differences), and align_axis=0. We get the following df:

print(df1.compare(df2, keep_shape=True, align_axis=0))
        City    Temp  Zone  Score  Activity
0 self   NaN     NaN   NaN  100.0       NaN
  other  NaN     NaN   NaN  150.0       NaN
1 self   NaN     NaN   NaN    NaN     500.0
  other  NaN     NaN   NaN    NaN     450.0
2 self   NaN     NaN   NaN    NaN     600.0
  other  NaN     NaN   NaN    NaN     650.0
3 self   NaN  1000.0   NaN    NaN     700.0
  other  NaN  1200.0   NaN    NaN     750.0

Now, we simple get the count, and divide the result by 2 (applying Series.div), since the count will be for both "self" (df1) and "other" (df2). I.e. it is double. Alternative would be to use df.loc to select only self from index level 1 and apply count afterwards.
Finally, we use Series.to_frame and df.T to get the data in the required shape.

CodePudding user response：

You can use slicing with df.isin and use df.count to get the number of differences.

df1[(~df1.isin(df2))].count()
 
City        0
Temp        1
Zone        0
Score       1
Activity    3
dtype: int64

CodePudding user response：

Since Spark is a distributed computing engine, if you're using the pyspark api without pandas, unless you predefine the order of your dataframe, you can't do the comparison directly since the row order may be completely different.

df1 = spark.createDataFrame(
    [
        ('Atlanta', 10, 1, 100, 400),
        ('Chicago', 100, 2, 200, 500),
        ('Boston', 100, 3, 300, 600),
        ('San Francisco', 1000, 4, 400, 700)
    ],
    schema=['city', 'temp', 'zone', 'score', 'activity']
)

df2 = spark.createDataFrame(
    [
        ('Atlanta', 10, 1, 150, 400),
        ('Chicago', 100, 2, 200, 450),
        ('Boston', 100, 3, 300, 650),
        ('San Francisco', 1200, 4, 400, 750)
    ],
    schema=['city', 'temp', 'zone', 'score', 'activity']
)

df1.show(10, False)
df2.show(10, False)
 ------------- ---- ---- ----- -------- 
|city         |temp|zone|score|activity|
 ------------- ---- ---- ----- -------- 
|Atlanta      |10  |1   |100  |400     |
|Chicago      |100 |2   |200  |500     |
|Boston       |100 |3   |300  |600     |
|San Francisco|1000|4   |400  |700     |
 ------------- ---- ---- ----- -------- 

 ------------- ---- ---- ----- -------- 
|city         |temp|zone|score|activity|
 ------------- ---- ---- ----- -------- 
|Atlanta      |10  |1   |150  |400     |
|Chicago      |100 |2   |200  |450     |
|Boston       |100 |3   |300  |650     |
|San Francisco|1200|4   |400  |750     |
 ------------- ---- ---- ----- --------

Let's say we use the city to do the ordering:

df1 = df1.withColumn('row_num', func.row_number().over(Window.orderBy('city')))
df2 = df2.withColumn('row_num', func.row_number().over(Window.orderBy('city')))

df1.show(10, False)
df2.show(10, False)
 ------------- ---- ---- ----- -------- ------- 
|city         |temp|zone|score|activity|row_num|
 ------------- ---- ---- ----- -------- ------- 
|Atlanta      |10  |1   |100  |400     |1      |
|Boston       |100 |3   |300  |600     |2      |
|Chicago      |100 |2   |200  |500     |3      |
|San Francisco|1000|4   |400  |700     |4      |
 ------------- ---- ---- ----- -------- ------- 

 ------------- ---- ---- ----- -------- ------- 
|city         |temp|zone|score|activity|row_num|
 ------------- ---- ---- ----- -------- ------- 
|Atlanta      |10  |1   |150  |400     |1      |
|Boston       |100 |3   |300  |650     |2      |
|Chicago      |100 |2   |200  |450     |3      |
|San Francisco|1200|4   |400  |750     |4      |
 ------------- ---- ---- ----- -------- -------

Once we collect the row number of each record, we can do the joining:

join_df = df1.selectExpr('row_num', *[f"{col} AS {col}_1" for col in df1.columns[:-1]])\
    .join(
        df2.selectExpr('row_num', *[f"{col} AS {col}_2" for col in df2.columns[:-1]]),
        on='row_num', how='inner'
    )
join_df.show(10, False)
 ------- ------------- ------ ------ ------- ---------- ------------- ------ ------ ------- ---------- 
|row_num|city_1       |temp_1|zone_1|score_1|activity_1|city_2       |temp_2|zone_2|score_2|activity_2|
 ------- ------------- ------ ------ ------- ---------- ------------- ------ ------ ------- ---------- 
|1      |Atlanta      |10    |1     |100    |400       |Atlanta      |10    |1     |150    |400       |
|2      |Boston       |100   |3     |300    |600       |Boston       |100   |3     |300    |650       |
|3      |Chicago      |100   |2     |200    |500       |Chicago      |100   |2     |200    |450       |
|4      |San Francisco|1000  |4     |400    |700       |San Francisco|1200  |4     |400    |750       |
 ------- ------------- ------ ------ ------- ---------- ------------- ------ ------ ------- ----------

The remaining is just the conditional counting:

join_df\
    .select(
        [
            func.count(func.when(func.col(f"{col}_1")!=func.col(f"{col}_2"), func.lit(1))).alias(col)\
            for col in df1.columns[:-1]
        ]
    ).show(100, False)
 ---- ---- ---- ----- -------- 
|city|temp|zone|score|activity|
 ---- ---- ---- ----- -------- 
|0   |1   |0   |1    |3       |
 ---- ---- ---- ----- --------