I need to compare two dataframes and get the count difference from each columns
DF1:
--------------- ------ ------ ------- ----------
| City | Temp | Zone | Score | Activity |
--------------- ------ ------ ------- ----------
| Atlanta | 10 | 1 | 100 | 400 |
--------------- ------ ------ ------- ----------
| Chicago | 100 | 2 | 200 | 500 |
--------------- ------ ------ ------- ----------
| Boston | 100 | 3 | 300 | 600 |
--------------- ------ ------ ------- ----------
| San Francisco | 1000 | 4 | 400 | 700 |
--------------- ------ ------ ------- ----------
DF2:
--------------- ------ ------ ------- ----------
| City | Temp | Zone | Score | Activity |
--------------- ------ ------ ------- ----------
| Atlanta | 10 | 1 | 150 | 400 |
--------------- ------ ------ ------- ----------
| Chicago | 100 | 2 | 200 | 450 |
--------------- ------ ------ ------- ----------
| Boston | 100 | 3 | 300 | 650 |
--------------- ------ ------ ------- ----------
| San Francisco | 1200 | 4 | 400 | 750 |
--------------- ------ ------ ------- ----------
Expected output is:
--------------- ------ ------ ------- ----------
| City | Temp | Zone | Score | Activity |
--------------- ------ ------ ------- ----------
| 0 | 1 | 0 | 0 | 3 |
--------------- ------ ------ ------- ----------
I used subtract or exceptAll to get the row which is in df1 and not in df2, but unable to find the count difference. Please note that in realtime, these two dataframes have millions of records. Any help would be much appreciated.
CodePudding user response:
Try as follows:
Setup
import pandas as pd
data1 = {'City': {0: 'Atlanta', 1: 'Chicago', 2: 'Boston', 3: 'San Francisco'},
'Temp': {0: 10, 1: 100, 2: 100, 3: 1000},
'Zone': {0: 1, 1: 2, 2: 3, 3: 4},
'Score': {0: 100, 1: 200, 2: 300, 3: 400},
'Activity': {0: 400, 1: 500, 2: 600, 3: 700}}
df1 = pd.DataFrame(data1)
data2 = {'City': {0: 'Atlanta', 1: 'Chicago', 2: 'Boston', 3: 'San Francisco'},
'Temp': {0: 10, 1: 100, 2: 100, 3: 1200},
'Zone': {0: 1, 1: 2, 2: 3, 3: 4},
'Score': {0: 150, 1: 200, 2: 300, 3: 400},
'Activity': {0: 400, 1: 450, 2: 650, 3: 750}}
df2 = pd.DataFrame(data2)
Code
res = df1.compare(df2, keep_shape=True, align_axis=0).count().div(2).to_frame().T
print(res)
City Temp Zone Score Activity
0 0.0 1.0 0.0 1.0 3.0
Or:
res = df1.compare(df2, keep_shape=True, align_axis=0)\
.loc[(slice(None), slice('self')), :].count().to_frame().T
print(res)
City Temp Zone Score Activity
0 0 1 0 1 3
Explanation
- Use
df.compare
withkeep_shape=True
(keeps cols with no differences), andalign_axis=0
. We get the following df:
print(df1.compare(df2, keep_shape=True, align_axis=0))
City Temp Zone Score Activity
0 self NaN NaN NaN 100.0 NaN
other NaN NaN NaN 150.0 NaN
1 self NaN NaN NaN NaN 500.0
other NaN NaN NaN NaN 450.0
2 self NaN NaN NaN NaN 600.0
other NaN NaN NaN NaN 650.0
3 self NaN 1000.0 NaN NaN 700.0
other NaN 1200.0 NaN NaN 750.0
- Now, we simple get the
count
, and divide the result by 2 (applyingSeries.div
), since the count will be for both "self" (df1
) and "other" (df2
). I.e. it is double. Alternative would be to usedf.loc
to select onlyself
from index level 1 and applycount
afterwards. - Finally, we use
Series.to_frame
anddf.T
to get the data in the required shape.
CodePudding user response:
You can use slicing with df.isin
and use df.count
to get the number of differences.
df1[(~df1.isin(df2))].count()
City 0
Temp 1
Zone 0
Score 1
Activity 3
dtype: int64
CodePudding user response:
Since Spark is a distributed computing engine, if you're using the pyspark api without pandas
, unless you predefine the order of your dataframe, you can't do the comparison directly since the row order may be completely different.
df1 = spark.createDataFrame(
[
('Atlanta', 10, 1, 100, 400),
('Chicago', 100, 2, 200, 500),
('Boston', 100, 3, 300, 600),
('San Francisco', 1000, 4, 400, 700)
],
schema=['city', 'temp', 'zone', 'score', 'activity']
)
df2 = spark.createDataFrame(
[
('Atlanta', 10, 1, 150, 400),
('Chicago', 100, 2, 200, 450),
('Boston', 100, 3, 300, 650),
('San Francisco', 1200, 4, 400, 750)
],
schema=['city', 'temp', 'zone', 'score', 'activity']
)
df1.show(10, False)
df2.show(10, False)
------------- ---- ---- ----- --------
|city |temp|zone|score|activity|
------------- ---- ---- ----- --------
|Atlanta |10 |1 |100 |400 |
|Chicago |100 |2 |200 |500 |
|Boston |100 |3 |300 |600 |
|San Francisco|1000|4 |400 |700 |
------------- ---- ---- ----- --------
------------- ---- ---- ----- --------
|city |temp|zone|score|activity|
------------- ---- ---- ----- --------
|Atlanta |10 |1 |150 |400 |
|Chicago |100 |2 |200 |450 |
|Boston |100 |3 |300 |650 |
|San Francisco|1200|4 |400 |750 |
------------- ---- ---- ----- --------
Let's say we use the city to do the ordering:
df1 = df1.withColumn('row_num', func.row_number().over(Window.orderBy('city')))
df2 = df2.withColumn('row_num', func.row_number().over(Window.orderBy('city')))
df1.show(10, False)
df2.show(10, False)
------------- ---- ---- ----- -------- -------
|city |temp|zone|score|activity|row_num|
------------- ---- ---- ----- -------- -------
|Atlanta |10 |1 |100 |400 |1 |
|Boston |100 |3 |300 |600 |2 |
|Chicago |100 |2 |200 |500 |3 |
|San Francisco|1000|4 |400 |700 |4 |
------------- ---- ---- ----- -------- -------
------------- ---- ---- ----- -------- -------
|city |temp|zone|score|activity|row_num|
------------- ---- ---- ----- -------- -------
|Atlanta |10 |1 |150 |400 |1 |
|Boston |100 |3 |300 |650 |2 |
|Chicago |100 |2 |200 |450 |3 |
|San Francisco|1200|4 |400 |750 |4 |
------------- ---- ---- ----- -------- -------
Once we collect the row number of each record, we can do the joining:
join_df = df1.selectExpr('row_num', *[f"{col} AS {col}_1" for col in df1.columns[:-1]])\
.join(
df2.selectExpr('row_num', *[f"{col} AS {col}_2" for col in df2.columns[:-1]]),
on='row_num', how='inner'
)
join_df.show(10, False)
------- ------------- ------ ------ ------- ---------- ------------- ------ ------ ------- ----------
|row_num|city_1 |temp_1|zone_1|score_1|activity_1|city_2 |temp_2|zone_2|score_2|activity_2|
------- ------------- ------ ------ ------- ---------- ------------- ------ ------ ------- ----------
|1 |Atlanta |10 |1 |100 |400 |Atlanta |10 |1 |150 |400 |
|2 |Boston |100 |3 |300 |600 |Boston |100 |3 |300 |650 |
|3 |Chicago |100 |2 |200 |500 |Chicago |100 |2 |200 |450 |
|4 |San Francisco|1000 |4 |400 |700 |San Francisco|1200 |4 |400 |750 |
------- ------------- ------ ------ ------- ---------- ------------- ------ ------ ------- ----------
The remaining is just the conditional counting:
join_df\
.select(
[
func.count(func.when(func.col(f"{col}_1")!=func.col(f"{col}_2"), func.lit(1))).alias(col)\
for col in df1.columns[:-1]
]
).show(100, False)
---- ---- ---- ----- --------
|city|temp|zone|score|activity|
---- ---- ---- ----- --------
|0 |1 |0 |1 |3 |
---- ---- ---- ----- --------