I'd like to compare the difference in data frames. xyz
has all of the same columns as abc
, but it has an additional column.
In the comparison, I'd like match up the two like columns (Sport
) but only show the SportLeague
in the output (if a difference exists, that is). Example, instead of showing 'Soccer' as a difference, show 'Soccer:MLS', which is the adjacent column in xyz
)
Here's a screenshot of the two data frames:
import pandas as pd
import numpy as np
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'Sport' : ['Football', 'Football', 'Basketball', 'Baseball', 'Hockey', 'Soccer'], 'SportLeague' : ['Football:NFL', 'Football:XFL', 'Basketball:NBA', 'Baseball:MLB', 'Hockey:NHL', 'Soccer:MLS'], 'Year' : ['2022','2019', '2022','2022','2022', '2022'], 'ID' : ['2','0', '3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
xyz
Code already tried:
abc.compare(xyz, align_axis=1, keep_shape=False, keep_equal=False)
The error I get is the following (since the data frames don't have the exact same columns):
Example. If xyz['Sport'] does not show up anywhere within abc['Sport'], then show xyz['SportLeague]' as the difference between the data frames
Further clarification of the logic:
Does abc['Sport'] appear anywhere in xyz['Sport']? If not, indicate "Not Found in xyz data frame
". If it does exist, are its corresponding abc['Year'] and abc['ID'] values the same? If not, show "Change from xyz['Year'] and xyz['ID'] to abc['Year'] and abc['ID'].
Does xyz['Sport'] appear anywhere in abc['Sport']? If not, indicate "Remove xyz['SportLeague']
".
What I've explained above is similar to the .compare
method. However, the data frames in this example may not be the same length and have different amounts of variables.
CodePudding user response:
If I understand you correctly, we basically want to merge both DataFrames, and then apply a number of comparisons between both DataFrames, and add a column that explains the course of action to be taken, given a certain result of a given comparison.
Note: in the example here I have added one sport ('Cricket') to your df abc
, to trigger the condition abc['Sport']
does not exist in xyz['Sport']
.
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey','Cricket'], 'Year' : ['2021','2021','2022','2022','2022'], 'ID' : ['1','2','3','4','5']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
print(abc)
Sport Year ID
0 Football 2021 1
1 Basketball 2021 2
2 Baseball 2022 3
3 Hockey 2022 4
4 Cricket 2022 5
I've left xyz
unaltered. Now, let's merge these two dfs:
df = xyz.merge(abc, on='Sport', how='outer', suffixes=('_xyz','_abc'))
print(df)
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
Now, we have a df where we can evaluate your set of conditions using np.select(conditions, choices, default)
. Like this:
conditions = [ df.Year_abc.isnull(),
df.Year_xyz.isnull(),
(df.Year_xyz != df.Year_abc) & (df.ID_xyz != df.ID_abc),
df.Year_xyz != df.Year_abc,
df.ID_xyz != df.ID_abc
]
choices = [ 'Sport not in abc',
'Sport not in xyz',
'Change year and ID to xyz',
'Change year to xyz',
'Change ID to xyz']
df['action'] = np.select(conditions, choices, default=np.nan)
Result as below with a new column action
with notes on which course of action to take.
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc \
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
action
0 Change year and ID to xyz # match, but mismatch year and ID
1 Change year and ID to xyz # match, but mismatch year and ID
2 Sport not in abc # no match: Sport in xyz, but not in abc
3 Change ID to xyz # match, but mismatch ID
4 Change year and ID to xyz # match, but mismatch year and ID
5 nan # complete match: no action needed
6 Sport not in xyz # no match: Sport in abc, but not in xyz
Let me know if this is a correct interpretation of what you are looking to achieve.