Comparing Similar Data Frames with Like-Columns in Python-CodePudding

I'd like to compare the difference in data frames. xyz has all of the same columns as abc, but it has an additional column.

In the comparison, I'd like match up the two like columns (Sport) but only show the SportLeague in the output (if a difference exists, that is). Example, instead of showing 'Soccer' as a difference, show 'Soccer:MLS', which is the adjacent column in xyz)

Here's a screenshot of the two data frames:

import pandas as pd
import numpy as np

abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'Sport' : ['Football', 'Football', 'Basketball', 'Baseball', 'Hockey', 'Soccer'], 'SportLeague' : ['Football:NFL', 'Football:XFL', 'Basketball:NBA', 'Baseball:MLB', 'Hockey:NHL', 'Soccer:MLS'], 'Year' : ['2022','2019', '2022','2022','2022', '2022'], 'ID' : ['2','0', '3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
xyz

Code already tried:

abc.compare(xyz, align_axis=1, keep_shape=False, keep_equal=False)

The error I get is the following (since the data frames don't have the exact same columns):

Example. If xyz['Sport'] does not show up anywhere within abc['Sport'], then show xyz['SportLeague]' as the difference between the data frames

Further clarification of the logic:

Does abc['Sport'] appear anywhere in xyz['Sport']? If not, indicate "Not Found in xyz data frame". If it does exist, are its corresponding abc['Year'] and abc['ID'] values the same? If not, show "Change from xyz['Year'] and xyz['ID'] to abc['Year'] and abc['ID'].

Does xyz['Sport'] appear anywhere in abc['Sport']? If not, indicate "Remove xyz['SportLeague']".

What I've explained above is similar to the .compare method. However, the data frames in this example may not be the same length and have different amounts of variables.

CodePudding user response：

If I understand you correctly, we basically want to merge both DataFrames, and then apply a number of comparisons between both DataFrames, and add a column that explains the course of action to be taken, given a certain result of a given comparison.

Note: in the example here I have added one sport ('Cricket') to your df abc, to trigger the condition abc['Sport'] does not exist in xyz['Sport'].

abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey','Cricket'], 'Year' : ['2021','2021','2022','2022','2022'], 'ID' : ['1','2','3','4','5']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
print(abc)

        Sport  Year ID
0    Football  2021  1
1  Basketball  2021  2
2    Baseball  2022  3
3      Hockey  2022  4
4     Cricket  2022  5

I've left xyz unaltered. Now, let's merge these two dfs:

df = xyz.merge(abc, on='Sport', how='outer', suffixes=('_xyz','_abc'))
print(df)

        Sport     SportLeague Year_xyz ID_xyz Year_abc ID_abc
0    Football    Football:XFL     2019      0     2021      1
1    Football    Football:NFL     2022      2     2021      1
2      Soccer      Soccer:MLS     2022      1      NaN    NaN
3    Baseball    Baseball:MLB     2022      2     2022      3
4  Basketball  Basketball:NBA     2022      3     2021      2
5      Hockey      Hockey:NHL     2022      4     2022      4
6     Cricket             NaN      NaN    NaN     2022      5

Now, we have a df where we can evaluate your set of conditions using np.select(conditions, choices, default). Like this:

conditions  = [ df.Year_abc.isnull(), 
               df.Year_xyz.isnull(), 
               (df.Year_xyz != df.Year_abc) & (df.ID_xyz != df.ID_abc),
               df.Year_xyz != df.Year_abc,
               df.ID_xyz != df.ID_abc
               ]

choices     = [ 'Sport not in abc', 
               'Sport not in xyz', 
               'Change year and ID to xyz', 
               'Change year to xyz',
               'Change ID to xyz']
    
df['action'] = np.select(conditions, choices, default=np.nan)

Result as below with a new column action with notes on which course of action to take.

        Sport     SportLeague Year_xyz ID_xyz Year_abc ID_abc  \
0    Football    Football:XFL     2019      0     2021      1   
1    Football    Football:NFL     2022      2     2021      1   
2      Soccer      Soccer:MLS     2022      1      NaN    NaN   
3    Baseball    Baseball:MLB     2022      2     2022      3   
4  Basketball  Basketball:NBA     2022      3     2021      2   
5      Hockey      Hockey:NHL     2022      4     2022      4   
6     Cricket             NaN      NaN    NaN     2022      5   

                      action  
0  Change year and ID to xyz  # match, but mismatch year and ID
1  Change year and ID to xyz  # match, but mismatch year and ID
2           Sport not in abc  # no match: Sport in xyz, but not in abc
3           Change ID to xyz  # match, but mismatch ID 
4  Change year and ID to xyz  # match, but mismatch year and ID 
5                        nan  # complete match: no action needed
6           Sport not in xyz  # no match: Sport in abc, but not in xyz

Let me know if this is a correct interpretation of what you are looking to achieve.