Home > OS >  Differences in Data Frames w/ Partial Matches for Python | Multiple Separators
Differences in Data Frames w/ Partial Matches for Python | Multiple Separators

Time:05-29

I'm looking for a summary of the differences between the below data frames, but with partial matching counting as a non-difference. The separators that distinguish partial matches from full matches are either : or - (or a combination of both).

Example of a partial match would be Football-001 and Football-001:NFL-Lombardi. Alternatively, Football-002 would not be considered a partial match to Football-001:NFL-Lombardi

import pandas as pd
import numpy as np

abc = {'Sport' : ['Football-001', 'Basketball-001', 'Baseball-002', 'Hockey-002'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'SportLeague' : ['Football-001:NFL-Lombardi', 'Basketball-001-NBA-OBrien', 'Baseball-002:MLB:Commissioner', 'Hockey-002-NHL:Stanley', 'Soccer:MLS-Phillip:Anschutz'], 'Year' : ['2022','2022','2022','2022', '2022'], 'ID' : ['2','3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
print(abc, xyz)

Code already tried is below. This will work for one separator, but not multiple (NOTE: The below only appears to work for one separator, the :; I'm looking for a solution suggestion that will work for either : or -):

xyz = xyz.assign(**xyz['SportLeague'].str.split(':', expand=True).set_axis(['Sport','League'], axis=1))
xyz_c = xyz.reindex(abc.columns, axis=1)
xyz_c.compare(abc.reindex_like(xyz_c), keep_shape=True, keep_equal=True)
series.str.split(re.compile("[:-]"))

Example of data before comparison:
[![enter image description here][1]][1]



Example of summary df:
[![enter image description here][2]][2]


  [1]: https://i.stack.imgur.com/fAcDf.png
  [2]: https://i.stack.imgur.com/TzWy9.png

CodePudding user response:

Based on the comments, try this:

import pandas as pd
import numpy as np

abc = {'Sport' : ['Football-001', 'Basketball-001', 'Baseball-002', 'Hockey-002'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'SportLeague' : ['Football-001:NFL-Lombardi', 'Basketball-001-NBA-OBrien', 'Baseball-002:MLB:Commissioner', 'Hockey-002-NHL:Stanley', 'Soccer:MLS-Phillip:Anschutz'], 'Year' : ['2022','2022','2022','2022', '2022'], 'ID' : ['2','3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
tm = '(' '|'.join([f'{x}' for x in abc['Sport']]) ')'
temp = xyz['SportLeague'].str.extract(tm)
merged = temp.merge(abc, left_on=0, right_on='Sport').drop(0, axis=1)
not_merged = xyz[temp[0].isna()]
not_merged = not_merged.rename({x:y for (x,y) in zip(not_merged.columns, merged.columns)}, axis=1)
merged['Changes'] = 1
not_merged['Changes'] = 0
pd.concat([merged, not_merged])

Output:

    Sport   Year    ID  Changes
0   Football-001    2021    1   1
1   Baseball-002    2022    3   1
2   Basketball-001  2021    2   1
3   Hockey-002  2022    4   1
4   Soccer:MLS-Phillip:Anschutz 2022    1   0
  • Related