Home > Software design >  Best way to merge two DataFrames that result in duplicates with the exception of one column?
Best way to merge two DataFrames that result in duplicates with the exception of one column?

Time:09-29

I would like to merge two DataFrames, one frame containing ID, Team Name and Team Start Date and End Date. The other frame containing ID, Service Date, and Current Primary Team. Essentially I want to see if the person was working on a Team when the service occurred. I'm hoping to merge on "ID", compare "Service Date" to "Date Team Start" and "Date Team End" with matching "ID", and return a "Team" based on having "Service Date" between the Start and End Date favoring the current "Primary Team" if dates overlap. An ID can be on more than one team at a time.

I have a table that looks something like the following:

Table_1

| ID    | Team Assigned | Date Team Start | Date Team End |
___________________________________________________________
| 23    |  Red          |   2022-09-01    |  2022-09-29   |
| 23    |  Blue         |   2022-08-01    |  2022-09-15   |
| 23    |  Green        |   2022-09-27    |  Current      |
| 14    |  Green        |   2022-08-01    |  2022-08-17   |
| 14    |  Purple       |   2022-08-15    |  Current      |
| 07    |  Blue         |   2022-07-03    |  Current      |
| 07    |  Red          |   2022-07-03    |  2022-07-05   |
| 07    |  Purple       |   2022-05-01    |  2022-06-24   |

Table_2

| ID    |  Service Date |  Current Primary Team |
| 07    |  2022-08-01   |  Blue                 |
| 07    |  2022-05-03   |  Blue                 |
| 23    |  2022-08-15   |  Green                |
| 23    |  2022-09-27   |  Green                |
| 14    |  2022-08-12   |  Purple               |

What I want after merging is something like:

| ID    |  Service Date |  Current Primary Team | Assumed Primary Team |
| 07    |  2022-08-01   |  Blue                 | Red                  |
| 07    |  2022-05-03   |  Blue                 | Purple               |
| 23    |  2022-08-15   |  Green                | Blue                 |
| 23    |  2022-09-27   |  Green                | Green                |
| 14    |  2022-08-12   |  Purple               | Green                |

I have tried a regular merge as I do not want to lose any team assignment or service dates:

df = pd.merge(Table_1, Table_2, on = ['ID'], how = 'outer')

Which results in nearly duplicate rows such as follows:

| ID    | Team Assigned | Date Team Start | Date Team End | Service Date |
___________________________________________________________
| 23    |  Red          |   2022-09-01    |  2022-09-29   | 2022-08-15
| 23    |  Red          |   2022-09-01    |  2022-09-29   | 2022-09-27
| 23    |  Blue         |   2022-08-01    |  2022-09-15   | 2022-08-15
| 23    |  Blue         |   2022-08-01    |  2022-09-15   | 2022-09-27
| 23    |  Green        |   2022-09-27    |  Current      | 2022-08-15
| 23    |  Green        |   2022-09-27    |  Current      | 2022-09-27
| 14    |  Green        |   2022-08-01    |  2022-08-17   | 2022-08-12
| 14    |  Purple       |   2022-08-15    |  Current      | 2022-08-12
| 07    |  Blue         |   2022-07-03    |  Current      | 2022-08-01
| 07    |  Blue         |   2022-07-03    |  Current      | 2022-05-03
| 07    |  Red          |   2022-07-03    |  2022-07-05   | 2022-08-01
| 07    |  Red          |   2022-07-03    |  2022-07-05   | 2022-05-03
| 07    |  Purple       |   2022-05-01    |  2022-06-24   | 2022-08-01
| 07    |  Purple       |   2022-05-01    |  2022-06-24   | 2022-05-03

I am having a hard time try to teach the program to understand which service date is likely true for each admission entry thus cannot count the number of "Service Dates" each team actually has.

Hopefully this makes sense and someone can help me out here!

Cheers

CodePudding user response:

You'll need to replace 'Current' with today's date. Then you can create a date range for every row between the Start and End dates, and explode those so every day is a new row.

Once this is done you can merge on ID and the dates

df = pd.DataFrame({'ID': [23, 23, 23, 14, 14, 7, 7, 7],
 'Team Assigned': ['Red',  'Blue',  'Green',  'Green',  'Purple',  'Blue',  'Red',  'Purple'],
 'Date Team Start': ['2022-09-01',  '2022-08-01',  '2022-09-27',  '2022-08-01',  '2022-08-15',
  '2022-07-03',  '2022-07-03',  '2022-05-01'],
 'Date Team End': ['2022-09-29',  '2022-09-15',  'Current',  '2022-08-17',  'Current',
  'Current',  '2022-07-05',  '2022-06-24']})

df2 = pd.DataFrame({'ID': [7, 7, 23, 23, 14],
 'Service Date': ['2022-08-01',  '2022-05-03',  '2022-08-15',  '2022-09-27',  '2022-08-12'],
 'Current Primary Team': ['Blue', 'Blue', 'Green', 'Green', 'Purple']})


# Replace current w/today's date
df.loc[df['Date Team End'].eq('Current'), 'Date Team End'] =  pd.Timestamp('today').date()

# Convert dates to date objects
df['Date Team Start'] = pd.to_datetime(df['Date Team Start'])
df['Date Team End'] = pd.to_datetime(df['Date Team End'])

df2['Service Date'] = pd.to_datetime(df2['Service Date'])

# Create new record for every day within date range for team start/end
df['date_range'] = df.apply(lambda x: pd.date_range(x['Date Team Start'],x['Date Team End']), axis=1)
df = df.explode('date_range')

df = df.merge(df2, left_on=['ID','date_range'], right_on=['ID','Service Date']).drop(columns='date_range')

df.rename(columns={'Team Assigned':'Assumed Primary Team'}, inplace=True)

print(df)

Output

        ID  Assumed Primary Team    Date Team Start Date Team End   Service Date    Current Primary Team
0       23                   Red         2022-09-01    2022-09-29   2022-09-27      Green
1       23                 Green         2022-09-27    2022-09-28   2022-09-27      Green
2       23                  Blue         2022-08-01    2022-09-15   2022-08-15      Green
3       14                 Green         2022-08-01    2022-08-17   2022-08-12     Purple
4        7                  Blue         2022-07-03    2022-09-28   2022-08-01       Blue
5        7                Purple         2022-05-01    2022-06-24   2022-05-03       Blue

CodePudding user response:

From your outer join result you can calculate Assumed Primary Team with np.where (import numpy as np - it's a dependency of pandas, so you'll already have it):

df['Assumed Primary Team'] = np.where(
    (
        (df['Date Team Start'] < df['Service Date'])
        &
        (df['Date Team End'] > df['Service Date'])
    ) | (
        (df['Date Team Start'] < df['Service Date'])
        &
        (df['Date Team End'] == 'Current')
    ),
    df['Team Assigned'],
    ''
)

df_result = df.loc[
    df['Assumed Primary Team'] != '',
    ['ID', 'Service Date', 'Current Primary Team', 'Assumed Primary Team']
]

Caveat being that you lose the combinations of ID, Service Time that don't have a match, but you can always filter the original source to get the rows that were lost, add a constant value column Assumed Primary Team with any value you'd like (either empty string or 'no match found' or ...) and then pd.concat it to the result.

String comparison works fine with isodates which is why I did not convert to date-objects. If your source does not have reliable isodates, you'd need to do some data conversion and most likely cleaning, but the logic afterwards stays the same.

Disclaimer: Did not run the code itself because I think the logic is the important part here. If problems arise, feel free to challenge and I'll do a real life example.

  • Related