I have two datasets (Lots and Measurements) that both have datetime indexes but different lengths and columns. The first dataset (Lots) is structured as follows:
Datetime Index | Lot Group | Lot No | Booking Level |
---|---|---|---|
2013-08-03 10:00:00 | 1 | 261291.0 | PROB1H |
2013-08-03 12:00:00 | 1 | 261228.0 | PROB1H |
The other one (Measurements) is structured as follows:
Datetime Index | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
---|---|---|---|---|---|
2013-08-28 10:00:00 | 12345 | True | 46.908 | 3.89 | 29.056 |
2013-08-03 12:00:00 | 78262 | True | 89.457 | 6.88 | 34.918 |
What I am trying to do is merge the two dataframes on the datetime index and get all the columns in both dataframes and if there is a match on the datetime index it would add the MID, Passed? and measurement columns to the Lots dataframeand would also keep the duplicates if there are any and would also keep missing values as NaNs for example:
Let's say the datetime 2013-08-28 10:00:00 is not present in the Lots dataframe but is present in the Measurement dataframe so that would produce:
Datetime Index | Lot Group | Lot No | Booking Level | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
---|---|---|---|---|---|---|---|---|
2013-08-28 10:00:00 | NaN | NaN | NaN | 12345 | True | 46.908 | 3.89 | 29.056 |
and if there is a match for example in the datetime 2013-08-03 12:00:00 it would produce:
Datetime Index | Lot Group | Lot No | Booking Level | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
---|---|---|---|---|---|---|---|---|
2013-08-03 12:00:00 | 1 | 261228.0 | PROB1H | 78262 | True | 89.457 | 6.88 | 34.918 |
The Lots dataframe's datetime index has only unique datetime values but the Measurement dataframe there are duplicate entries so if there is a match with a duplicate entry I would like to get duplicate rows for example:
Let's say the datetime 2021-04-15 22:00:00 is present in both dataframes but is found multiple times in the Measurement dataframe so it would produce the following:
Datetime Index | Lot Group | Lot No | Booking Level | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
---|---|---|---|---|---|---|---|---|
2021-04-15 22:00:00 | 2 | 311000.0 | PROB2H | 34903 | True | 39 | 67 | 50 |
2021-04-15 22:00:00 | 2 | 311000.0 | PROB2H | 34904 | True | 88 | 40.90 | 54.38 |
I have tried different merging but couldn't get the result I wanted I tried:
test = lots.merge(measurement, how = "right",left_index=True, right_index=True)
test2 = lots.merge(measurement, how = "outer",left_index=True, right_index=True)
How would you suggest I go about this, thanks in advance.
CodePudding user response:
You can use join
as well as merge
:
# Dataset Lots
>>> dfL
Lot Group Lot No Booking Level
Datetime Index
2013-08-03 10:00:00 1 261291.0 PROB1H
2013-08-03 12:00:00 1 261228.0 PROB1H
2021-04-15 22:00:00 2 311000.0 PROB2H
# Dataset Measurements
>>> dfM
MID Passed? Measurement1 Measurement2 Measurement3
Datetime Index
2013-08-28 10:00:00 12345 True 46.908 3.89 29.056
2013-08-03 12:00:00 78262 True 89.457 6.88 34.918
2021-04-15 22:00:00 34903 True 39.000 67.00 50.000
2021-04-15 22:00:00 34904 True 88.000 40.90 54.380
# Join version
>>> dfL.join(dfM, how='right'))
Lot Group Lot No Booking Level MID Passed? Lot Group Lot No Booking Level MID Passed? Measurement1 Measurement2 Measurement3
Datetime Index
2013-08-03 12:00:00 1.0 261228.0 PROB1H 78262 True 89.457 6.88 34.918
2013-08-28 10:00:00 NaN NaN NaN 12345 True 46.908 3.89 29.056
2021-04-15 22:00:00 2.0 311000.0 PROB2H 34903 True 39.000 67.00 50.000
2021-04-15 22:00:00 2.0 311000.0 PROB2H 34904 True 88.000 40.90 54.380
# Merge version
>>> dfL.merge(dfM, how='right', left_index=True, right_index=True)
Lot Group Lot No Booking Level MID Passed? Lot Group Lot No Booking Level MID Passed? Measurement1 Measurement2 Measurement3
Datetime Index
2013-08-03 12:00:00 1.0 261228.0 PROB1H 78262 True 89.457 6.88 34.918
2013-08-28 10:00:00 NaN NaN NaN 12345 True 46.908 3.89 29.056
2021-04-15 22:00:00 2.0 311000.0 PROB2H 34903 True 39.000 67.00 50.000
2021-04-15 22:00:00 2.0 311000.0 PROB2H 34904 True 88.000 40.90 54.380
CodePudding user response:
Your merge
attempts are close.
test2 = lots.merge(measurements, how='right', on='Datetime Index')
print(test2)
Datetime Index Lot Group Lot No Booking Level MID Passed? Measurement1 Measurement2 Measurement3
0 2013-08-28 10:00:00 NaN NaN NaN 12345 True 46.908 3.89 29.056
1 2013-08-03 12:00:00 1.0 261228.0 PROB1H 78262 True 89.457 6.88 34.918
This example still works if you omit on='Datetime Index'
, though it's probably better to keep it to show intent. From DataFrame.merge:
If
on
isNone
and not merging on indexes then [merge] defaults to the intersection of the columns in both DataFrames.