I have two datasets, df1 and df2, that look like this:
df1:
Date | Code | City | State | Population | Cases | Deaths |
---|---|---|---|---|---|---|
2020-03 | 10001 | Los Angeles | CA | 5000000 | 122 | 12 |
2020-03 | 10002 | Sacramento | CA | 5400000 | 120 | 2 |
2020-03 | 12223 | Houston | TX | 3500000 | 23 | 11 |
... | ... | ... | ... | ... | ... | ... |
2021-07 | 10001 | Los Angeles | CA | 5000002 | 12220 | 2200 |
2021-07 | 10002 | Sacramento | CA | 5444000 | 211 | 22 |
2021-07 | 12223 | Houston | TX | 4443300 | 2111 | 330 |
df2:
Date | Code | City | State | Quantity x | Quantity y |
---|---|---|---|---|---|
2019-01 | 100015 | LOS ANGELES | CA | 445 | |
2019-01 | 100015 | LOS ANGELES | CA | 330 | |
2019-01 | 100023 | SACRAMENTO | CA | 4450 | 566 |
2019-01 | 1222393 | HOUSTON | TX | 440 | NA |
... | ... | ... | ... | ... | ... |
2021-07 | 100015 | LOS ANGELES | CA | 31113 | 3455 |
2021-07 | 100023 | SACRAMENTO | CA | 3220 | NA |
2021-07 | 1222393 | HOUSTON | TX | NA | 3200 |
As you can see, df2 starts before df1, but they both end on the same date. Also, df1 and df2 both have IDs that share some commonalities but are not equal (in general, df2's codes have one or two more digits than df1's).
Also note that there may be multiple entries for the same date in df2, but with different quantities.
I want to merge the two; more specifically, I want to merge df1 onto df2 so that the result starts on 2019-01 and ends on 2021-07. In that case, Cases and Deaths would be 0 between 2019-01 and 2020-02.
How can I merge df1 onto df2 using Date, City, and State (given that some cities have the same name but are in different states) and using the date specifications mentioned above? I would like my combined data frame, df3, to look like this:
Date | Code | City | State | Quantity x | Quantity y | Population | Cases | Deaths |
---|---|---|---|---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 445 | | | 0 | 0 |
2019-01 | 10002 | Sacramento | CA | 4450 | 566 | | 0 | 0 |
2020-03 | 12223 | Houston | TX | 440 | 4440 | 3500000 | 23 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2021-07 | 10002 | Sacramento | CA | 3220 | NA | 5444000 | 211 | 22 |
Edit: I was thinking maybe I could first cut df1's date range and then do the merge. I am not sure how an outer merge would work with dates that don't necessarily overlap. Perhaps someone has a better idea.
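For concreteness, this is roughly what I was picturing. It is only a sketch: it doesn't deal with the differing City casing or the duplicate rows in df2, and the column names are just the ones from the tables above.
# Trim df1 to the target window (assumes Date is a "YYYY-MM" string, so
# lexicographic comparison works; df1 already starts later than 2019-01).
df1_cut = df1[(df1["Date"] >= "2019-01") & (df1["Date"] <= "2021-07")]
# Outer merge so that df2's earlier dates are kept even where df1 has no rows.
df3 = df2.merge(df1_cut, on=["Date", "City", "State"], how="outer")
# Treat the missing Cases/Deaths on those earlier dates as 0.
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0)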
CodePudding user response:
It sounds like you're looking for the how keyword argument in pd.DataFrame.merge and pd.DataFrame.join.
Here is a sample:
import pandas as pd
df1 = pd.read_json(
    '{"Date":{"0":1583020800000,"1":1583020800000,"2":1583020800000,"3":1625097600000,"4":1625097600000,"5":1625097600000},"City":{"0":"Los Angeles","1":"Sacramento","2":"Houston","3":"Los Angeles","4":"Sacramento","5":"Houston"},"State":{"0":"CA","1":"CA","2":"TX","3":"CA","4":"CA","5":"TX"},"Population":{"0":5000000,"1":5400000,"2":3500000,"3":5000002,"4":5444000,"5":4443300},"Cases":{"0":122,"1":120,"2":23,"3":12220,"4":211,"5":2111},"Deaths":{"0":12,"1":2,"2":11,"3":2200,"4":22,"5":330}}'
)
df2 = pd.read_json(
    '{"Date":{"0":1546300800000,"1":1546300800000,"2":1546300800000,"3":1546300800000,"4":1625097600000,"5":1625097600000,"6":1625097600000},"City":{"0":"LOS ANGELES","1":"LOS ANGELES","2":"SACRAMENTO","3":"HOUSTON","4":"LOS ANGELES","5":"SACRAMENTO","6":"HOUSTON"},"State":{"0":"CA","1":"CA","2":"CA","3":"TX","4":"CA","5":"CA","6":"TX"},"Quantity x":{"0":null,"1":330.0,"2":4450.0,"3":440.0,"4":31113.0,"5":3220.0,"6":null},"Quantity y":{"0":445.0,"1":null,"2":566.0,"3":null,"4":3455.0,"5":null,"6":3200.0}}'
)
print("\ndf1 = \n", df1)
print("\ndf2 = \n", df2)
# Transform df1
df1["City"] = df1["City"].apply(str.upper) # To merge, need consistent casing
df1 = df1.groupby(["Date", "City", "State"])[
    ["Cases", "Deaths"]
].sum()  # Aggregate Cases/Deaths in case a (Date, City, State) key has multiple rows
# Aggregate in df2
df2 = df2.groupby(["Date", "City", "State"])[
    ["Quantity x", "Quantity y"]
].sum()  # implicit skipna=True: NaNs are dropped, so all-NaN groups sum to 0
print("\ndf1' = \n", df1)
print("\ndf2' = \n", df2)
# MERGE: merging on indices
df3 = df1.join(df2, how="outer") # key: "how"
df3[["Cases", "Deaths"]] = (
df3[["Cases", "Deaths"]].fillna(0).astype(int)
) # inplace: downcasting complaint
df3.reset_index(
    inplace=True
)  # Will cause ["Date", "City", "State"] to be ordinary columns, not indices.
print("\ndf3 = \n", df3)
...the output is:
df1 =
Date City State Population Cases Deaths
0 2020-03-01 Los Angeles CA 5000000 122 12
1 2020-03-01 Sacramento CA 5400000 120 2
2 2020-03-01 Houston TX 3500000 23 11
3 2021-07-01 Los Angeles CA 5000002 12220 2200
4 2021-07-01 Sacramento CA 5444000 211 22
5 2021-07-01 Houston TX 4443300 2111 330
df2 =
Date City State Quantity x Quantity y
0 2019-01-01 LOS ANGELES CA NaN 445.0
1 2019-01-01 LOS ANGELES CA 330.0 NaN
2 2019-01-01 SACRAMENTO CA 4450.0 566.0
3 2019-01-01 HOUSTON TX 440.0 NaN
4 2021-07-01 LOS ANGELES CA 31113.0 3455.0
5 2021-07-01 SACRAMENTO CA 3220.0 NaN
6 2021-07-01 HOUSTON TX NaN 3200.0
df1' =
Cases Deaths
Date City State
2020-03-01 HOUSTON TX 23 11
LOS ANGELES CA 122 12
SACRAMENTO CA 120 2
2021-07-01 HOUSTON TX 2111 330
LOS ANGELES CA 12220 2200
SACRAMENTO CA 211 22
df2' =
Quantity x Quantity y
Date City State
2019-01-01 HOUSTON TX 440.0 0.0
LOS ANGELES CA 330.0 445.0
SACRAMENTO CA 4450.0 566.0
2021-07-01 HOUSTON TX 0.0 3200.0
LOS ANGELES CA 31113.0 3455.0
SACRAMENTO CA 3220.0 0.0
df3 =
Date City State Cases Deaths Quantity x Quantity y
0 2019-01-01 HOUSTON TX 0 0 440.0 0.0
1 2019-01-01 LOS ANGELES CA 0 0 330.0 445.0
2 2019-01-01 SACRAMENTO CA 0 0 4450.0 566.0
3 2020-03-01 HOUSTON TX 23 11 NaN NaN
4 2020-03-01 LOS ANGELES CA 122 12 NaN NaN
5 2020-03-01 SACRAMENTO CA 120 2 NaN NaN
6 2021-07-01 HOUSTON TX 2111 330 0.0 3200.0
7 2021-07-01 LOS ANGELES CA 12220 2200 31113.0 3455.0
8 2021-07-01 SACRAMENTO CA 211 22 3220.0 0.0
A few other points:
- City casing needs to be consistent at join/merge time.
- You could also do df1.merge(df2, ..., left_index=True, right_index=True) instead of df1.join. You could also reset the indices via df1.reset_index(inplace=True), etc. after the groupby-sum line(s) and then use .merge(..., on=...) (but the indices are convenient). A rough sketch of that variant follows this list.
- The final values of Quantity {x,y} are floats because NaNs are present. (See the next point.)
- I would be deliberate about your treatment of NaNs vs. auto-filled 0s. In the case of Cases/Deaths it sounds like you had no data BUT you were making the assumption that, in the absence of Cases/Deaths data, the values are 0. For the Quantity {x,y} variables, no such assumption seemed to be warranted.
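For completeness, here is a rough sketch of that reset-index-then-merge variant. It picks up from the df1/df2 frames right after the groupby-sum lines in the sample above and should give the same rows as the join (possibly in a different order):
# Turn the (Date, City, State) index levels back into ordinary columns.
df1 = df1.reset_index()
df2 = df2.reset_index()
# Equivalent of df1.join(df2, how="outer"), but with an explicit `on=` list.
df3 = df1.merge(df2, on=["Date", "City", "State"], how="outer")
# As before: assume missing Cases/Deaths mean 0, but leave the quantities as NaN.
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0).astype(int)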