I have two datasets, df1 and df2, that look like this:
df1:
Date | Code | City | State | Population | Cases | Deaths |
---|---|---|---|---|---|---|
2020-03 | 10001 | Los Angeles | CA | 5000000 | 122 | 12 |
2020-03 | 10002 | Sacramento | CA | 5400000 | 120 | 2 |
2020-03 | 12223 | Houston | TX | 3500000 | 23 | 11 |
... | ... | ... | ... | ... | ... | ... |
2021-07 | 10001 | Los Angeles | CA | 5000002 | 12220 | 2200 |
2021-07 | 10002 | Sacramento | CA | 5444000 | 211 | 22 |
2021-07 | 12223 | Houston | TX | 4443300 | 2111 | 330 |
df2:
Date | Code | City | State | Quantity x | Quantity y |
---|---|---|---|---|---|
2019-01 | 100015 | LOS ANGELES | CA | 445 | |
2019-01 | 100015 | LOS ANGELES | CA | 330 | |
2019-01 | 100023 | SACRAMENTO | CA | 4450 | 566 |
2019-01 | 1222393 | HOUSTON | TX | 440 | NA |
... | ... | ... | ... | ... | ... |
2021-07 | 100015 | LOS ANGELES | CA | 31113 | 3455 |
2021-07 | 100023 | SACRAMENTO | CA | 3220 | NA |
2021-07 | 1222393 | HOUSTON | TX | NA | 3200 |
As you can see, df2 starts before df1, but they both end on the same date. Also, df1 and df2 both have IDs that share some commonalities but are not equal (in general, df2's codes have one or two more digits than df1's).
Also note that there may be multiple entries for the same date in df2, but with different quantities.
I want to merge the two; more specifically, I want to merge df1 onto df2 so that the result starts on 2019-01 and ends on 2021-07. In that case, Cases and Deaths would be 0 between 2019-01 and 2020-02.
How can I merge df1 onto df2 using Date, City, and State (given that some cities have the same name but are in different states) and using the date specifications mentioned above? I would like my combined data frame, df3, to look like this:
Date | Code | City | State | Quantity x | Quantity y | Population | Cases | Deaths |
---|---|---|---|---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 445 | | | 0 | 0 |
2019-01 | 10002 | Sacramento | CA | 4450 | 566 | | 0 | 0 |
2020-03 | 12223 | Houston | TX | 440 | 4440 | 3500000 | 23 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2021-07 | 10002 | Sacramento | CA | 3220 | NA | 5444000 | 211 | 22 |
Edit: I was thinking maybe I could first cut df1's date range and then do the merge. I am not sure how an outer merge would work with dates that don't necessarily overlap. Perhaps someone has a better idea.
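For concreteness, this is roughly what I was picturing. It is only a sketch: it doesn't deal with the differing City casing or the duplicate rows in df2, and the column names are just the ones from the tables above.
# Trim df1 to the target window (assumes Date is a "YYYY-MM" string, so
# lexicographic comparison works; df1 already starts later than 2019-01).
df1_cut = df1[(df1["Date"] >= "2019-01") & (df1["Date"] <= "2021-07")]
# Outer merge so that df2's earlier dates are kept even where df1 has no rows.
df3 = df2.merge(df1_cut, on=["Date", "City", "State"], how="outer")
# Treat the missing Cases/Deaths on those earlier dates as 0.
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0)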
CodePudding user response:
It sounds like you're looking for the how keyword argument in pd.DataFrame.merge and pd.DataFrame.join.
Here is a sample:
import pandas as pd
df1 = pd.read_json(
    '{"Date":{"0":1583020800000,"1":1583020800000,"2":1583020800000,"3":1625097600000,"4":1625097600000,"5":1625097600000},"City":{"0":"Los Angeles","1":"Sacramento","2":"Houston","3":"Los Angeles","4":"Sacramento","5":"Houston"},"State":{"0":"CA","1":"CA","2":"TX","3":"CA","4":"CA","5":"TX"},"Population":{"0":5000000,"1":5400000,"2":3500000,"3":5000002,"4":5444000,"5":4443300},"Cases":{"0":122,"1":120,"2":23,"3":12220,"4":211,"5":2111},"Deaths":{"0":12,"1":2,"2":11,"3":2200,"4":22,"5":330}}'
)
df2 = pd.read_json(
    '{"Date":{"0":1546300800000,"1":1546300800000,"2":1546300800000,"3":1546300800000,"4":1625097600000,"5":1625097600000,"6":1625097600000},"City":{"0":"LOS ANGELES","1":"LOS ANGELES","2":"SACRAMENTO","3":"HOUSTON","4":"LOS ANGELES","5":"SACRAMENTO","6":"HOUSTON"},"State":{"0":"CA","1":"CA","2":"CA","3":"TX","4":"CA","5":"CA","6":"TX"},"Quantity x":{"0":null,"1":330.0,"2":4450.0,"3":440.0,"4":31113.0,"5":3220.0,"6":null},"Quantity y":{"0":445.0,"1":null,"2":566.0,"3":null,"4":3455.0,"5":null,"6":3200.0}}'
)
print("\ndf1 = \n", df1)
print("\ndf2 = \n", df2)
# Transform df1
df1["City"] = df1["City"].apply(str.upper) # To merge, need consistent casing
df1 = df1.groupby(["Date", "City", "State"])[
    ["Cases", "Deaths"]
].sum()  # Aggregate Cases/Deaths in case a (Date, City, State) key has multiple rows
# Aggregate in df2
df2 = df2.groupby(["Date", "City", "State"])[
    ["Quantity x", "Quantity y"]
].sum()  # implicit skipna=True: NaNs are dropped, so all-NaN groups sum to 0
print("\ndf1' = \n", df1)
print("\ndf2' = \n", df2)
# MERGE: merging on indices
df3 = df1.join(df2, how="outer") # key: "how"
df3[["Cases", "Deaths"]] = (
df3[["Cases", "Deaths"]].fillna(0).astype(int)
) # inplace: downcasting complaint
df3.reset_index(
    inplace=True
)  # Will cause ["Date", "City", "State"] to be ordinary columns, not indices.
print("\ndf3 = \n", df3)
...the output is:
df1 =
Date City State Population Cases Deaths
0 2020-03-01 Los Angeles CA 5000000 122 12
1 2020-03-01 Sacramento CA 5400000 120 2
2 2020-03-01 Houston TX 3500000 23 11
3 2021-07-01 Los Angeles CA 5000002 12220 2200
4 2021-07-01 Sacramento CA 5444000 211 22
5 2021-07-01 Houston TX 4443300 2111 330
df2 =
Date City State Quantity x Quantity y
0 2019-01-01 LOS ANGELES CA NaN 445.0
1 2019-01-01 LOS ANGELES CA 330.0 NaN
2 2019-01-01 SACRAMENTO CA 4450.0 566.0
3 2019-01-01 HOUSTON TX 440.0 NaN
4 2021-07-01 LOS ANGELES CA 31113.0 3455.0
5 2021-07-01 SACRAMENTO CA 3220.0 NaN
6 2021-07-01 HOUSTON TX NaN 3200.0
df1' =
Cases Deaths
Date City State
2020-03-01 HOUSTON TX 23 11
LOS ANGELES CA 122 12
SACRAMENTO CA 120 2
2021-07-01 HOUSTON TX 2111 330
LOS ANGELES CA 12220 2200
SACRAMENTO CA 211 22
df2' =
Quantity x Quantity y
Date City State
2019-01-01 HOUSTON TX 440.0 0.0
LOS ANGELES CA 330.0 445.0
SACRAMENTO CA 4450.0 566.0
2021-07-01 HOUSTON TX 0.0 3200.0
LOS ANGELES CA 31113.0 3455.0
SACRAMENTO CA 3220.0 0.0
df3 =
Date City State Cases Deaths Quantity x Quantity y
0 2019-01-01 HOUSTON TX 0 0 440.0 0.0
1 2019-01-01 LOS ANGELES CA 0 0 330.0 445.0
2 2019-01-01 SACRAMENTO CA 0 0 4450.0 566.0
3 2020-03-01 HOUSTON TX 23 11 NaN NaN
4 2020-03-01 LOS ANGELES CA 122 12 NaN NaN
5 2020-03-01 SACRAMENTO CA 120 2 NaN NaN
6 2021-07-01 HOUSTON TX 2111 330 0.0 3200.0
7 2021-07-01 LOS ANGELES CA 12220 2200 31113.0 3455.0
8 2021-07-01 SACRAMENTO CA 211 22 3220.0 0.0
A few other points:
- City casing needs to be consistent at join/merge time.
- You could also do df1.merge(df2, ..., left_index=True, right_index=True) instead of df1.join. You could also reset the indices via df1.reset_index(inplace=True), etc. after the groupby-sum line(s) and then use .merge(..., on=...) (but the indices are convenient). A rough sketch of that variant follows this list.
- The final values of Quantity {x,y} are floats because NaNs are present. (See the next point.)
- I would be deliberate about your treatment of NaNs vs. auto-filled 0s. In the case of Cases/Deaths it sounds like you had no data BUT you were making the assumption that, in the absence of Cases/Deaths data, the values are 0. For the Quantity {x,y} variables, no such assumption seemed to be warranted.
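For completeness, here is a rough sketch of that reset-index-then-merge variant. It picks up from the df1/df2 frames right after the groupby-sum lines in the sample above and should give the same rows as the join (possibly in a different order):
# Turn the (Date, City, State) index levels back into ordinary columns.
df1 = df1.reset_index()
df2 = df2.reset_index()
# Equivalent of df1.join(df2, how="outer"), but with an explicit `on=` list.
df3 = df1.merge(df2, on=["Date", "City", "State"], how="outer")
# As before: assume missing Cases/Deaths mean 0, but leave the quantities as NaN.
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0).astype(int)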