How to Merge two Panel data sets on Date and a combination of columns?


I have two datasets, df1 and df2, that look like this:

df1:

Date Code City State Population Cases Deaths
2020-03 10001 Los Angeles CA 5000000 122 12
2020-03 10002 Sacramento CA 5400000 120 2
2020-03 12223 Houston TX 3500000 23 11
... ... ... ... ... ... ...
2021-07 10001 Los Angeles CA 5000002 12220 2200
2021-07 10002 Sacramento CA 5444000 211 22
2021-07 12223 Houston TX 4443300 2111 330

df2:

Date Code City State Quantity x Quantity y
2019-01 100015 LOS ANGELES CA 445
2019-01 100015 LOS ANGELES CA 330
2019-01 100023 SACRAMENTO CA 4450 566
2019-01 1222393 HOUSTON TX 440 NA
... ... ... ... ... ...
2021-07 100015 LOS ANGELES CA 31113 3455
2021-07 100023 SACRAMENTO CA 3220 NA
2021-07 1222393 HOUSTON TX NA 3200

As you can see, df2 starts before df1, but both end on the same date. Also, df1 and df2 have codes that share some common digits but are not identical (in general, df2's codes have one or two more digits than df1's).

Also note that df2 may contain multiple entries for the same date, with different quantities.

I want to merge the two, more specifically merge df1 onto df2, so that the result starts at 2019-01 and ends at 2021-07. In that case, Cases and Deaths would be 0 from 2019-01 through 2020-02.

How can I merge df1 onto df2 using Date, City, and State (since some cities share a name but are in different states), following the date handling described above? I would like the combined data frame, df3, to look like this:

Date Code City State Quantity x Quantity y Population Cases Deaths
2019-01 10001 Los Angeles CA 445 0 0
2019-01 10002 Sacramento CA 4450 566 0 0
2020-03 12223 Houston TX 440 4440 3500000 23 11
... ... ... ... ... ... ... ... ...
2021-07 10002 Sacramento CA 3220 NA 5444000 211 22

Edit: I was thinking that maybe I could first trim df1's date range and then do the merge. I am not sure how an outer merge behaves with dates that don't necessarily overlap. Perhaps someone has a better idea.

CodePudding user response:

It sounds like you're looking for the how keyword argument in pd.DataFrame.merge and pd.DataFrame.join.
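For instance, here is a minimal, made-up illustration (tiny frames, hypothetical keys) of how the `how` argument controls which keys survive a merge:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# "inner" keeps only keys present in both frames: just "b".
inner = left.merge(right, on="key", how="inner")

# "outer" keeps every key from either frame, filling the gaps with NaN:
# "a" (left only), "b" (both), "c" (right only).
outer = left.merge(right, on="key", how="outer")

print(inner)
print(outer)
```

The same `how` semantics apply when the non-overlapping keys are dates, which is why trimming the date ranges beforehand isn't necessary.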

Here is a sample:

import pandas as pd

df1 = pd.read_json(
    '{"Date":{"0":1583020800000,"1":1583020800000,"2":1583020800000,"3":1625097600000,"4":1625097600000,"5":1625097600000},"City":{"0":"Los Angeles","1":"Sacramento","2":"Houston","3":"Los Angeles","4":"Sacramento","5":"Houston"},"State":{"0":"CA","1":"CA","2":"TX","3":"CA","4":"CA","5":"TX"},"Population":{"0":5000000,"1":5400000,"2":3500000,"3":5000002,"4":5444000,"5":4443300},"Cases":{"0":122,"1":120,"2":23,"3":12220,"4":211,"5":2111},"Deaths":{"0":12,"1":2,"2":11,"3":2200,"4":22,"5":330}}'
)
df2 = pd.read_json(
    '{"Date":{"0":1546300800000,"1":1546300800000,"2":1546300800000,"3":1546300800000,"4":1625097600000,"5":1625097600000,"6":1625097600000},"City":{"0":"LOS ANGELES","1":"LOS ANGELES","2":"SACRAMENTO","3":"HOUSTON","4":"LOS ANGELES","5":"SACRAMENTO","6":"HOUSTON"},"State":{"0":"CA","1":"CA","2":"CA","3":"TX","4":"CA","5":"CA","6":"TX"},"Quantity x":{"0":null,"1":330.0,"2":4450.0,"3":440.0,"4":31113.0,"5":3220.0,"6":null},"Quantity y":{"0":445.0,"1":null,"2":566.0,"3":null,"4":3455.0,"5":null,"6":3200.0}}'
)

print("\ndf1 = \n", df1)
print("\ndf2 = \n", df2)

# Transform df1
df1["City"] = df1["City"].str.upper()  # to merge, casing must be consistent
df1 = df1.groupby(["Date", "City", "State"])[
    ["Cases", "Deaths"]
].sum()  # aggregate Cases + Deaths in case of duplicate keys


# Aggregate in df2
df2 = df2.groupby(["Date", "City", "State"])[
    ["Quantity x", "Quantity y"]
].sum()  # implicit skipna=True

print("\ndf1' = \n", df1)
print("\ndf2' = \n", df2)

# MERGE: merging on indices
df3 = df1.join(df2, how="outer")  # key: "how"
df3[["Cases", "Deaths"]] = (
    df3[["Cases", "Deaths"]].fillna(0).astype(int)
)  # assignment (not inplace fillna) avoids pandas' downcasting complaint

df3.reset_index(
    inplace=True
)  # Will cause ["Date", "City", "State"] to be ordinary columns, not indices.

print("\ndf3 = \n", df3)

...the output is:

df1 = 
         Date         City State  Population  Cases  Deaths
0 2020-03-01  Los Angeles    CA     5000000    122      12
1 2020-03-01   Sacramento    CA     5400000    120       2
2 2020-03-01      Houston    TX     3500000     23      11
3 2021-07-01  Los Angeles    CA     5000002  12220    2200
4 2021-07-01   Sacramento    CA     5444000    211      22
5 2021-07-01      Houston    TX     4443300   2111     330

df2 = 
         Date         City State  Quantity x  Quantity y
0 2019-01-01  LOS ANGELES    CA         NaN       445.0
1 2019-01-01  LOS ANGELES    CA       330.0         NaN
2 2019-01-01   SACRAMENTO    CA      4450.0       566.0
3 2019-01-01      HOUSTON    TX       440.0         NaN
4 2021-07-01  LOS ANGELES    CA     31113.0      3455.0
5 2021-07-01   SACRAMENTO    CA      3220.0         NaN
6 2021-07-01      HOUSTON    TX         NaN      3200.0

df1' = 
                               Cases  Deaths
Date       City        State               
2020-03-01 HOUSTON     TX        23      11
           LOS ANGELES CA       122      12
           SACRAMENTO  CA       120       2
2021-07-01 HOUSTON     TX      2111     330
           LOS ANGELES CA     12220    2200
           SACRAMENTO  CA       211      22

df2' = 
                               Quantity x  Quantity y
Date       City        State                        
2019-01-01 HOUSTON     TX          440.0         0.0
           LOS ANGELES CA          330.0       445.0
           SACRAMENTO  CA         4450.0       566.0
2021-07-01 HOUSTON     TX            0.0      3200.0
           LOS ANGELES CA        31113.0      3455.0
           SACRAMENTO  CA         3220.0         0.0

df3 = 
         Date         City State  Cases  Deaths  Quantity x  Quantity y
0 2019-01-01      HOUSTON    TX      0       0       440.0         0.0
1 2019-01-01  LOS ANGELES    CA      0       0       330.0       445.0
2 2019-01-01   SACRAMENTO    CA      0       0      4450.0       566.0
3 2020-03-01      HOUSTON    TX     23      11         NaN         NaN
4 2020-03-01  LOS ANGELES    CA    122      12         NaN         NaN
5 2020-03-01   SACRAMENTO    CA    120       2         NaN         NaN
6 2021-07-01      HOUSTON    TX   2111     330         0.0      3200.0
7 2021-07-01  LOS ANGELES    CA  12220    2200     31113.0      3455.0
8 2021-07-01   SACRAMENTO    CA    211      22      3220.0         0.0

A few other points:

  • City casing needs to be consistent at join/merge time.
  • You could also use df1.merge(df2, ..., left_index=True, right_index=True) instead of df1.join. Alternatively, reset the indices after the groupby-sum line(s) via df1.reset_index(inplace=True), etc., and then use .merge(..., on=...) (but the indices are convenient).
  • The final values of Quantity {x,y} are floats because NaNs are present. (See next point.)
  • I would be deliberate about how you treat NaNs versus auto-filled 0s. For Cases/Deaths you had no data, but you assumed that, in the absence of data, the values are 0. For the Quantity {x,y} variables, no such assumption seemed warranted.
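The column-based alternative and the NaN-vs-0 distinction can be sketched together, using tiny made-up frames rather than the full data:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2020-03", "2021-07"],
    "City": ["HOUSTON", "HOUSTON"],
    "State": ["TX", "TX"],
    "Cases": [23, 2111],
})
df2 = pd.DataFrame({
    "Date": ["2019-01", "2021-07"],
    "City": ["HOUSTON", "HOUSTON"],
    "State": ["TX", "TX"],
    "Quantity x": [440.0, None],
})

# Merge on ordinary columns instead of indices; "outer" keeps all dates.
df3 = df1.merge(df2, on=["Date", "City", "State"], how="outer")

# Deliberate choice: treat missing Cases as 0, but leave
# Quantity x NaNs untouched (no basis for assuming 0 there).
df3["Cases"] = df3["Cases"].fillna(0).astype(int)
print(df3)
```

Here the 2019-01 row gets Cases = 0 by assumption, while the Quantity x gaps stay NaN.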