I have two dataframes with time series data whose values are a list. I want to merge them into a single dataframe, resulting in the concatenation of the two lists. The tricky part is one dataframe contains more data points than the other. When I do a "sum" of the two dataframes, for the dates that do not overlap, the values come out as NaN, instead of keeping the original value. To provide example, df1 contains the following:
2021-09-16 [AAA,BBB]
2021-09-17 [CCC]
2021-09-18 [DDD]
df2 contains the following:
2021-09-16 [EEE]
2021-09-17 [FFF]
2021-09-18 []
2021-09-19 [GGG]
2021-09-20 [HHH,III]
When I do df1 df2, I get:
2021-09-16 [AAA,BBB,EEE]
2021-09-17 [CCC,FFF]
2021-09-18 [DDD]
2021-09-19 NaN
2021-09-20 NaN
Instead, I want 2021-09-19 and 2021-09-20 to show [GGG] and [HHH,III], respectively. What's the solution?
CodePudding user response:
Setup
>>> df1
col
2021-09-16 [AAA, BBB]
2021-09-17 [CCC]
2021-09-18 [DDD]
>>> df2
col
2021-09-16 [EEE]
2021-09-17 [FFF]
2021-09-18 []
2021-09-19 [GGG]
2021-09-20 [HHH, III]
Solution: DataFrame.align
Align the two dataframes along the index axis and fill the values of missing index with empty list []
, then we can easily add the two dataframes using df1 df2
d1, d2 = df1.align(df2, fill_value=[]))
print(d1 d2)
col
2021-09-16 [AAA, BBB, EEE]
2021-09-17 [CCC, FFF]
2021-09-18 [DDD, ]
2021-09-19 [GGG]
2021-09-20 [HHH, III]