Related to Stack dataframes in Pandas vertically and horizontally
I have the following 3 dataframes:
import pandas as pd

data1 = {
'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2021, 1: 2020, 2: 2021, 3: 2021, 4: 2021},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowc'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
I want to combine them as per the logic here: Stack dataframes in Pandas vertically and horizontally
However, in that question the years were different in df1, df2 and df3. In this question there is some overlap in years, so that solution does not work here. How do I achieve this?
NOTE:
When I try this:
dfn = [df1, df2, df3]
pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
The output is:
    country  region  region_id  year  doy  variable_a  num_pixels  variable_b  variable_c1  variable_c2
 0      USA    Iowa      12345  2020    1        32.2         100         NaN          NaN          NaN
 1      USA    Iowa      12345  2020    2        12.2         100         NaN          NaN          NaN
 2      USA    Iowa      12345  2020    3        22.2         100         NaN          NaN          NaN
 3      USA    Iowa      12345  2020    4       112.2         100         NaN          NaN          NaN
 4      USA    Iowa      12345  2020    5        52.2         100         NaN          NaN          NaN
 5      USA    Iowa      12345  2020    2         NaN         100        12.2          NaN          NaN
 6      USA    Iowa      12345  2020    1         NaN         100         NaN         32.2         32.2
 7      USA    Iowa      12345  2020    2         NaN         100         NaN         12.2         12.2
 8      USA    Iowa      12345  2020    3         NaN         100         NaN         22.2         22.2
 9      USA    Iowa      12345  2020    4         NaN         100         NaN        112.2        112.2
10      USA    Iowa      12345  2020    5         NaN         100         NaN         52.2         52.2
11      USA    Iowa      12345  2021    1         NaN         100        32.2          NaN          NaN
12      USA    Iowa      12345  2021    3         NaN         100        22.2          NaN          NaN
13      USA    Iowa      12345  2021    4         NaN         100       112.2          NaN          NaN
14      USA    Iowa      12345  2021    5         NaN         100        52.2          NaN          NaN
Note that year 2020 and doy 1 and 2 are repeated. I want that repetition to go away and instead have all values from all variables in a single row for the given year and doy.
CodePudding user response:
What you want is combine_first. When you call df1.combine_first(df2), it fills NA cells in df1 with the matching cells in df2. The two data frames are matched on their indexes, and the result covers the union of both indexes and both sets of columns.
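To make the index matching described above concrete, here is a minimal sketch on two toy frames. The names left and right and their values are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frames: "y" is the only index label they share.
left = pd.DataFrame({"a": [1.0, None]}, index=["x", "y"])
right = pd.DataFrame({"a": [10.0, 20.0], "b": [5.0, 6.0]}, index=["y", "z"])

# Values in `left` win; its NA cell at ("y", "a") is filled from `right`.
# Rows and columns that exist only in `right` are added to the result.
combined = left.combine_first(right)
print(combined)
```

The cell at row "y", column "a" is missing in left, so it is taken from right. Row "z" and column "b" exist only in right and are carried over, with NaN wherever neither frame has a value.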
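cols = ["year", "country", "region", "doy"]  # key columns that identify a row
result = None
for frame in [df1, df2, df3]:
    if result is None:
        # The first frame seeds the result
        result = frame.set_index(cols)
    else:
        # Fill NA cells and add new rows/columns from the next frame
        result = result.combine_first(frame.set_index(cols))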
Result:
                         num_pixels  region_id  variable_a  variable_b  variable_c1  variable_c2
year country region doy
2020 USA     Iowa   1           100      12345        32.2         NaN         32.2         32.2
                    2           100      12345        12.2        12.2         12.2         12.2
                    3           100      12345        22.2         NaN         22.2         22.2
                    4           100      12345       112.2         NaN        112.2        112.2
                    5           100      12345        52.2         NaN          NaN          NaN
             Iowc   5           100      12345         NaN         NaN         52.2         52.2
2021 USA     Iowa   1           100      12345         NaN        32.2          NaN          NaN
                    3           100      12345         NaN        22.2          NaN          NaN
                    4           100      12345         NaN       112.2          NaN          NaN
                    5           100      12345         NaN        52.2          NaN          NaN
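If a flat table is preferred over the multi-level index shown above, the key columns can be moved back with reset_index. The sketch below is one possible variant, not the only way to do it: it uses functools.reduce in place of the explicit loop, and small stand-in frames (df_a, df_b, df_c, with only year and doy as keys) that are assumptions made to keep the example short.

```python
from functools import reduce

import pandas as pd

keys = ["year", "doy"]  # assumed key columns for this reduced example

# Hypothetical stand-ins for df1, df2, df3 with overlapping (year, doy) pairs.
df_a = pd.DataFrame({"year": [2020, 2020], "doy": [1, 2], "variable_a": [32.2, 12.2]})
df_b = pd.DataFrame({"year": [2021, 2020], "doy": [1, 2], "variable_b": [32.2, 12.2]})
df_c = pd.DataFrame({"year": [2020, 2020], "doy": [1, 2], "variable_c1": [32.2, 12.2]})

# Index every frame on the keys, then fold them together with combine_first.
indexed = [frame.set_index(keys) for frame in (df_a, df_b, df_c)]
result = reduce(lambda acc, frame: acc.combine_first(frame), indexed)

# Turn the index levels back into ordinary columns.
flat = result.reset_index()
print(flat)
```

This relies on the key columns identifying each row uniquely within every input frame, so it is worth checking for duplicate keys before combining.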