Stack dataframes in Pandas vertically and horizontally with overlapping years

I have the following 3 dataframes:

data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data2 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2020, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data3 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

import pandas as pd

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

I want to combine them following the logic in this earlier question: "Stack dataframes in Pandas vertically and horizontally".

However, in that question the years in df1, df2 and df3 were all different. Here some years overlap, so that solution does not work. How do I achieve this?

NOTE:

When I try this:

dfn = [df1, df2, df3]

pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)

The output is:

 country region  region_id  year  doy  variable_a  num_pixels  variable_b  variable_c1  variable_c2
0      USA   Iowa      12345  2020    1        32.2         100         NaN          NaN          NaN
1      USA   Iowa      12345  2020    2        12.2         100         NaN          NaN          NaN
2      USA   Iowa      12345  2020    3        22.2         100         NaN          NaN          NaN
3      USA   Iowa      12345  2020    4       112.2         100         NaN          NaN          NaN
4      USA   Iowa      12345  2020    5        52.2         100         NaN          NaN          NaN
5      USA   Iowa      12345  2020    2         NaN         100        12.2          NaN          NaN
6      USA   Iowa      12345  2020    1         NaN         100         NaN         32.2         32.2
7      USA   Iowa      12345  2020    2         NaN         100         NaN         12.2         12.2
8      USA   Iowa      12345  2020    3         NaN         100         NaN         22.2         22.2
9      USA   Iowa      12345  2020    4         NaN         100         NaN        112.2        112.2
10     USA   Iowa      12345  2020    5         NaN         100         NaN         52.2         52.2
11     USA   Iowa      12345  2021    1         NaN         100        32.2          NaN          NaN
12     USA   Iowa      12345  2021    3         NaN         100        22.2          NaN          NaN
13     USA   Iowa      12345  2021    4         NaN         100       112.2          NaN          NaN
14     USA   Iowa      12345  2021    5         NaN         100        52.2          NaN          NaN

Note that rows for the same year and doy are repeated (for example, year 2020 with doy 1 and 2). I want that repetition to go away, so that the values of all variables sit in a single row for each year and doy.

CodePudding user response:

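What you want is combine_first. Calling df1.combine_first(df2) fills the NA cells in df1 with the matching cells from df2, and also adds any rows and columns that exist only in df2. The two dataframes are matched on their indexes, so the key columns have to be set as the index first.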
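To make the matching rule concrete, here is a minimal sketch on two tiny frames. The frames `left` and `right` and their labels are hypothetical and only serve as an illustration; they are not the data from the question.

```python
import pandas as pd

# Hypothetical toy frames that share part of their index
left = pd.DataFrame({"a": [1.0, None, 3.0]}, index=["x", "y", "z"])
right = pd.DataFrame(
    {"a": [10.0, 30.0, 40.0], "b": [5.0, 6.0, 7.0]},
    index=["y", "z", "w"],
)

out = left.combine_first(right)

# "y": left had NaN in column a, so the value comes from right
# "z": both frames have a value in column a, so left's value is kept
# "w": the row exists only in right, so it is added
# "x": the row exists only in left, so column b stays NaN there
print(out)
```

The caller's non-NA values always take precedence, so the order of the frames matters when they disagree on a shared column. The loop below applies the same call to the three frames from the question.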

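# Key columns that identify a row; the frames are aligned on these
cols = ["year", "country", "region", "doy"]
result = None
for frame in [df1, df2, df3]:
    if result is None:
        result = frame.set_index(cols)
    else:
        # Fill gaps in result from frame; existing non-NA values are kept
        result = result.combine_first(frame.set_index(cols))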

Result:

                         num_pixels  region_id  variable_a  variable_b  variable_c1  variable_c2
year country region doy                                                                         
2020 USA      Iowa  1           100      12345        32.2         NaN         32.2         32.2
                    2           100      12345        12.2        12.2         12.2         12.2
                    3           100      12345        22.2         NaN         22.2         22.2
                    4           100      12345       112.2         NaN        112.2        112.2
                    5           100      12345        52.2         NaN          NaN          NaN
              Iowc  5           100      12345         NaN         NaN         52.2         52.2
2021 USA      Iowa  1           100      12345         NaN        32.2          NaN          NaN
                    3           100      12345         NaN        22.2          NaN          NaN
                    4           100      12345         NaN       112.2          NaN          NaN
                    5           100      12345         NaN        52.2          NaN          NaN
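The same fold can be written more compactly with functools.reduce, followed by reset_index to turn the key columns back into ordinary columns. The sketch below is one possible variant, not the only way to do it. It uses two small hypothetical frames `a` and `b` (not the question's data) that overlap on one key, so it runs on its own.

```python
from functools import reduce

import pandas as pd

keys = ["year", "country", "region", "doy"]

# Hypothetical frames: both contain the key (2020, "USA", "Iowa", 2)
a = pd.DataFrame({
    "year": [2020, 2020],
    "country": ["USA", "USA"],
    "region": ["Iowa", "Iowa"],
    "doy": [1, 2],
    "variable_a": [32.2, 12.2],
})
b = pd.DataFrame({
    "year": [2020, 2021],
    "country": ["USA", "USA"],
    "region": ["Iowa", "Iowa"],
    "doy": [2, 1],
    "variable_b": [12.2, 32.2],
})

# Index every frame by the key columns, then fold them together pairwise
indexed = [frame.set_index(keys) for frame in (a, b)]
combined = reduce(lambda left, right: left.combine_first(right), indexed)

# Move the key columns out of the index again: one row per unique key
flat = combined.reset_index()
print(flat)
```

Because region is part of the key, values that differ only in spelling are treated as different rows. That is why the result above has a separate row for ' Iowc': data3 spells the region that way in its last row.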