Stack dataframes in Pandas vertically and horizontally-CodePudding

I have a dataframe that looks like this:

country,region,region_id,year,doy,variable_a,num_pixels
USA, Iowa,12345,2022,1,32.2,100
USA, Iowa,12345,2022,2,12.2,100
USA, Iowa,12345,2022,3,22.2,100
USA, Iowa,12345,2022,4,112.2,100
USA, Iowa,12345,2022,5,52.2,100

The year in the dataframe above is 2022. I have more dataframes for other years starting from 2010 onwards. I have also dataframes for other variables: variable_b, variable_c.

I want to combine all these dataframes into a single dataframe such that

The years are listed vertically, one below the other
the data for the different variables is listed horizontally. The output should look like this:

country,region,region_id,year,doy,variable_a,variable_b,variable_c

USA, Iowa,12345,2010,1,32.2,44,101

USA, Iowa,12345,2010,2,12.2,76,2332

... ...

USA, Iowa,12345,2022,1,321.2,444,501

USA, Iowa,12345,2022,2,122.2,756,32

What is the most efficient way to achieve this?

CodePudding user response：

IIUC, this should work for you:

data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

dfn = [df1, df2, df3]

pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)

Output:

CodePudding user response：

Use pd.concat method to do this efficiently. The method does the work by listing all the data frames in vertical order and also creates new columns for all the new variables.

Here is an example of how pd.concat works I created with duplicate data.

CODE

import pandas as pd

df1 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2022, 2022, 2022], "doy": [1, 2, 3],
                    "variable_a": [32.2, 12.2, 22.2], "num_pixles": [100, 100, 100]})

df2 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2020, 2020, 2020], "doy": [1, 2, 3],
                    "variable_b": [54.2, 62.2, 2.2], "num_pixles": [100, 100, 100]})

df_list = [df1, df2]  # list of dataframes

res = pd.concat(df_list) # concat the list of dataframes
res = res.sort_values(by="year").reset_index(drop=True)  # To make sure that the rows are sorted based on year
print(res)

OUTPUT

      country region  region_id  year  doy  variable_a  num_pixles  variable_b
0     USA   Iowa      12345  2020    1         NaN         100        54.2
1     USA   Iowa      12345  2020    2         NaN         100        62.2
2     USA   Iowa      12345  2020    3         NaN         100         2.2
3     USA   Iowa      12345  2022    1        32.2         100         NaN
4     USA   Iowa      12345  2022    2        12.2         100         NaN
5     USA   Iowa      12345  2022    3        22.2         100         NaN