How to convert a pandas dataframe to matrix format in Python?-CodePudding

I have a pandas dataframe df which looks as shown. (df.to_dict() is given at the end):

model   scenario    module      
AIM/CGE 2.0 ADVANCE_2020_1.5C-2100  wind_total_share    high    high
SSP1-19 wind_total_share    high    high
SSP2-19 wind_total_share    high    high
AIM/CGE 2.1 CD-LINKS_NPi2020_400    wind_total_share    high    high
TERL_15D_LowCarbonTransportPolicy   wind_total_share    medium  medium
TERL_15D_NoTransportPolicy  wind_total_share    medium  medium
GCAM 4.2    SSP1-19 wind_total_share    medium  medium
IMAGE 3.0.1 IMA15-AGInt wind_total_share    low low
IMA15-Def   wind_total_share    low low
IMA15-Eff   wind_total_share    low low

model, scenario, and module are indices while Infrastructure and Investment are two columns of the dataframe. I'd like to convert these two columns into the matrix format, where Infrastructure is similar to x-axis and Investment in similar to Y-axis. Furthermore, both in Infrastructure and Investment, I need to classify them into Low, Medium and High. In the cells, I'd like to have the name of model and scenarios that satisfy the given category (low/medium/high) for given variable (Infrastructure of Investment) as shown in the screenshot below. There could also be cases when a scenario is high for Infrastructure, but low or medium for Investment and vice versa.

I also want to have another matrix, which shows the number of scenarios that fall into given cell. The second matrix matrices should look something as shown:

I am not familiar with getting such matrix format by modifying the existing pandas dataframe. Are there any functions, or module available that could be used for this purpose? How can I convert my dataframe to the matrix format as shown in the screenshots using Python?

df.to_dict() looks as follows:

{'Infrastructure': {('AIM/CGE 2.0',
   'ADVANCE_2020_1.5C-2100',
   'wind_total_share'): 'high',
  ('AIM/CGE 2.0', 'SSP1-19', 'wind_total_share'): 'high',
  ('AIM/CGE 2.0', 'SSP2-19', 'wind_total_share'): 'high',
  ('AIM/CGE 2.1', 'CD-LINKS_NPi2020_400', 'wind_total_share'): 'high',
  ('AIM/CGE 2.1',
   'TERL_15D_LowCarbonTransportPolicy',
   'wind_total_share'): 'medium',
  ('AIM/CGE 2.1', 'TERL_15D_NoTransportPolicy', 'wind_total_share'): 'medium',
  ('GCAM 4.2', 'SSP1-19', 'wind_total_share'): 'medium',
  ('IMAGE 3.0.1', 'IMA15-AGInt', 'wind_total_share'): 'low',
  ('IMAGE 3.0.1', 'IMA15-Def', 'wind_total_share'): 'low',
  ('IMAGE 3.0.1', 'IMA15-Eff', 'wind_total_share'): 'low'},
 'Investment': {('AIM/CGE 2.0',
   'ADVANCE_2020_1.5C-2100',
   'wind_total_share'): 'high',
  ('AIM/CGE 2.0', 'SSP1-19', 'wind_total_share'): 'high',
  ('AIM/CGE 2.0', 'SSP2-19', 'wind_total_share'): 'high',
  ('AIM/CGE 2.1', 'CD-LINKS_NPi2020_400', 'wind_total_share'): 'high',
  ('AIM/CGE 2.1',
   'TERL_15D_LowCarbonTransportPolicy',
   'wind_total_share'): 'medium',
  ('AIM/CGE 2.1', 'TERL_15D_NoTransportPolicy', 'wind_total_share'): 'medium',
  ('GCAM 4.2', 'SSP1-19', 'wind_total_share'): 'medium',
  ('IMAGE 3.0.1', 'IMA15-AGInt', 'wind_total_share'): 'low',
  ('IMAGE 3.0.1', 'IMA15-Def', 'wind_total_share'): 'low',
  ('IMAGE 3.0.1', 'IMA15-Eff', 'wind_total_share'): 'low'}}

CodePudding user response：

pivot_table could help here.

The 2nd matrix can be directly obtained with:

df.assign(values=1).pivot_table(values='values', index='Infrastructure',
          columns='Investment', aggfunc=sum)

which gives:

Investment      high  low  medium
Infrastructure                   
high             4.0  NaN     NaN
low              NaN  3.0     NaN
medium           NaN  NaN     3.0

For the first one, you could first reset the index to build the expected values:

temp = df.reset_index(2, drop=True).reset_index()
temp['value'] = temp['level_0']   ' '   temp['level_1']
df2 = temp.pivot_table(values = 'value', index='Infrastructure',
                       columns='Investment', aggfunc=list)

In df2 multiple identifiers are grouped in a list. If you want them one per row, you can explode the dataframe:

df2.explode('high').explode('medium').explode('low')

to get

Investment                                    high                      low                                         medium
Infrastructure                                                                                                            
high            AIM/CGE 2.0 ADVANCE_2020_1.5C-2100                      NaN                                            NaN
high                           AIM/CGE 2.0 SSP1-19                      NaN                                            NaN
high                           AIM/CGE 2.0 SSP2-19                      NaN                                            NaN
high              AIM/CGE 2.1 CD-LINKS_NPi2020_400                      NaN                                            NaN
low                                            NaN  IMAGE 3.0.1 IMA15-AGInt                                            NaN
low                                            NaN    IMAGE 3.0.1 IMA15-Def                                            NaN
low                                            NaN    IMAGE 3.0.1 IMA15-Eff                                            NaN
medium                                         NaN                      NaN  AIM/CGE 2.1 TERL_15D_LowCarbonTransportPolicy
medium                                         NaN                      NaN         AIM/CGE 2.1 TERL_15D_NoTransportPolicy
medium                                         NaN                      NaN                               GCAM 4.2 SSP1-19

CodePudding user response：

I did some brainstorming for a while and came up with the solution myself. It is long, but it looks like this.

First, I made the list of scenarios which are high, medium and low for both Infrastructure and Investment respectively:

infra_high = (df[df["Infrastructure"]=="high"].index.get_level_values(0)   " "   df[df["Infrastructure"]=="high"].index.get_level_values(1)).to_list()
infra_medium = (df[df["Infrastructure"]=="medium"].index.get_level_values(0)   " "   df[df["Infrastructure"]=="medium"].index.get_level_values(1)).to_list()
infra_low = (df[df["Infrastructure"]=="low"].index.get_level_values(0)   " "   df[df["Infrastructure"]=="low"].index.get_level_values(1)).to_list()

invest_high = (df[df["Investment"]=="high"].index.get_level_values(0)   " "   df[df["Investment"]=="high"].index.get_level_values(1)).to_list()
invest_medium = (df[df["Investment"]=="medium"].index.get_level_values(0)   " "   df[df["Investment"]=="medium"].index.get_level_values(1)).to_list()
invest_low = (df[df["Investment"]=="low"].index.get_level_values(0)   " "   df[df["Investment"]=="low"].index.get_level_values(1)).to_list()

Next, I did the intersection of the scenarios which are high, high; high, medium and high, low for Infrastructure and Investment as follows:

i1 = list(set(infra_high).intersection(set(invest_high)))
i2 = list(set(infra_high).intersection(set(invest_medium)))
i3 = list(set(infra_high).intersection(set(invest_low)))

i4 = list(set(infra_medium).intersection(set(invest_high)))
i5 = list(set(infra_medium).intersection(set(invest_medium)))
i6 = list(set(infra_medium).intersection(set(invest_low)))

i7 = list(set(infra_low).intersection(set(invest_high)))
i8 = list(set(infra_low).intersection(set(invest_medium)))
i9 = list(set(infra_low).intersection(set(invest_low)))

Next, using the Excel file, I recreated the matrix I wanted to make:

Then, I assigned the list for each cell of the dataframe as follows:

landing_zone1.loc["High","High"] = i1
landing_zone1.loc["High","Medium"] = i2
landing_zone1.loc["High","Low"] = i3

landing_zone1.loc["Medium","High"] = i4
landing_zone1.loc["Medium","Medium"] = i5
landing_zone1.loc["Medium","Low"] = i6

landing_zone1.loc["Low","High"] = i7
landing_zone1.loc["Low","Medium"] = i8
landing_zone1.loc["Low","Low"] = i9

And I did the same for matrix of number of scenarios using len() of each list.