Pandas: Accessing multiple columns under different top level column index in Multi-index columns Dat-CodePudding

I'm having troubles figuring out the index for the headings on the table which i wanna scrape and output into a csv file, so i need the column classified under ResidualMaturity and Last and I am only able to get the main heading of the table instead of the sub. I have tried using df[('Yield', 'Last') but am only able to get that particular column and not both.

import pandas as pd
import requests

url = 'http://www.worldgovernmentbonds.com/country/japan/'
r = requests.get(url)
df_list = pd.read_html(r.text, flavor='html5lib')
df = df_list[4]
yc = df[["ResidualMaturity", "Yield"]]
print(yc)

Current output

     ResidualMaturity    Yield                   
   ResidualMaturity     Last    Chg 1M   Chg 6M
0           1 month  -0.114%    9.0 bp   7.4 bp
1          3 months  -0.109%    0.0 bp  -1.9 bp
2          6 months  -0.119%   -0.3 bp  -1.9 bp
3          9 months  -0.119%   10.0 bp   9.9 bp
4            1 year  -0.125%   -0.7 bp   0.9 bp
5           2 years  -0.121%    0.9 bp   1.3 bp
6           3 years  -0.113%    2.2 bp   2.7 bp
7           4 years  -0.094%    2.6 bp   2.1 bp
8           5 years  -0.082%    2.3 bp   1.8 bp
9           6 years  -0.056%    3.4 bp   0.4 bp
10          7 years  -0.029%    5.1 bp  -0.4 bp
11          8 years   0.007%    5.6 bp  -0.7 bp
12          9 years   0.052%    5.6 bp  -1.3 bp
13         10 years   0.087%    4.7 bp  -1.2 bp
14         15 years   0.288%    4.3 bp  -2.4 bp
15         20 years   0.460%    3.7 bp  -1.5 bp
16         30 years   0.689%    3.5 bp   1.6 bp
17         40 years   0.757%    3.5 bp   7.3 bp

Desired Output which i am trying to get

 ResidualMaturity     Last    
    0           1 month  -0.114%   
    1          3 months  -0.109%    
    2          6 months  -0.119%   
    3          9 months  -0.119%  
    4            1 year  -0.125%   
    5           2 years  -0.121%   
    6           3 years  -0.113%   
    7           4 years  -0.094%   
    8           5 years  -0.082%   
    9           6 years  -0.056%   
    10          7 years  -0.029%   
    11          8 years   0.007%   
    12          9 years   0.052%   
    13         10 years   0.087%  
    14         15 years   0.288%   
    15         20 years   0.460%   
    16         30 years   0.689%   
    17         40 years   0.757%

I have tried using df[('Yield', 'Last')] but am only able to get that particular column and not both.

CodePudding user response：

Use pd.IndexSlice together with .loc

idx = pd.IndexSlice
yc.loc[:, idx[:, ['ResidualMaturity', 'Last']]]

Or, use .loc on axis=1, as follows:

idx = pd.IndexSlice
yc.loc(axis=1)[idx[:, ['ResidualMaturity', 'Last']]]

pd.IndexSlice in this way allows us to specify the level 1 column labels without specifying level 0 column labels.

Result:

   ResidualMaturity    Yield
   ResidualMaturity     Last
0           1 month  -0.110%
1          3 months  -0.109%
2          6 months  -0.119%
3          9 months  -0.115%
4            1 year  -0.125%
5           2 years  -0.120%
6           3 years  -0.113%
7           4 years  -0.094%
8           5 years  -0.084%
9           6 years  -0.057%
10          7 years  -0.031%
11          8 years   0.005%
12          9 years   0.050%
13         10 years   0.086%
14         15 years   0.287%
15         20 years   0.461%
16         30 years   0.689%
17         40 years   0.757%

If you don't want to display level 0 column index:

idx = pd.IndexSlice
yc.loc(axis=1)[idx[:, ['ResidualMaturity', 'Last']]].droplevel(0, axis=1)

Result:

   ResidualMaturity     Last
0           1 month  -0.110%
1          3 months  -0.109%
2          6 months  -0.119%
3          9 months  -0.115%
4            1 year  -0.125%
5           2 years  -0.120%
6           3 years  -0.113%
7           4 years  -0.094%
8           5 years  -0.084%
9           6 years  -0.057%
10          7 years  -0.031%
11          8 years   0.005%
12          9 years   0.050%
13         10 years   0.086%
14         15 years   0.287%
15         20 years   0.461%
16         30 years   0.689%
17         40 years   0.757%

CodePudding user response：

Here is the output that I'm getting:

import pandas as pd
import requests

url = 'http://www.worldgovernmentbonds.com/country/japan/'
r = requests.get(url)
df_list = pd.read_html(r.text, flavor='html5lib')
df = df_list[4]
yc = df[df.columns[1:3]].droplevel(0, axis=1)
print(yc)

Output:

   ResidualMaturity     Last
0           1 month  -0.110%
1          3 months  -0.109%
2          6 months  -0.119%
3          9 months  -0.115%
4            1 year  -0.125%
5           2 years  -0.120%
6           3 years  -0.113%
7           4 years  -0.094%
8           5 years  -0.084%
9           6 years  -0.057%
10          7 years  -0.031%
11          8 years   0.005%
12          9 years   0.050%
13         10 years   0.086%
14         15 years   0.287%
15         20 years   0.461%
16         30 years   0.689%
17         40 years   0.757%