I'm having troubles figuring out the index for the headings on the table which i wanna scrape and output into a csv file, so i need the column classified under ResidualMaturity
and Last
and I am only able to get the main heading of the table instead of the sub. I have tried using df[('Yield', 'Last')
but am only able to get that particular column and not both.
import pandas as pd
import requests
url = 'http://www.worldgovernmentbonds.com/country/japan/'
r = requests.get(url)
df_list = pd.read_html(r.text, flavor='html5lib')
df = df_list[4]
yc = df[["ResidualMaturity", "Yield"]]
print(yc)
Current output
ResidualMaturity Yield
ResidualMaturity Last Chg 1M Chg 6M
0 1 month -0.114% 9.0 bp 7.4 bp
1 3 months -0.109% 0.0 bp -1.9 bp
2 6 months -0.119% -0.3 bp -1.9 bp
3 9 months -0.119% 10.0 bp 9.9 bp
4 1 year -0.125% -0.7 bp 0.9 bp
5 2 years -0.121% 0.9 bp 1.3 bp
6 3 years -0.113% 2.2 bp 2.7 bp
7 4 years -0.094% 2.6 bp 2.1 bp
8 5 years -0.082% 2.3 bp 1.8 bp
9 6 years -0.056% 3.4 bp 0.4 bp
10 7 years -0.029% 5.1 bp -0.4 bp
11 8 years 0.007% 5.6 bp -0.7 bp
12 9 years 0.052% 5.6 bp -1.3 bp
13 10 years 0.087% 4.7 bp -1.2 bp
14 15 years 0.288% 4.3 bp -2.4 bp
15 20 years 0.460% 3.7 bp -1.5 bp
16 30 years 0.689% 3.5 bp 1.6 bp
17 40 years 0.757% 3.5 bp 7.3 bp
Desired Output which i am trying to get
ResidualMaturity Last
0 1 month -0.114%
1 3 months -0.109%
2 6 months -0.119%
3 9 months -0.119%
4 1 year -0.125%
5 2 years -0.121%
6 3 years -0.113%
7 4 years -0.094%
8 5 years -0.082%
9 6 years -0.056%
10 7 years -0.029%
11 8 years 0.007%
12 9 years 0.052%
13 10 years 0.087%
14 15 years 0.288%
15 20 years 0.460%
16 30 years 0.689%
17 40 years 0.757%
I have tried using df[('Yield', 'Last')]
but am only able to get that particular column and not both.
CodePudding user response:
Use pd.IndexSlice
together with .loc
idx = pd.IndexSlice
yc.loc[:, idx[:, ['ResidualMaturity', 'Last']]]
Or, use .loc
on axis=1
, as follows:
idx = pd.IndexSlice
yc.loc(axis=1)[idx[:, ['ResidualMaturity', 'Last']]]
pd.IndexSlice
in this way allows us to specify the level 1 column labels without specifying level 0 column labels.
Result:
ResidualMaturity Yield
ResidualMaturity Last
0 1 month -0.110%
1 3 months -0.109%
2 6 months -0.119%
3 9 months -0.115%
4 1 year -0.125%
5 2 years -0.120%
6 3 years -0.113%
7 4 years -0.094%
8 5 years -0.084%
9 6 years -0.057%
10 7 years -0.031%
11 8 years 0.005%
12 9 years 0.050%
13 10 years 0.086%
14 15 years 0.287%
15 20 years 0.461%
16 30 years 0.689%
17 40 years 0.757%
If you don't want to display level 0 column index:
idx = pd.IndexSlice
yc.loc(axis=1)[idx[:, ['ResidualMaturity', 'Last']]].droplevel(0, axis=1)
Result:
ResidualMaturity Last
0 1 month -0.110%
1 3 months -0.109%
2 6 months -0.119%
3 9 months -0.115%
4 1 year -0.125%
5 2 years -0.120%
6 3 years -0.113%
7 4 years -0.094%
8 5 years -0.084%
9 6 years -0.057%
10 7 years -0.031%
11 8 years 0.005%
12 9 years 0.050%
13 10 years 0.086%
14 15 years 0.287%
15 20 years 0.461%
16 30 years 0.689%
17 40 years 0.757%
CodePudding user response:
Here is the output that I'm getting:
import pandas as pd
import requests
url = 'http://www.worldgovernmentbonds.com/country/japan/'
r = requests.get(url)
df_list = pd.read_html(r.text, flavor='html5lib')
df = df_list[4]
yc = df[df.columns[1:3]].droplevel(0, axis=1)
print(yc)
Output:
ResidualMaturity Last
0 1 month -0.110%
1 3 months -0.109%
2 6 months -0.119%
3 9 months -0.115%
4 1 year -0.125%
5 2 years -0.120%
6 3 years -0.113%
7 4 years -0.094%
8 5 years -0.084%
9 6 years -0.057%
10 7 years -0.031%
11 8 years 0.005%
12 9 years 0.050%
13 10 years 0.086%
14 15 years 0.287%
15 20 years 0.461%
16 30 years 0.689%
17 40 years 0.757%