I am learning Python and BeautifulSoup.
I am trying to do some web scraping.
Let me first describe what I am trying to do.
The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print out the
<span id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then I want to print the text of the table of banks below it. Example:
By market capitalization
Rank | Bank | Cap Rate |
---|---|---|
1 | JP Morgan | 466.1 |
2 | Bank of China | 300 |
and so on, all the way to rank 50.
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the HTML side of things, but I am completely lost. I inspected the element, and the tags that I believe I should look for are:
{section class_='mf-section-2 collapsible-block open-block'}
CodePudding user response:
Since you already know the desired header, you can print it directly. Then, with pandas, you can use a search term that is unique to the target table as a more direct way to select it:
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
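To see what `match` does without hitting the network, here is a minimal sketch on a tiny inline document. `SAMPLE_HTML` and its placeholder first table are made up for illustration; the second table reuses two rows from the real page. `pd.read_html` returns a list of every table whose text matches the pattern, so `[0]` picks the first hit. It assumes an HTML parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page: two tables, only one mentions "Market cap".
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr></thead>
  <tbody><tr><td>1</td><td>Placeholder Bank</td><td>0</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
    <tr><td>2</td><td>Industrial and Commercial Bank of China</td><td>295.65</td></tr>
  </tbody>
</table>
"""

# Only tables containing text that matches the pattern are returned.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
df = tables[0]

print(len(tables))
print(df.to_string(index=False))
```

If no table matches the pattern, `pd.read_html` raises a `ValueError`, so a search term that is too specific fails loudly rather than silently returning the wrong table.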
CodePudding user response:
You are close to your goal: find the heading, then the next table after it,
and convert that table via pandas.read_html()
to a DataFrame.
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output
By market capitalization
Rank | Bank name | Market cap(US$ billion) |
---|---|---|
1 | JPMorgan Chase | 466.21[5] |
2 | Industrial and Commercial Bank of China | 295.65 |
3 | Bank of America | 279.73 |
4 | Wells Fargo | 214.34 |
5 | China Construction Bank | 207.98 |
6 | Agricultural Bank of China | 181.49 |
7 | HSBC Holdings PLC | 169.47 |
8 | Citigroup Inc. | 163.58 |
9 | Bank of China | 151.15 |
10 | China Merchants Bank | 133.37 |
11 | Royal Bank of Canada | 113.80 |
12 | Toronto-Dominion Bank | 106.61 |
...
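Since the question is about learning BeautifulSoup, the same heading-then-next-table lookup can also be sketched without pandas. This is only an illustration under assumptions: `SAMPLE_HTML` is a trimmed-down stand-in for the page (the first table is a made-up placeholder), `table_after_heading` is a hypothetical helper name, and the built-in `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup, not the real Wikipedia HTML.
SAMPLE_HTML = """
<h2><span id="Placeholder_section">Placeholder section</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th></tr>
  <tr><td>1</td><td>Placeholder Bank</td></tr>
</table>
<h2><span id="By_market_capitalization">By market capitalization</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
  <tr><td>2</td><td>Industrial and Commercial Bank of China</td><td>295.65</td></tr>
</table>
"""


def table_after_heading(html, heading_id):
    """Return (heading text, list of rows) for the first table after the heading."""
    soup = BeautifulSoup(html, "html.parser")
    # Select the <h2> whose direct child carries the given id.
    header = soup.select_one(f"h2:has(> #{heading_id})")
    if header is None:
        raise ValueError(f"no <h2> with a child #{heading_id}")
    # find_next walks forward in document order, past the heading itself.
    table = header.find_next("table")
    if table is None:
        raise ValueError("no table after the heading")
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return header.get_text(strip=True), rows


title, rows = table_after_heading(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print(" | ".join(row))
```

On the live page the cells may contain footnote markers such as `[5]`, as in the output above, so some extra cleanup would likely be needed there.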