Python BeautifulSoup Wikipedia web scraping - learning

Time:05-29

I am learning Python and BeautifulSoup.

I am trying to do some webscraping:

Let me first describe what I am trying to do.

The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks

I am trying to print out the

<span  id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>

I want to print out the text: By market capitalization

Then I want to print the text of the table of banks under it. Example:

By market capitalization

Rank Bank Cap Rate
1 JP Morgan 466.1
2 Bank of China 300

all the way to 50

My code starts out like this:

from bs4 import BeautifulSoup
import requests
            
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text 
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup) 

I believe my problem is more on the HTML side of things, but I am completely lost. I inspected the element, and the tags that I believe I should look for are:

{section class_='mf-section-2 collapsible-block open-block'}

CodePudding user response:

Since you already know the desired header, you can simply print it directly. Then, with pandas, you can pass a unique search term from the target table (the match argument of read_html) as a more direct way to select it:

import pandas as pd

df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0,  drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
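To see what match does without hitting Wikipedia, here is a minimal offline sketch. The HTML snippet and the bank names in it are made up for illustration; only pd.read_html and its match argument come from the answer above. It assumes an HTML parser that pandas can use (for example lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a page that contains several tables.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>5000</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>100.0</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>90.5</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string/regex,
# so the "Total assets" table is skipped.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
df = tables[0]

print(df.to_string(index=False))
```

Wrapping the string in StringIO avoids the deprecation warning that newer pandas versions emit when literal HTML is passed to read_html.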

CodePudding user response:

You are close to your goal: find the heading, then its next table, and convert that table to a DataFrame with pandas.read_html().

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]

or

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]

Example:

from bs4 import BeautifulSoup
import requests
import pandas as pd
            
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

header = soup.select_one('h2:has(>#By_market_capitalization)')

print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))

Output:

By market capitalization

| Rank | Bank name | Market cap (US$ billion) |
|------|-----------|--------------------------|
| 1 | JPMorgan Chase | 466.21[5] |
| 2 | Industrial and Commercial Bank of China | 295.65 |
| 3 | Bank of America | 279.73 |
| 4 | Wells Fargo | 214.34 |
| 5 | China Construction Bank | 207.98 |
| 6 | Agricultural Bank of China | 181.49 |
| 7 | HSBC Holdings PLC | 169.47 |
| 8 | Citigroup Inc. | 163.58 |
| 9 | Bank of China | 151.15 |
| 10 | China Merchants Bank | 133.37 |
| 11 | Royal Bank of Canada | 113.80 |
| 12 | Toronto-Dominion Bank | 106.61 |

...
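
If you would rather stay with BeautifulSoup only, the same "find the heading, then take the next table" idea can be sketched without pandas. This is a rough sketch under assumptions: the HTML below is a made-up, heavily simplified stand-in for the real page, heading_and_rows is a hypothetical helper name, and the real Wikipedia markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Made-up, simplified stand-in for the Wikipedia page structure.
SAMPLE_HTML = """
<h2><span id="By_total_assets">By total assets</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th></tr>
  <tr><td>1</td><td>Example Bank A</td></tr>
</table>
<h2><span id="By_market_capitalization">By market capitalization</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>100.0</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>90.5</td></tr>
</table>
"""


def heading_and_rows(html, heading_id):
    """Return the heading text and the rows of the first table after it."""
    soup = BeautifulSoup(html, "html.parser")

    # Look the heading up by its id instead of by CSS classes, because
    # classes can differ between page variants.
    anchor = soup.find(id=heading_id)
    if anchor is None:
        raise ValueError(f"no element with id={heading_id!r}")

    # find_next walks forward in document order from the heading.
    table = anchor.find_next("table")
    if table is None:
        raise ValueError("no table found after the heading")

    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return anchor.get_text(strip=True), rows


title, rows = heading_and_rows(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print(" | ".join(row))
```

Searching by id alone is the design choice here: it does not depend on classes such as collapsible-block, which may not be present in the HTML that requests receives.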
