Python BeautifulSoup Wikipedia web scraping (learning)

I am learning Python and BeautifulSoup.

I am trying to do some web scraping.

Let me first describe what I am trying to do.

The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks

I am trying to print out the text of this element:

<span  id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>

I want to print out the text: By market capitalization

Then I want to print the text of the table of banks that follows it, for example:

By market capitalization

Rank	Bank	Cap Rate
1	JP Morgan	466.1
2	Bank of China	300

and so on, all the way to rank 50.

My code starts out like this:

from bs4 import BeautifulSoup
import requests
            
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text 
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)

I believe my problem is more on the HTML side of things, but I am completely lost. I inspected the element, and the tag I believe I should look for is:

{section class_='mf-section-2 collapsible-block open-block'}
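One likely reason the find_all() call above returns an empty list: when class_ is given a string containing several class names, BeautifulSoup compares it against the exact value of the class attribute, so one missing or extra class means no match. The sketch below shows this on a small hypothetical HTML fragment (not the real Wikipedia markup). It assumes that 'open-block' is added by a script in the browser and is absent from the HTML that requests receives, which has not been verified here.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the section carries only two of the three classes
# seen in the browser inspector (assumption: 'open-block' is added by script).
SAMPLE_HTML = """
<section class="mf-section-2 collapsible-block" id="content-collapsible-block-1">
  <p>table would go here</p>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-class string must equal the whole class attribute, so this finds nothing.
exact = soup.find_all("section", class_="mf-section-2 collapsible-block open-block")

# Matching on a single class, on a CSS selector, or on the id is more tolerant.
by_one_class = soup.find_all("section", class_="collapsible-block")
by_selector = soup.select("section.mf-section-2.collapsible-block")
by_id = soup.find(id="content-collapsible-block-1")

print(len(exact), len(by_one_class), len(by_selector), by_id is not None)
```

Note also that the code above requests en.wikipedia.org while the inspected page is en.m.wikipedia.org, and the two may not serve the same markup.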

CodePudding user response：

Since you already know the desired header, you can print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct way to select it:

import pandas as pd

df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0,  drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
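To see what the match argument does without hitting the network, here is a minimal sketch on a hypothetical two-table HTML string. The bank names and numbers are made up, and the helper name tables_matching is only illustrative. It assumes an HTML parser backend for pandas (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second mentions "Market cap".
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>100.0</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>50.5</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>40.25</td></tr>
</table>
"""


def tables_matching(html, text):
    """Return every table in html that contains text matching the given pattern."""
    # StringIO wraps the literal HTML so pandas treats it as a file-like object.
    return pd.read_html(StringIO(html), match=text)


tables = tables_matching(SAMPLE_HTML, "Market cap")
df = tables[0]

print(len(tables))          # number of tables that contained the search term
print(df.columns.tolist())  # header row of the selected table
print(df)
```

Because read_html always returns a list, the [0] picks the first matching table; a search term that appears in only one table keeps that choice unambiguous.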

CodePudding user response：

You were close to your goal. Find the heading and then its next table, and transform that table into a DataFrame via pandas.read_html().

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]

Or select the table directly by matching text that it contains:

pd.read_html(html_text, match='Market cap')[0]

Example

from bs4 import BeautifulSoup
import requests
import pandas as pd
            
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

header = soup.select_one('h2:has(>#By_market_capitalization)')

print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))

Output

By market capitalization

Rank	Bank name	Market cap(US$ billion)
1	JPMorgan Chase	466.21[5]
2	Industrial and Commercial Bank of China	295.65
3	Bank of America	279.73
4	Wells Fargo	214.34
5	China Construction Bank	207.98
6	Agricultural Bank of China	181.49
7	HSBC Holdings PLC	169.47
8	Citigroup Inc.	163.58
9	Bank of China	151.15
10	China Merchants Bank	133.37
11	Royal Bank of Canada	113.80
12	Toronto-Dominion Bank	106.61

...
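
The heading-then-next-table idea from this answer can also be tried offline. The following is a minimal sketch on a hypothetical HTML string that mimics the h2 > span structure assumed by the selector above; the helper name table_after_heading and the sample data are made up for illustration, and the rows are read with BeautifulSoup alone rather than pandas.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two headings, each followed (not necessarily directly) by a table.
SAMPLE_HTML = """
<h2><span id="By_total_assets">By total assets</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th></tr>
  <tr><td>1</td><td>Example Bank A</td></tr>
</table>
<h2><span id="By_market_capitalization">By market capitalization</span></h2>
<p>Some introductory text.</p>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>50.5</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>40.25</td></tr>
</table>
"""


def table_after_heading(html, heading_id):
    """Return (heading text, table rows) for the first table after the given heading."""
    soup = BeautifulSoup(html, "html.parser")
    # Select the h2 that has a direct child with the given id.
    header = soup.select_one(f"h2:has(> #{heading_id})")
    if header is None:
        return None, []
    # find_next() walks forward in document order, skipping the paragraph in between.
    table = header.find_next("table")
    if table is None:
        return header.get_text(strip=True), []
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return header.get_text(strip=True), rows


title, rows = table_after_heading(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print("\t".join(row))
```

If the real page wraps its headings differently from this sample, the selector returns None, which is why the helper checks for that case before calling find_next().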