How to scrape table data with th and td with BeautifulSoup?

I am new to programming and have been practicing web scraping. I found an example where one of the columns I want in my output is stored in the table's header cells (<th>). I am able to extract all the table data (<td>) I want, but I have been unable to get the Year values to show.

from bs4 import BeautifulSoup  # parses the downloaded HTML
import requests  # downloads the web page
import pandas as pd

url = "https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")
tables = soup.find_all('table')
len(tables)
for index,table in enumerate(tables):
    if ("Global annual population growth" in str(table)):
        table_index = index
print(table_index)

print(tables[table_index].prettify())

population_data = pd.DataFrame(columns=["Year","Population","Growth"])

for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if (col !=[]):
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
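        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        new_row = pd.DataFrame([{"Population": Population, "Growth": Growth}])
        population_data = pd.concat([population_data, new_row], ignore_index=True)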
        
population_data

CodePudding user response:

You could do this with pandas directly: use pandas.read_html() to scrape the table and DataFrame.T to transpose it:

import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
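
To see what the transpose step does without hitting the network, here is a minimal sketch. The `wide` frame is a hypothetical stand-in, built by hand, for what read_html might return for a table laid out in rows (first HTML row used as column labels); the names `wide` and `tall` are just for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a "wide" table: the first HTML row became the
# column labels, and the remaining rows stayed as data.
wide = pd.DataFrame(
    [["Year", 1804, 1930, 1960], ["Years elapsed", "200,000", 126, 30]],
    columns=["Population", "1", "2", "3"],
)

tall = wide.T.reset_index()  # old column labels become the first data column
tall.columns = tall.loc[0]   # the first row now holds the real headers
tall = tall[1:]              # drop that header row from the data
print(tall)
```

After these three steps each original row has become a column, which is the shape shown in the output further down.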

Or get the same result with BeautifulSoup and stripped_strings:

import requests
import pandas as pd
from bs4 import BeautifulSoup

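# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text, 'html.parser')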

pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)

Output

Population	Year	Years elapsed
1	1804	200,000
2	1930	126
3	1960	30
4	1974	14
5	1987	13
6	1999	12
7	2011	12
8	2022	11
9	2037	15
10	2057	20

CodePudding user response:

The Year is missing because you are only scraping the <td> cells in this line:

col = row.find_all('td')

If you look at a <tr> in the browser's developer tools (F12), you can see that each row also contains a <th> tag, which holds the year and which you are not scraping. So all you have to do is add this line after the if condition:

year = row.find('th').text.strip()

After that, you can include it as "Year" in the row you append to population_data.
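
Putting that together, a minimal sketch of the loop might look like the following. It runs on a small inline table so it works offline; the sample rows are invented placeholders (not real population figures), and collecting dicts in a `rows` list is my own choice rather than part of the original code.

```python
from bs4 import BeautifulSoup

# Invented sample with the same shape as the target table:
# a header row of <th> cells, then rows with a <th> year and <td> values
html = """
<table><tbody>
  <tr><th>Year</th><th>Population</th><th>Growth</th></tr>
  <tr><th>2000</th><td>100</td><td>1.0%</td></tr>
  <tr><th>2001</th><td>110</td><td>1.1%</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.table.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row has no <td> cells, so it is skipped
        rows.append({
            "Year": row.find("th").text.strip(),  # the year lives in the <th>
            "Population": col[0].text.strip(),
            "Growth": col[1].text.strip(),
        })

print(rows)
```

From there, pd.DataFrame(rows) builds the whole table in one go, which also avoids growing the DataFrame one row at a time.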