pd.read_html(url) - awkward table design-CodePudding

Table headings through the table are being converted into single column headings.

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The North Coast Koala Management,Valley, Clarence Valley and Taree."
dfs = pd.read_html(url)
df = dfs[0]
df.head()

output

Be great if I could have the High preferred use as a column that assigns to the correct species. Tried reset_index() this did not work. I'm lost for searching can't find anything similar.

CodePudding user response：

The table cannot be easily parsed with read_html because of its unorthodox use of <thead> attribute. You can try luck with BeautifulSoup:

import bs4
import urllib.request

soup = bs4.BeautifulSoup(urllib.request.urlopen(url))
data = [["".join(cell.strings).strip() 
         for cell in row.findAll(['td', 'th'])] 
         for row in soup.findAll('table')[0].findAll('tr')]    
table = pd.DataFrame(data[1:])\
          .rename(columns=dict(enumerate(data[0])))\
          .dropna(how='all')

CodePudding user response：

So I took a look at the link and the table you're trying to get.

The problem with the table in the link is that it contains multiple headers so the .read_html(URL) function, gets all of them and sets those as your header:

so instead of using pandas to read the HTML I used beautiful soup for what you're trying to accomplish.

With beautiful and urllib.requests I got the HTML from the URL and extracted the HTML with the table class name

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The North Coast Koala Management,Valley, Clarence Valley and Taree."

#load html with urllib
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')


#get the table you're trying to get based
#on html elements
htmltable = soup.find('table', { 'class' : 'table-striped' })

Then using

Then you can easily reference the type of use and each value common/species name.

Here is the full code:


import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The North Coast Koala Management,Valley, Clarence Valley and Taree."

#load html with urllib
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')


#get the table you're trying to get based
#on html elements
htmltable = soup.find('table', { 'class' : 'table-striped' })


#modified function taken from: https://stackoverflow.com/a/58274853/6297478
#to fit your data shape in a way that 
#you can use. 
def tableDataText(table):    
    """Parses a html segment started with tag <table> followed 
    by multiple <tr> (table rows) and inner <td> (table data) tags. 
    It returns a list of rows with inner columns. 
    Accepts only one <th> (table header/data) in the first row.
    """
    def rowgetDataText(tr, coltag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]  
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    

    if headerow: # if there is a header row include first
        trs = trs[1:]
    for tr in trs: # for every table row

        #this part is modified
        #basically we'll get the type of 
        #used based of the second table header
        #in your url table html
        if(rowgetDataText(tr, 'th')):
          last_head = rowgetDataText(tr, 'th')

        #we'll add to the list a dict
        #that contains "common name", "species name", "type" (use type)
        if(rowgetDataText(tr, 'td')):
          row = rowgetDataText(tr, 'td')
          rows.append({headerow[0]: row[0], headerow[1]: row[1], 'type': last_head[0]})
              
    return rows

#we store our results from the function in list_table
list_table = tableDataText(htmltable)

#turn our table into a DataFrame
dftable = pd.DataFrame(list_table)
dftable

I left some comments for you in the code to help you out.

I hope this helps!

CodePudding user response：

Just in addition to @DYZ approach, using css selectors, stripped_strings and find_previous(). This will create a list of dicts that will be transformed into a dataframe:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The North Coast Koala Management,Valley, Clarence Valley and Taree."

data = []
soup = BeautifulSoup(requests.get(url).text)
for e in soup.select('table tbody tr'):
    data.append(
        dict(
            zip(
                soup.table.thead.stripped_strings,
                [e.find_previous('th').get_text(strip=True)] list(e.stripped_strings)
            )
        )
    )

pd.DataFrame(data)

	Common name	Species name	High preferred use
0	High preferred use	Grey gum	Eucalyptus biturbinata
1	High preferred use	Large-fruited grey gum	Eucalyptus canaliculata
...	...	...	...
107	Occasional use	Broad-leaved paperbark	Melaleuca quinquenervia
108	Occasional use	nan	nan