Python web scrape url to dataframe

Time:09-28

I want to web scrape a website (https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp) and create a dataframe. This is the dataframe that I want:

name                   text
M. le président        La séance est...
M. le président        L'ordre du jour...
M. Jean-Marc Ayrault   Je demande la ...

Initially I thought that I should use BeautifulSoup, and I started to write the following code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r=requests.get(url)
soup_data=BeautifulSoup(r.text, 'html.parser')
first=soup_data.find_all('div')
name=first.b.text

But I obtained the error:

AttributeError: ResultSet object has no attribute 'b'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Because I could not get any further, I thought the best idea was to download the HTML and work with it as if it were an XML file:

import urllib
import xml.etree.ElementTree as ET
import pandas as pd
import lxml
from lxml import etree

urllib.request.urlretrieve("https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp", "file.txt")

d = {'head': ['title'],
    'body':['b', 'p']}

tree = ET.parse("file.txt")
root = tree.getroot()
# initialize two lists: `cols` and `data`
cols, data = list(), list()
# loop through d.items
for k, v in d.items():
    # find child
    child = root.find(f'{{*}}{k}')
    # use iter to check each descendant (`elem`)
    for elem in child.iter():
        # get `tag_end` for each descendant, e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
        tag_end = elem.tag.split('}')[-1]
        # check if `tag_end` in `v(alue)`
        if tag_end in v:
            # add `tag_end` and `elem.text` to appropriate list
            cols.append(tag_end)
            data.append(elem.text)
df = pd.DataFrame(data).T

But I obtain the error: "not well-formed (invalid token)". Here is a summary of the html:

<html>
 <head>
  <title> Assemblée Nationale - Séance du mercredi ... </title>
 </head>
 <body>
  <div id="englobe">
   <p>
    <orateur>
     <b> M. le président </b>
    </orateur>
     La séance est...
   </p>
   <p>
    <orateur>
     <b> M. le président </b>
    </orateur>
     L'ordre du jour...
   </p>
  </div>
 </body>
</html>

How should I scrape this website? I want to do the same for several similar websites.

CodePudding user response:

Your approach with BeautifulSoup is definitely the way to go. The error message already points to the cause: what you call first is of type bs4.element.ResultSet, which, as the name suggests, is not a single element but a collection of them. The easiest way to access the actual results is to iterate over it with a for loop.
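To make the difference concrete, here is a minimal sketch of find_all() versus find(). The inline HTML string is a made-up sample that mirrors the summary in the question, not the real page, so treat the exact structure as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the HTML summary in the question (not the real page).
html = """
<div id="englobe">
  <p><orateur><b> M. le président </b></orateur> La séance est...</p>
  <p><orateur><b> M. Jean-Marc Ayrault </b></orateur> Je demande la ...</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")   # ResultSet: a list-like collection of Tag objects
first_div = soup.find("div")  # a single Tag (or None if nothing matches)

# Attribute access such as .b works on a single Tag, not on the ResultSet.
first_name = first_div.b.get_text(strip=True)

# Looping over the ResultSet yields one Tag at a time.
names = [b.get_text(strip=True) for div in divs for b in div.find_all("b")]
```

Calling divs.b here would raise the same AttributeError as in the question, because divs is the collection rather than one of its elements.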

I'm not sure you really need to go through the div elements, since what you're looking for are the p elements that contain an orateur element (long story short: the outer for loop is unnecessary and this could be simplified considerably). In any case, here's how you can access the elements you want:

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r=requests.get(url)
soup_data=BeautifulSoup(r.text, 'html.parser')
list_soup_div=soup_data.find_all('div')

for item_soup_div in list_soup_div:
    list_soup_sub_orateur = item_soup_div.find_all('orateur')
    # Check on whether the <div> contains an <orateur> element
    if len(list_soup_sub_orateur):
        for item_soup_p in item_soup_div.find_all('p'):
            list_orateur = item_soup_p.find_all('orateur')
            if len(list_orateur):
                print(item_soup_p)

After that, you only need to filter out the lines with links, append the results to a dictionary, and convert it to a dataframe, and voilà, there's your desired dataframe. Hope that helps! If not, do let me know.
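For what it's worth, the remaining steps might look roughly like the sketch below. It runs on a made-up HTML string shaped like the summary in the question, and the helper name speeches_to_dataframe is my own invention. It collects rows in a list of dictionaries rather than a single dictionary, and it leaves out the link filtering because the summary does not show what those lines look like:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the HTML summary in the question (not the real page).
html = """
<div id="englobe">
  <p>Paragraph without a speaker</p>
  <p>
    <orateur><b> M. le président </b></orateur>
    La séance est...
  </p>
  <p>
    <orateur><b> M. Jean-Marc Ayrault </b></orateur>
    Je demande la ...
  </p>
</div>
"""


def speeches_to_dataframe(html):
    """Return a DataFrame with one row per <p> that contains an <orateur>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p"):
        orateur = p.find("orateur")
        if orateur is None:
            continue  # skip paragraphs that have no speaker
        name = orateur.get_text(" ", strip=True)
        # Detach the speaker tag so that only the speech text remains in <p>.
        orateur.extract()
        text = p.get_text(" ", strip=True)
        rows.append({"name": name, "text": text})
    return pd.DataFrame(rows, columns=["name", "text"])


df = speeches_to_dataframe(html)
```

For the real page you would pass r.text from the requests call above instead of the sample string. The actual markup may differ from the summary, so the selectors might need adjusting.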

CodePudding user response:

The find_all method finds all elements that match your filter (the div tags in your example) and returns them as a collection, so you can't extract the text of all elements at once.

You just have to loop over the results, extract the text of each element, store the texts in a list, and then add that list as a column in your dataframe, like this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.DataFrame()
names_list = []

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"

r=requests.get(url)

soup_data=BeautifulSoup(r.text, 'html.parser')

names=soup_data.find_all('div')

for name in names:
    names_list.append(name.text)

df['name']=names_list
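Note that looping over every div collects the full text of each div rather than just the speaker names. A variant of the same loop that targets the orateur tags instead could look like the sketch below. It uses a made-up sample string in place of r.text, assuming the page matches the summary in the question:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample standing in for r.text (not the real page).
html = """
<div id="englobe">
  <p><orateur><b> M. le président </b></orateur> La séance est...</p>
  <p><orateur><b> M. le président </b></orateur> L'ordre du jour...</p>
</div>
"""

soup_data = BeautifulSoup(html, "html.parser")

names_list = []
# Same loop-and-append pattern, but over <orateur> tags instead of <div> tags.
for orateur in soup_data.find_all("orateur"):
    names_list.append(orateur.get_text(strip=True))

df = pd.DataFrame({"name": names_list})
```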