Web Scraping using Beautiful soup and executing multiple functions to add to a list

Time:03-17

I'm fairly new to Python and I'm trying to scrape Facebook.

I have created a function for each section I want to extract, e.g. the poster name, the caption, etc.

Here is the main part of the code:

FacebookPosts = []

source_data = driver.page_source
bs_data = bs(source_data, 'html.parser')

NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})

def _extract_post_name(bs_data):
    postername = ""
    actualPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})
    for posts in actualPosts:
        postername = posts.find('strong').text
        #postername.append(paragraphs)
    return postername



def _extract_post_caption(bs_data):
    captionblocks = bs_data.find_all('div', {"class": re.compile('^ii04i59q')})
    captions = ""
    for captiondivs in captionblocks:
        caption = captiondivs.find('div', attrs = {'style':'text-align: start;'}).text
        #captions.append(caption)
    return caption



for posts in NumberofPosts:
    post = {
            'Original Poster:' :  _extract_post_name(bs_data),
            'Caption:'         :  _extract_post_caption(bs_data),
            }
    FacebookPosts.append(post)

print(FacebookPosts)

I have other functions for more extraction, but I'll keep it short for simplicity.

The issue at the moment is that, with this method, every entry in the list shows the same single post. When I run the code from inside the functions directly, without wrapping it in functions, it prints multiple different values. I know I could append to a list inside the functions, but that would create another issue.

Ultimately what I would like to extract is:

FacebookPosts{
Post1{
Poster Name : Steve
Caption : Text inside Caption
}

Post2{
Poster Name : Bob
Caption : Please Help me
}
}

What's being extracted now is:

    FacebookPosts{
    Poster Name : Steve
    Caption : Text inside Caption
    }
    Poster Name : Steve
    Caption : Text inside Caption
}

The same entry is repeated for every element found in NumberofPosts.

Any help is greatly appreciated, I've been stuck on this problem for days.

I believe my problem is a lack of knowledge about functions and dictionaries/lists. For example, how do you add values from multiple sources, such as functions, to a dictionary so that they end up in the same entry?

CodePudding user response:

Oh I think this might be a simple fix brother.

for posts in NumberofPosts:
    post = {
            'Original Poster:' :  _extract_post_name(bs_data),
            'Caption:'         :  _extract_post_caption(bs_data),
            }
FacebookPosts.append(post)

print(FacebookPosts)

The issue here is that you need to put FacebookPosts.append(post) inside the for block; otherwise you're only appending the last post.

for posts in NumberofPosts:
    post = {
            'Original Poster:' :  _extract_post_name(bs_data),
            'Caption:'         :  _extract_post_caption(bs_data),
            }
    FacebookPosts.append(post)

print(FacebookPosts)

^That should fix it if I'm not mistaken.
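To see what the indentation of the append changes, here is a minimal sketch with plain Python data and no scraping. The list `names` and the two result lists are made up for illustration; they just stand in for the scraped posts.

```python
# Hypothetical stand-in data; in the real script these would come from the page.
names = ["Steve", "Bob"]

# append() outside the loop: it runs once, after the loop has finished,
# so only the dict built in the final iteration is stored.
outside = []
for name in names:
    post = {"Original Poster": name}
outside.append(post)

# append() inside the loop: it runs once per iteration,
# so every dict is stored.
inside = []
for name in names:
    post = {"Original Poster": name}
    inside.append(post)

print(outside)  # [{'Original Poster': 'Bob'}]
print(inside)   # [{'Original Poster': 'Steve'}, {'Original Poster': 'Bob'}]
```

Note that this only controls how many entries end up in the list. If every entry still holds the same values, the functions building the dict are the place to look.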

CodePudding user response:

I solved the issue. Basically, I had to change NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')}). That selector was matching the h2 headers, which only contain the name of the poster. It has now been changed to bs_data.find_all('div', {"class": 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0'}), which matches the wrapper of each post. I'll leave the post here just in case someone needs the code. Thanks for the help.
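For anyone adapting this, here is a rough sketch of the per-post idea: select one wrapper element per post, then hand each wrapper (rather than the whole page) to the extraction functions. The HTML and the class names `post` and `caption` are invented for illustration and are not Facebook's real markup, and the function names are simplified versions of the ones in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page source.
html = """
<div class="post"><h2><strong>Steve</strong></h2><div class="caption">Text inside Caption</div></div>
<div class="post"><h2><strong>Bob</strong></h2><div class="caption">Please Help me</div></div>
"""


def extract_post_name(post):
    """Return the poster name found inside a single post wrapper."""
    strong = post.find("strong")
    return strong.get_text(strip=True) if strong else ""


def extract_post_caption(post):
    """Return the caption text found inside a single post wrapper."""
    caption = post.find("div", class_="caption")
    return caption.get_text(strip=True) if caption else ""


soup = BeautifulSoup(html, "html.parser")

facebook_posts = []
# Each iteration sees only one post's subtree, so the helpers
# cannot pick up values belonging to a different post.
for post in soup.find_all("div", class_="post"):
    facebook_posts.append({
        "Original Poster": extract_post_name(post),
        "Caption": extract_post_caption(post),
    })

print(facebook_posts)
```

Because each helper searches only within the element it is given, the dictionary for a post is filled from that post alone, and a missing name or caption becomes an empty string instead of raising an error.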
