I'm fairly new to Python and I'm trying to webscrape Facebook.
I have created a function for each section I want to extract, e.g. the poster name, the caption, etc.
Here is the main part of the code:
import re
from bs4 import BeautifulSoup as bs

FacebookPosts = []
# driver is assumed to be a Selenium WebDriver that has already loaded the page
source_data = driver.page_source
bs_data = bs(source_data, 'html.parser')
NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})

def _extract_post_name(bs_data):
    postername = ""
    actualPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})
    for posts in actualPosts:
        postername = posts.find('strong').text
        #postername.append(paragraphs)
    return postername

def _extract_post_caption(bs_data):
    captionblocks = bs_data.find_all('div', {"class": re.compile('^ii04i59q')})
    captions = ""
    for captiondivs in captionblocks:
        caption = captiondivs.find('div', attrs = {'style':'text-align: start;'}).text
        #captions.append(caption)
    return caption

for posts in NumberofPosts:
    post = {
        'Original Poster:' : _extract_post_name(bs_data),
        'Caption:' : _extract_post_caption(bs_data),
    }
FacebookPosts.append(post)
print(FacebookPosts)
I have other functions for more extraction, but I'll keep it small for simplicity.
The issue at the moment is that, with this method, only one entry is shown in the dictionary, and it is always the same one. When I run the body of a function directly, outside the function, it prints multiple values. I know I can append to the list, but then there would be another issue.
Ultimately, what I would like to extract is:
FacebookPosts {
    Post1: {
        Poster Name : Steve
        Caption : Text inside Caption
    }
    Post2: {
        Poster Name : Bob
        Caption : Please Help me
    }
}
What's being extracted now is:
FacebookPosts {
    {
        Poster Name : Steve
        Caption : Text inside Caption
    }
    {
        Poster Name : Steve
        Caption : Text inside Caption
    }
}
and so on, for every element found in NumberofPosts.
Any help is greatly appreciated, I've been stuck on this problem for days.
I believe my problem is a lack of knowledge about functions and dictionaries/lists.
For example, how do you add to a dictionary from multiple sources, such as functions, and keep the results together in the same entry?
CodePudding user response:
Oh, I think this might be a simple fix, brother.
for posts in NumberofPosts:
    post = {
        'Original Poster:' : _extract_post_name(bs_data),
        'Caption:' : _extract_post_caption(bs_data),
    }
FacebookPosts.append(post)
print(FacebookPosts)
There is an issue here: you need to put the FacebookPosts.append(post) inside the for block, otherwise you're only appending the last post.
for posts in NumberofPosts:
    post = {
        'Original Poster:' : _extract_post_name(bs_data),
        'Caption:' : _extract_post_caption(bs_data),
    }
    FacebookPosts.append(post)
print(FacebookPosts)
^That should fix it if I'm not mistaken.
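To see why the indentation of the append matters, here is a minimal sketch that uses made-up names instead of scraped data. The list `names` and the variables `appended_outside` and `appended_inside` are only for illustration and are not part of the original code.

```python
# Hypothetical stand-in data; in the real script these values come from the scraper.
names = ["Steve", "Bob"]

# append dedented: it runs once, after the loop has finished,
# so only the dict built in the last iteration is stored.
appended_outside = []
for name in names:
    post = {"Original Poster": name}
appended_outside.append(post)

# append indented: it runs on every iteration,
# so one dict per name is stored.
appended_inside = []
for name in names:
    post = {"Original Poster": name}
    appended_inside.append(post)

print(appended_outside)
print(appended_inside)
```

Note that this only changes how many entries end up in the list. If every call to the extraction functions returns the same value, the entries will still all look the same.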
CodePudding user response:
I solved the issue. Basically, I had to change NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')}), because that call was matching the h2 headers, which only contain the name of the poster. It has now been changed to bs_data.find_all('div', {"class": 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0'}), which gets the wrapper of the whole post. I'll leave the post here in case someone needs the code. Thanks for the help.
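For anyone landing here later, this is a rough sketch of the idea, not the exact final code. It assumes a simplified, made-up markup (a `post` class on each wrapper div) in place of the real Facebook class names, and it assumes each wrapper is passed into the extraction functions so that they search inside one post rather than the whole page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for driver.page_source; real Facebook HTML differs.
html = """
<div class="post"><h2><strong>Steve</strong></h2>
  <div style="text-align: start;">Text inside Caption</div></div>
<div class="post"><h2><strong>Bob</strong></h2>
  <div style="text-align: start;">Please Help me</div></div>
"""

def extract_post_name(post):
    # Search inside this one post wrapper only, not the whole page.
    tag = post.find("strong")
    return tag.get_text(strip=True) if tag else ""

def extract_post_caption(post):
    # Same idea: the lookup is scoped to the wrapper that was passed in.
    tag = post.find("div", attrs={"style": "text-align: start;"})
    return tag.get_text(strip=True) if tag else ""

soup = BeautifulSoup(html, "html.parser")
facebook_posts = []
for post in soup.find_all("div", class_="post"):
    # Each function receives the current wrapper, so both values
    # in the dict belong to the same post.
    facebook_posts.append({
        "Original Poster": extract_post_name(post),
        "Caption": extract_post_caption(post),
    })

print(facebook_posts)
```

Because the functions take the wrapper as their argument, the poster name and caption in each dict come from the same post, and each loop iteration produces a different entry.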