Scraping Text incl. Emojis with BeautifulSoup

Time:03-09

I would appreciate your help with this issue. I am trying to scrape forum posts, including the emojis. Getting the text works, but the emojis are not included, and I would like to scrape them together with the text using the function below. Thank you for your help!

For the link below, the emoji images have the class 'smilies'.
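To see why the emojis go missing, here is a minimal sketch (the HTML snippet is hypothetical, modeled on the forum's markup): the emoji lives only in the `alt` attribute of an `<img class="smilies">` tag, and `get_text()` returns text nodes only, so the emoji is silently dropped.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet of forum markup: the emoji is an <img> element
# whose alt attribute holds the emoji character.
html = '<div class="content">Hallo <img class="smilies" alt="🙂"> zusammen</div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() collects only text nodes; the <img> tag and its alt are skipped.
text = soup.get_text(strip=True)
print(text)
print("🙂" in text)  # False: the emoji never appears in the extracted text
```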

Here is my code:

### import
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()

### second, create a function to get all the user comments
def get_comments(lst_name):
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the comments to get the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name

### third, start the scraping
link = 'https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120'

# create the lists for the functions
user_comments = []
   
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, 'html.parser')
        
# call the functions to get the information
get_comments(user_comments)

# create a pandas dataframe for the comments
comments_dict = {
    'user_comments': user_comments
}

df_comments_info = pd.DataFrame(data=comments_dict)
        
# concatenate the temporary dataframe with the dataframe we created earlier
# (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
df = pd.concat([df, df_comments_info], ignore_index=True)

CodePudding user response:

One way to do it is to replace each smiley `<img>` tag with the emoji stored in its `alt` attribute before extracting the text. For example:

### import
import requests
import pandas as pd
from bs4 import BeautifulSoup

### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()

### second, create a function to get all the user comments
def get_comments(lst_name):
    # replace all <img > with text:
    for img in bs.select("img.smilies"):
        img.replace_with(img["alt"])
    bs.smooth()

    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the list comment to get the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name


### third, start the scraping
link = "https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120"

# create the lists for the functions
user_comments = []

# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, "html.parser")

# call the functions to get the information
get_comments(user_comments)

# create a pandas dataframe for the comments
comments_dict = {"user_comments": user_comments}

df_comments_info = pd.DataFrame(data=comments_dict)

# concatenate the temporary dataframe with the dataframe we created earlier
# (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
df = pd.concat([df, df_comments_info], ignore_index=True)
print(df)

Prints:


...

Danke!Erst Mal sollte ich bei den Tabletten bleiben. Hab die ja schon Mal genommen. Genau die gleichen wie ihr mir empfiehlt. Aber die sind fast leer und auf Amazon gibt's die nicht mehr.Soll ich bei der Quelle nachfragen oder denkst du findest es günstiger? :)

...
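The core of the fix can also be demonstrated without hitting the network. This sketch (with a hypothetical snippet mimicking the forum's markup) shows `replace_with` swapping each `img.smilies` tag for its `alt` text, and `smooth()` merging the resulting adjacent text nodes so `get_text()` returns one clean string:

```python
from bs4 import BeautifulSoup

# Hypothetical forum snippet; class names match the answer above.
html = ('<div class="content">Danke! <img class="smilies" alt="🙂"> '
        'Bis bald <img class="smilies" alt="😀"></div>')
soup = BeautifulSoup(html, "html.parser")

# Swap each smiley <img> for the emoji character in its alt attribute.
for img in soup.select("img.smilies"):
    img.replace_with(img["alt"])
soup.smooth()  # merge the now-adjacent text nodes into single strings

print(soup.get_text(strip=True))  # emojis are now part of the text
```

`smooth()` (available since Beautiful Soup 4.8) is optional here, but without it the tree keeps separate NavigableString fragments where the tags used to be.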