Cannot convert HTML content of a data frame into text

Time:10-24

I have a column with HTML values in a data frame like below.

print(df['Body'])

0    <html><head>\r\n<meta http-equiv="Content-Type...
1    <html xmlns:v="urn:schemas-microsoft-com:vml" ...
2    <html>\r\n<head>\r\n<meta http-equiv="Content-...
3    <meta http-equiv="Content-Type" content="text/...
Name: Body, dtype: object

I am trying to convert them to plain text like this:

from selectolax.parser import HTMLParser
from bs4 import BeautifulSoup

soup = BeautifulSoup(df['Body'])
print(soup.get_text())

But I end up with the error below. How should I change this code to get the plain text for the whole column?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-4c34d2bd9f1f> in <module>
----> 1 soup = BeautifulSoup(df['Body'])
      2 print(soup.get_text())

~\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
    251         if builder is None:
    252             builder = builder_class(**kwargs)
--> 253             if not original_builder and not (
    254                     original_features == builder.NAME or
    255                     original_features in builder.ALTERNATE_NAMES

~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1440     @final
   1441     def __nonzero__(self):
-> 1442         raise ValueError(
   1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response:

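BeautifulSoup expects a single string (or file-like object) as markup, not a whole Series. When it evaluates the Series in a boolean context, pandas raises the ValueError shown in the traceback. Use Series.apply to run the parser on each cell of the column instead. Here is an example; put your own logic in the parse_cell function.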

from bs4 import BeautifulSoup


def parse_cell(content):
    return BeautifulSoup(content, features="html.parser").get_text()


df['plain'] = df['Body'].apply(parse_cell)
print(df)
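
If some cells are missing or contain script and style blocks, the per-cell function can be made a little more defensive. The following is only a sketch under assumptions: the function name html_to_text and the sample data are hypothetical, and returning an empty string for non-string cells is one possible choice, not something the original question specifies.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_text(content):
    """Return the visible text of one HTML cell, or "" for non-string cells."""
    # Missing values (NaN/None) are not strings; skip them instead of parsing.
    if not isinstance(content, str):
        return ""
    soup = BeautifulSoup(content, features="html.parser")
    # Drop <script> and <style> elements so their contents do not leak into the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join text fragments with a space and strip whitespace around each fragment.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample data standing in for the real df['Body'] column.
df = pd.DataFrame({
    "Body": [
        "<html><head><style>p {color: red}</style></head>"
        "<body><p>Hello</p><p>World</p></body></html>",
        None,
    ]
})
df["plain"] = df["Body"].apply(html_to_text)
print(df["plain"].tolist())
```

Passing separator=" " keeps words from adjacent tags from running together, which tends to matter for email-style HTML like the bodies in the question.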