Cannot convert HTML content of a data frame column to plain text

I have a data frame column containing HTML values, as shown below.

print(df['Body'])

0    <html><head>\r\n<meta http-equiv="Content-Type...
1    <html xmlns:v="urn:schemas-microsoft-com:vml" ...
2    <html>\r\n<head>\r\n<meta http-equiv="Content-...
3    <meta http-equiv="Content-Type" content="text/...
Name: Body, dtype: object

I tried to convert them to plain text like this:

from selectolax.parser import HTMLParser
from bs4 import BeautifulSoup

soup = BeautifulSoup(df['Body'])
print(soup.get_text())

But I end up with the error below. How should I change this code to get the plain text for the whole column?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-4c34d2bd9f1f> in <module>
----> 1 soup = BeautifulSoup(df['Body'])
      2 print(soup.get_text())

~\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
    251         if builder is None:
    252             builder = builder_class(**kwargs)
--> 253             if not original_builder and not (
    254                     original_features == builder.NAME or
    255                     original_features in builder.ALTERNATE_NAMES

~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1440     @final
   1441     def __nonzero__(self):
-> 1442         raise ValueError(
   1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response:

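BeautifulSoup expects a single string (or file-like object) as its markup, not a whole Series. When you pass df['Body'], the Series ends up being evaluated as a boolean inside the BeautifulSoup constructor, which pandas rejects with the ValueError above. Use Series.apply to run the parsing on each cell of the column instead. Here's an example; put your own logic in the parse_cell function: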

from bs4 import BeautifulSoup


def parse_cell(content):
    # Parse one cell's HTML string and return only its text content.
    return BeautifulSoup(content, features="html.parser").get_text()


df['plain'] = df['Body'].apply(parse_cell)
print(df)
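
If the column can contain missing values or style/script blocks (common in e-mail HTML), a slightly more defensive variant of the same Series.apply approach might look like the sketch below. The html_to_text name and the sample data frame are made up for illustration, and the choice to return an empty string for non-string cells is an assumption; adjust it to whatever your data needs.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_text(content):
    """Return the text of one HTML string, or "" for missing values."""
    # Non-string cells (NaN, None) are not valid markup, so skip them.
    # Returning "" here is an assumption; None or NaN may suit your data better.
    if not isinstance(content, str):
        return ""
    soup = BeautifulSoup(content, features="html.parser")
    # Remove <script> and <style> elements; depending on the bs4 version,
    # get_text() may otherwise include their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " keeps text from adjacent tags apart;
    # strip=True trims each piece and drops whitespace-only pieces.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample data standing in for the real df['Body'] column.
df = pd.DataFrame({
    "Body": [
        "<html><body><p>Hello</p><p>world</p></body></html>",
        "<div><style>p {color: red}</style>Plain <b>bold</b> text</div>",
        None,
    ]
})

df["plain"] = df["Body"].apply(html_to_text)
print(df["plain"].tolist())
```

The function still handles one cell at a time, so it drops into df['Body'].apply(...) exactly like parse_cell above.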