I have a column with HTML values in a data frame like below.
print(df['Body'])
0 <html><head>\r\n<meta http-equiv="Content-Type...
1 <html xmlns:v="urn:schemas-microsoft-com:vml" ...
2 <html>\r\n<head>\r\n<meta http-equiv="Content-...
3 <meta http-equiv="Content-Type" content="text/...
Name: Body, dtype: object
But when I try to convert them to plain text like below:
from selectolax.parser import HTMLParser
from bs4 import BeautifulSoup
soup = BeautifulSoup(df['Body'])
print(soup.get_text())
I end up with the error below. How should I change this code to get the plain text for the whole column?
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-44-4c34d2bd9f1f> in <module>
----> 1 soup = BeautifulSoup(df['Body'])
2 print(soup.get_text())
~\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
251 if builder is None:
252 builder = builder_class(**kwargs)
--> 253 if not original_builder and not (
254 original_features == builder.NAME or
255 original_features in builder.ALTERNATE_NAMES
~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1440 @final
1441 def __nonzero__(self):
-> 1442 raise ValueError(
1443 f"The truth value of a {type(self).__name__} is ambiguous. "
1444 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
CodePudding user response:
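BeautifulSoup expects a single markup string, not a whole Series, which is why passing df['Body'] directly raises the ValueError. Use Series.apply to run the parsing on each cell of the column. Here's an example; put your own logic in the parse_cell function: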
from bs4 import BeautifulSoup

def parse_cell(content):
    # Parse one HTML string and return only its text
    return BeautifulSoup(content, features="html.parser").get_text()

df['plain'] = df['Body'].apply(parse_cell)
print(df)
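If some cells in the column are empty (None/NaN), the function above may fail on them, so a slightly more defensive variant can help. This is only a sketch: the sample data frame and the name html_to_text are made up for illustration, and returning an empty string for non-string cells is an assumption about what you want.

```python
import pandas as pd
from bs4 import BeautifulSoup

def html_to_text(content):
    # Assumption: non-string cells (None, NaN) should become an empty string
    if not isinstance(content, str):
        return ""
    soup = BeautifulSoup(content, features="html.parser")
    # separator=" " keeps text from adjacent tags apart;
    # strip=True trims whitespace around each text fragment
    return soup.get_text(separator=" ", strip=True)

# Hypothetical sample data standing in for df['Body']
sample = pd.DataFrame({
    "Body": [
        "<html><body><p>Hello</p><p>world</p></body></html>",
        None,
        "<b>bold</b> text",
    ]
})

sample["plain"] = sample["Body"].apply(html_to_text)
print(sample["plain"].tolist())
```

The separator and strip arguments are optional; drop them if you prefer the raw text exactly as get_text() returns it.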