I scraped some data from the web and it all looked good. However, once I processed the data and performed some string operations on it, the final output showed some characters as Unicode escape codes. How can I fix it?
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.fed.cuhk.edu.hk/cri/faculty/prof-lee-kit-bing-icy/')
soup = BeautifulSoup(r.text)
ref = soup.select('h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li')[-1]
publication_dict = {}

# journal pages and periodical
# NOTE: `periodical` is not defined in this snippet; presumably it is set earlier in the full script
if ref.text[ref.text.find(ref.em.text) + len(ref.em.text) + 2:-1] == "":
    publication_dict['remamin_information'] = None
else:
    if periodical is not None:
        publication_dict['remamin_information'] = (periodical + ref.text[ref.text.find(ref.em.text) + len(ref.em.text):-1])
    else:
        publication_dict['remamin_information'] = (ref.text[ref.text.find(ref.em.text) + len(ref.em.text):-1])
publication_dict
Answer:
When you print a list or dict, Python uses a debug representation (the repr()) of the elements to help identify unprintable characters. If you actually print() the string itself, you'll see the display representation:
>>> d = {'remamin_information':',\xa017(2), 69-85.\r\n '}
>>> d  # display the dict. Elements use debug representation.
{'remamin_information': ',\xa017(2), 69-85.\r\n '}
>>> d['remamin_information']  # The REPL uses a debug representation
',\xa017(2), 69-85.\r\n '
>>> print(d['remamin_information'])  # the \xa0 is actually a NO-BREAK SPACE
, 17(2), 69-85.                        # and the \r\n becomes a line break
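The same behaviour can be checked in a script. This is a minimal sketch, assuming the string from the session above; the variable names (`value`, `record`, and so on) are illustrative and not part of the original code. It uses the standard `unicodedata` module to confirm that the string holds a single NO-BREAK SPACE character rather than a literal backslash sequence.

```python
import unicodedata

value = ',\xa017(2), 69-85.\r\n '
record = {'remamin_information': value}

# A dict builds its display from repr() of each element, so
# non-printable characters appear as escape sequences.
container_view = str(record)

# repr() of the string on its own shows the same escapes.
debug_view = repr(value)

# The second character is one real character; look up its Unicode name.
second_char_name = unicodedata.name(value[1])

print(container_view)
print(debug_view)
print(second_char_name)
```

The escapes exist only in the text produced by `repr()`; the string stored in the dict is unchanged.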
There's nothing to "convert back to normal". Just make sure to print() strings to see their display representation.
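If the goal is instead to store a value without the NO-BREAK SPACE and the trailing line break (an assumption about what the asker wants, not something the answer requires), one possible sketch is to collapse all whitespace with `str.split()` and `str.join()`. The helper name `normalize_whitespace` is hypothetical.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes NO-BREAK SPACE, carriage return and line feed,
    # and it drops leading and trailing whitespace.
    return " ".join(text.split())


raw = ',\xa017(2), 69-85.\r\n '
cleaned = normalize_whitespace(raw)

# With only ASCII spaces left, the dict display has no escape sequences.
print({'remamin_information': cleaned})
```

This changes the stored data, so it is only appropriate when the exact original whitespace does not matter.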