How to avoid extracting the "spaces" from the HTML page, which appear as "\n" in th

Code that extracts data from the page: [image]

Unexpected result (.json): [image]

How can I remove the '\n', as in this example? [image]

URL: https://storage.googleapis.com/infosimples-public/commercia/case/product.html#

CodePudding user response：

First off, please don't use images when posting your code. It is easier for you and everyone else if you simply copy and paste it.

All you need to do is split the categories string on the newline character, call str.strip on each piece to get rid of any extra whitespace, and drop the pieces that end up empty.

For example:

resposta_final['categories'] = [i.strip() for i in resposta_final['categories'].split("\n") if i.strip() != ""]

Add that line to the end of your script and it should be similar to your example.
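To see the split-and-strip step on its own, here is a minimal sketch. The sample string and the helper name split_clean are made up for illustration; the text your scraper actually returns may look different.

```python
# Hypothetical raw text, shaped like what a scraper might return for the
# category block; the real page content may differ.
raw_categories = "\nCommercia\n\n  Health & Care \n\nSkin Care\n"


def split_clean(text):
    """Split on newlines, strip each piece, and drop pieces that are empty."""
    return [part.strip() for part in text.split("\n") if part.strip()]


categories = split_clean(raw_categories)
print(categories)  # ['Commercia', 'Health & Care', 'Skin Care']
```

The filter `if part.strip()` does the same job as comparing against "" because an empty string is falsy in Python.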

CodePudding user response：

A JSON string can't contain a literal line break, so any newline in your text is written out as the escape sequence \n, which is what you are seeing in the file. To store several categories you should use a list, like so: {"categories": ["Commercia", "Health & Care", ...]}
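A small sketch of the difference, using only the standard json module. The two dictionaries are illustrative values, not the asker's real data.

```python
import json

# Illustrative values only: one string with an embedded newline,
# and the same information stored as a list.
as_string = {"categories": "Commercia\nHealth & Care"}
as_list = {"categories": ["Commercia", "Health & Care"]}

string_json = json.dumps(as_string)
list_json = json.dumps(as_list)

# The newline is serialized as the two characters backslash + n.
print(string_json)
# The list is serialized as a JSON array with one string per category.
print(list_json)
```

Loading either result back with json.loads gives the original Python value, so the \n in the file is not corruption, just the escaped form of the newline that was in the string.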

CodePudding user response：

What Alexander suggested works, but I think this is better, because it selects each category link directly instead of cleaning up the combined text afterwards:

resposta_final['categories'] = [element.get_text() for element in parsed_html.select(".current-category a")]
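Here is a self-contained sketch of that selector approach, assuming BeautifulSoup (the bs4 package) is installed. The HTML snippet is hypothetical and only imitates a breadcrumb with the class current-category; the real page's markup may differ. Passing strip=True to get_text is an addition to the line above, included in case the link text itself carries surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the structure of the real page is assumed, not copied.
html = """
<nav class="current-category">
  <a href="#">Commercia</a>
  &gt;
  <a href="#">Health &amp; Care</a>
</nav>
"""

parsed_html = BeautifulSoup(html, "html.parser")

# Select only the <a> elements inside .current-category, so the newlines
# and separators between the links never enter the result.
categories = [
    element.get_text(strip=True)
    for element in parsed_html.select(".current-category a")
]
print(categories)  # ['Commercia', 'Health & Care']
```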