My program scrapes product prices from various websites; some of them return the price as \\n\\n 9.99 €\\n \\n and others return it as \\n\\n 6.99 €\\n \\n\\n 4.99 €\\n \\n.
I would like to extract just the number from the string. I have tried using .strip('\\n'), but it won't remove any of the \\n sequences.
CodePudding user response:
To remove the newline characters, you can use string.replace("\n", "") to replace all \n characters with the empty string. In your example this would get you " 6.99 €  4.99 € ".
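As a rough illustration of why .strip('\n') appears to do nothing, the sketch below compares it with replace. It assumes the scraped text contains real newline characters; if it instead contains a literal backslash followed by n, the pattern to replace would be "\\n". The variable names are only for illustration.

```python
raw = "\n\n 9.99 €\n \n"

# strip("\n") only trims newlines at the two ends and stops at the first
# character that is not a newline (here, a space).
stripped = raw.strip("\n")

# replace("\n", "") removes every newline, wherever it occurs.
replaced = raw.replace("\n", "")

print(repr(stripped))  # ' 9.99 €\n '
print(repr(replaced))  # ' 9.99 € '
```

The inner newline survives strip because a space sits between it and the end of the string.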
If you want to get these as actual numbers, you can strip the outer whitespace with string.strip and then split on the Euro symbol with string.split, for example: s.replace("\n", "").strip().split("€"). For your example, this would output ['6.99 ', '  4.99 ', ''], a list of strings.
You can get the actual numbers out with a list comprehension. Tying it all together:
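string = "\n\n 6.99 €\n \n\n 4.99 €\n \n"
# Drop the newlines, trim both ends, and split on the Euro sign
string_to_list = string.replace("\n", "").strip().split("€")
# Skip empty or whitespace-only pieces, then convert each price to a float
numbers = [float(s.strip()) for s in string_to_list if s.strip()]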
For your example, this yields [6.99, 4.99].
CodePudding user response:
If these are the only characters cluttering your strings, you could remove them and then split the prices into a list.
s = "\\n\\n 6.99 €\\n \\n\\n 4.99 €\\n \\n"
# Remove the \n sequences and Euro signs, then split on the remaining whitespace
prices = s.replace("\\n", "").replace("€", "").split()
This will produce the list ['6.99', '4.99'].
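If floats are needed rather than strings, the same idea can be wrapped in a small function. This is only a sketch: extract_prices is a hypothetical name, and it assumes the only clutter is the Euro sign, whitespace, and either real newlines or literal backslash-n sequences.

```python
def extract_prices(text):
    """Return the prices found in text as floats (hypothetical helper)."""
    # Remove literal backslash-n sequences and Euro signs; real newlines are
    # whitespace, so the argument-free split() below discards them anyway.
    cleaned = text.replace("\\n", "").replace("€", "")
    return [float(part) for part in cleaned.split()]


print(extract_prices("\\n\\n 6.99 €\\n \\n\\n 4.99 €\\n \\n"))  # [6.99, 4.99]
print(extract_prices("\n\n 9.99 €\n \n"))  # [9.99]
```

Any other stray text in the input would make float() raise a ValueError, which may be useful as a signal that a site changed its format.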
CodePudding user response:
You can extract the prices with a short regular expression. The kind of data determines which pattern to use. Here I used r'(\d*\.\d{,2})', which matches zero or more digits followed by a . and up to two more digits, so it is designed for decimal numbers.
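To get a feel for what this pattern does and does not match, here is a small sketch run against a few made-up sample strings (the samples are assumptions, not data from the question):

```python
import re

pattern = re.compile(r"(\d*\.\d{,2})")

# Hypothetical inputs chosen to show the edge cases of the pattern
samples = ["9.99 €", ".5 €", "12.345 €", "10 €"]
results = {s: pattern.findall(s) for s in samples}

for sample, found in results.items():
    print(repr(sample), "->", found)
# '9.99 €' -> ['9.99']
# '.5 €' -> ['.5']
# '12.345 €' -> ['12.34']
# '10 €' -> []
```

Note that a price without a decimal point, such as 10 €, is not matched at all, so the pattern would need adjusting if the sites can return whole-number prices.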
import re
string_values = ['\\n\\n 9.99 €\\n \\n', '\\n\\n 6.99 €\\n \\n\\n 4.99 €\\n \\n']
pattern = re.compile(r'(\d*\.\d{,2})')
values = list(map(float, sum([pattern.findall(s) for s in string_values], [])))
print(values)
Output
[9.99, 6.99, 4.99]
Remarks:
findall is used because the pattern can occur more than once in the same string; it returns all of the matches.
sum(iterable, []) is used to flatten the nested list (there are better solutions, such as itertools.chain, ...).
map is used to cast every string into a float.
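For reference, the itertools.chain alternative mentioned in the remarks could look roughly like this. It is a sketch that assumes the strings contain real newline characters; the regex ignores them either way.

```python
import re
from itertools import chain

string_values = ["\n\n 9.99 €\n \n", "\n\n 6.99 €\n \n\n 4.99 €\n \n"]
pattern = re.compile(r"\d*\.\d{,2}")

# One list of matches per input string, produced lazily
matches_per_string = (pattern.findall(s) for s in string_values)

# chain.from_iterable flattens the nested lists without the repeated
# list concatenation that sum(iterable, []) performs
values = [float(m) for m in chain.from_iterable(matches_per_string)]
print(values)  # [9.99, 6.99, 4.99]
```

For a handful of strings the difference is negligible, but chain.from_iterable avoids building a new intermediate list for every input.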