I need just the prices extracted from the strings below

My program scrapes product prices from various websites; some of them return the price as \\n\\n 9.99 €\\n \\n and others return the price as \\n\\n 6.99 €\\n \\n\\n 4.99 €\\n \\n.

I would like to get just the number out of the string. I have tried using .strip('\\n'), but it doesn't seem to remove any of the \\n characters.

CodePudding user response：

To remove the newline characters, you can use string.replace("\n", "") to replace all \n characters with the empty string. In your example this would get you " 6.99 €  4.99 € ".
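As for why .strip('\n') seemed to do nothing: strip only removes characters from the two ends of the string and stops at the first character that is not in the given set, so the newlines sitting behind spaces are left alone. A small sketch of the difference, assuming the scraped text contains real newline characters:

```python
s = "\n\n 6.99 €\n \n\n 4.99 €\n \n"

# strip("\n") only trims newlines at the very start and end;
# it stops as soon as it meets a space, so inner newlines survive
stripped = s.strip("\n")
print(repr(stripped))   # ' 6.99 €\n \n\n 4.99 €\n '

# replace("\n", "") removes every newline, wherever it is
replaced = s.replace("\n", "")
print(repr(replaced))   # ' 6.99 €  4.99 € '
```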

If you want to get these as actual numbers, you can strip the outer whitespace with string.strip and then split on the Euro symbol with string.split, for example: s.replace("\n", "").strip().split("€"). For your example, this would output ['6.99 ', '  4.99 ', ''], a list of strings.

You can get the actual numbers out with a list comprehension. Tying it all together:

string = "\n\n 6.99 €\n \n\n 4.99 €\n \n"
string_to_list = string.replace("\n", "").strip().split("€")
numbers = [float(s.strip()) for s in string_to_list if s != ""]

For your example, this yields [6.99, 4.99].

CodePudding user response：

If these are the only characters cluttering your strings, you could remove them and then split the prices into a list.

s = "\\n\\n          6.99 €\\n        \\n\\n        4.99 €\\n      \\n"

l = s.replace("\\n", "").replace("€", "").split()

This will produce the list ['6.99', '4.99']
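If you need floats rather than strings, and you are not sure whether the scraper hands you real newlines or the two literal characters \ and n, something like the following might work. This is only a sketch: extract_prices is a made-up helper name, and it assumes € and newlines are the only clutter around the numbers.

```python
def extract_prices(raw):
    # hypothetical helper, not part of any library
    # turn literal backslash-n sequences and the euro sign into spaces
    cleaned = raw.replace("\\n", " ").replace("€", " ")
    # split() with no arguments splits on any whitespace,
    # which also takes care of real newline characters
    return [float(token) for token in cleaned.split()]


print(extract_prices("\\n\\n          6.99 €\\n        \\n\\n        4.99 €\\n      \\n"))
print(extract_prices("\n\n 9.99 €\n \n"))
```

Replacing with a space instead of an empty string avoids accidentally gluing two prices together when nothing else separates them.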

CodePudding user response：

You can extract the values with a short regular expression. The kind of data determines the pattern to use. Here I used r'(\d*\.\d{,2})', which means: zero or more digits, followed by a literal . and then up to two digits, so it is designed for floating-point numbers.

import re

string_values = ['\\n\\n        9.99 €\\n      \\n', '\\n\\n          6.99 €\\n        \\n\\n        4.99 €\\n      \\n']

pattern = re.compile(r'(\d*\.\d{,2})')

values = list(map(float, sum([pattern.findall(s) for s in string_values], [])))

print(values)

Output

[9.99, 6.99, 4.99]

Remarks:

findall is used because the pattern can occur more than once in the same string, and it returns all of the matches
sum(iterable, []) is used to flatten the nested list (there are better solutions, such as itertools.chain)
map is used to cast every string to a float
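For reference, the flattening step could be written with itertools.chain instead of sum. This is a sketch using the same pattern and sample strings as above:

```python
import re
from itertools import chain

string_values = ['\\n\\n        9.99 €\\n      \\n', '\\n\\n          6.99 €\\n        \\n\\n        4.99 €\\n      \\n']

pattern = re.compile(r'(\d*\.\d{,2})')

# findall returns one list of matches per string;
# chain.from_iterable walks through them as a single flat sequence
matches = chain.from_iterable(pattern.findall(s) for s in string_values)
values = [float(m) for m in matches]

print(values)  # [9.99, 6.99, 4.99]
```

This avoids building a new intermediate list at every step the way sum(iterable, []) does, which matters mostly when there are many strings.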