I need just the prices extracted from the strings below

Time:04-02

My program scrapes product prices from various websites; some of them return the price as \\n\\n 9.99 €\\n \\n and others return the price as \\n\\n 6.99 €\\n \\n\\n 4.99 €\\n \\n.

I would like to get just the number out of the string. I have tried using .strip('\\n'), but it doesn't seem to remove any of the \\n sequences.

CodePudding user response:

To remove the newline characters, you can use string.replace("\n", "") to replace all \n characters with the empty string. In your example this would get you " 6.99 € 4.99 € ".
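As a rough illustration of why .strip('\n') seemed to do nothing (assuming the scraped text contains real newline characters): str.strip only trims the two ends of the string and stops at the first character that is not in its argument, so the spaces around the price shield the inner newlines. The variable names below are only for illustration.

```python
raw = "\n\n 9.99 €\n \n"

# strip() only trims the ends, and stops at the first character
# that is not in its argument (here, the space before "9.99").
trimmed = raw.strip("\n")
print(repr(trimmed))      # ' 9.99 €\n '

# replace() removes every occurrence, wherever it sits in the string.
cleaned = raw.replace("\n", "")
print(repr(cleaned))      # ' 9.99 € '

# strip() with no argument trims all surrounding whitespace,
# newlines included.
print(repr(raw.strip()))  # '9.99 €'
```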

If you want to get these as actual numbers, you can strip the outer whitespace with string.strip and then split between Euro symbols with string.split, for example: s.replace("\n", "").strip().split("€"). For your example, this would output ['6.99 ', '  4.99 ', ''], a list of strings.

You can get the actual numbers out with a list comprehension. Tying it all together:

string = "\n\n 6.99 €\n \n\n 4.99 €\n \n"
string_to_list = string.replace("\n", "").strip().split("€")
numbers = [float(s.strip()) for s in string_to_list if s != ""]

For your example, this yields [6.99, 4.99].
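If you need this for both kinds of scraped strings, the same steps could be wrapped in a small helper. This is only a sketch: the name extract_prices and its currency parameter are made up here, and it assumes real newline characters and a dot as the decimal separator.

```python
def extract_prices(raw, currency="€"):
    """Return the prices found in a scraped string as floats."""
    parts = raw.replace("\n", "").split(currency)
    # Skip the empty or whitespace-only pieces left over by split().
    return [float(p.strip()) for p in parts if p.strip()]


print(extract_prices("\n\n 9.99 €\n \n"))                # [9.99]
print(extract_prices("\n\n 6.99 €\n \n\n 4.99 €\n \n"))  # [6.99, 4.99]
```

Filtering on p.strip() rather than on s != "" also drops pieces that contain only spaces, so the outer strip() call is no longer needed.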

CodePudding user response:

If these are the only characters that are cluttering your strings, you could remove them, then split the prices into a list.

s = "\\n\\n          6.99 €\\n        \\n\\n        4.99 €\\n      \\n"

l = s.replace("\\n", "").replace("€", "").split()

This will produce the list ['6.99', '4.99']
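The list above still holds strings. If the prices are going to be added up or compared, one option is to convert them with decimal.Decimal, which avoids binary floating-point rounding on currency amounts. A minimal sketch, assuming the scraped text contains real newline characters:

```python
from decimal import Decimal

s = "\n\n          6.99 €\n        \n\n        4.99 €\n      \n"

# split() with no argument splits on any run of whitespace,
# so the newlines need no separate handling here.
tokens = s.replace("€", "").split()
prices = [Decimal(t) for t in tokens]

print(prices)       # [Decimal('6.99'), Decimal('4.99')]
print(sum(prices))  # 11.98
```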

CodePudding user response:

Extract the values with a short regular expression. The right pattern depends on the shape of your data. Here I used r'(\d*\.\d{,2})', which matches zero or more digits, then a literal ., then up to two digits, so it is designed for decimal numbers.

import re

string_values = ['\\n\\n        9.99 €\\n      \\n', '\\n\\n          6.99 €\\n        \\n\\n        4.99 €\\n      \\n']

pattern = re.compile(r'(\d*\.\d{,2})')

values = list(map(float, sum([pattern.findall(s) for s in string_values], [])))

print(values)

Output

[9.99, 6.99, 4.99]

Remarks:

  • findall is used because the pattern can occur more than once in the same string, and it returns every match
  • sum(iterable, []) is used to flatten the nested list (there are better solutions, such as itertools.chain)
  • map is used to convert every string into a float
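The itertools.chain alternative mentioned in the remarks could look roughly like this. Note that the pattern here is a slightly stricter variation of my own (at least one digit on each side of the dot), not the one used above, and the strings are assumed to contain real newline characters.

```python
import re
from itertools import chain

string_values = [
    "\n\n        9.99 €\n      \n",
    "\n\n          6.99 €\n        \n\n        4.99 €\n      \n",
]

# One or more digits, a literal dot, then one or two digits.
pattern = re.compile(r"\d+\.\d{1,2}")

# chain.from_iterable flattens the per-string match lists lazily,
# without building intermediate lists the way sum(..., []) does.
matches = chain.from_iterable(pattern.findall(s) for s in string_values)
values = [float(m) for m in matches]

print(values)  # [9.99, 6.99, 4.99]
```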