I came across an issue that, with my limited experience, seems weird.
After scraping and a bit of cleanup, I have a list that looks like this:
['[', '€23600', '€7600', '€3600', '€1600', '€800', '€400']
The list can contain €, $ or £ as currency. I want to get rid of the currency, so I use(d)
while "€" in pricemoney: pricemoney.remove("€")
while "£" in pricemoney: pricemoney.remove("£")
while "$" in pricemoney: pricemoney.remove("$")
which works for $ and £, but not for €. pricemoney is the variable that contains the list. I checked whether the € might be a different symbol from the one on a normal keyboard, but it seems to be the same.
I switched to
pricemoney = [re.sub(r'\D(?=\d+)', "", i) for i in pricemoney]
which gets rid of all the currencies in one go (and thus is the better way anyway), but I still can't figure out why the "€" was not caught by the while statement.
CodePudding user response:
The code while "€" in pricemoney: pricemoney.remove("€")
checks whether the string "€" is itself an element of the list pricemoney.
That means it removes only elements that are exactly equal to "€", not elements that merely start with it. The same applies to the other two symbols you mentioned.
To see this, run the following commands and note that each loop removes only the first element:
pricemoney = ['€', '€2000']
while "€" in pricemoney: pricemoney.remove("€")
pricemoney = ['$', '$2000']
while "$" in pricemoney: pricemoney.remove("$")
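To make the difference concrete, here is a minimal sketch (the sample list and the variable names has_bare_euro, has_euro_prefix and cleaned are my own, not from the question). It contrasts list membership, which compares whole elements, with a substring check on each element, and then strips a leading currency symbol with str.lstrip:

```python
pricemoney = ['[', '€23600', '€7600', '$3600', '£1600']

# List membership compares whole elements, so no element equals "€" here.
has_bare_euro = "€" in pricemoney

# A substring check on each element is what actually finds the symbol.
has_euro_prefix = any("€" in item for item in pricemoney)

# Strip any leading currency symbols from every element.
cleaned = [item.lstrip("€£$") for item in pricemoney]

print(has_bare_euro, has_euro_prefix)  # False True
print(cleaned)  # ['[', '23600', '7600', '3600', '1600']
```

Note that lstrip removes every leading character found in the given set, which is fine for prices like these but would also strip repeated symbols such as "$$5".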
CodePudding user response:
Honestly a regex replacement might be best here:
# -*- coding: utf-8 -*-
import re
inp = ['[', '€23600', '€7600', '€3600', '€1600', '€800', '€400']
output = [re.sub(r'[€£$](\d+)', r'\1', x) for x in inp]
print(output) # ['[', '23600', '7600', '3600', '1600', '800', '400']
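If the symbol should only be removed when it is at the very start of the string, a slightly stricter variant could look like the sketch below. The helper name strip_currency, the constant CURRENCY_PREFIX and the extra sample values are hypothetical and just for illustration:

```python
import re

# Assumption: a price is one leading currency symbol directly followed by digits.
# Inside a character class, "$" is a literal dollar sign, not an end anchor.
CURRENCY_PREFIX = re.compile(r'^[€£$](?=\d)')

def strip_currency(values):
    """Return a new list with one leading currency symbol removed per price."""
    return [CURRENCY_PREFIX.sub('', v) for v in values]

result = strip_currency(['[', '€23600', '£7600', '$400', 'n/a'])
print(result)  # ['[', '23600', '7600', '400', 'n/a']
```

Compiling the pattern once keeps the list comprehension short, and the lookahead leaves strings such as '[' or 'n/a' untouched.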