I have a column in my PySpark DataFrame that contains the price of my products and the currency they are sold in. As 99% of the products are sold in dollars, let's use the dollar example.
products_price
+------------+-----------+
| product_id | price     |
+------------+-----------+
| 0001       | 10|USD    |
| 0002       | 19.9|USD  |
| 0003       | 14.45|USD |
| 0004       | 17.75|USD |
| 0005       | 98.99|USD |
| 0006       | 5.60|USD  |
| 0007       | 20.50|USD |
+------------+-----------+
I tried a couple of things like this:
from pyspark.sql.functions import col, split

products_price = (
    products_price
    .withColumn("new_price", split(col("price"), "|").getItem(0))
)
But nothing works. The snippet above just returns the first character of the price column. It's weird, because some people said it worked for them. I just need to remove the |USD and leave the numbers. Could you please help me with this?
CodePudding user response:
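The split function treats its pattern as a regular expression, and | is a reserved symbol in regex (it means "or"). Left unescaped, it matches the empty string between characters, which is why you only get the first character back. You should escape it, e.g. split("price", r"\|") or split("price", "\\|"), to get what you need. The raw string (or the doubled backslash) keeps Python from treating \| as an invalid escape sequence.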
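To see the escaping issue in isolation, here is a minimal sketch using Python's built-in re module as a stand-in. This is an assumption for illustration only: Spark evaluates the pattern with Java's regex engine, and the two engines differ in how they handle empty matches when splitting, but in both an unescaped | is alternation and \| is a literal pipe. The variable names are made up for the example.

```python
import re

price = "19.9|USD"  # sample value taken from the question's table

# Unescaped, "|" is an alternation between two empty patterns, so it
# matches the empty string at the start of the input, not the pipe.
unescaped = re.match("|", price)
print(repr(unescaped.group()))

# Escaped, the pattern matches the literal pipe character.
parts = re.split(r"\|", price)
print(parts)

# re.escape builds the escaped pattern for an arbitrary delimiter.
pattern = re.escape("|")
amount = float(re.split(pattern, price)[0])
print(amount)
```

The same escaped pattern is what gets passed as the second argument to split in the PySpark snippet; if the numeric value is needed rather than a string, the resulting column would still have to be cast.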