How to split the text in a pyspark column using a delimiter?

Time:09-06

I have a column in my pyspark dataframe which contains the price of my products and the currency they are sold in. As 99% of the products are sold in dollars, let's use the dollar example.

products_price
+---------------+-----------+
| product_id    | price     |
+---------------+-----------+
| 0001          | 10|USD    |
| 0002          | 19.9|USD  |
| 0003          | 14.45|USD |
| 0004          | 17.75|USD |
| 0005          | 98.99|USD |
| 0006          | 5.60|USD  |
| 0007          | 20.50|USD |
+---------------+-----------+

I tried a couple of things like this:

from pyspark.sql.functions import split, col

products_price = (
    products_price
    .withColumn("new_price", split(col("price"), "|").getItem(0))
)

But nothing works. The snippet above just returns the first character of the price column, which is odd because other people have said this approach worked for them. I just need to remove the |USD suffix and keep the number. Could you please help me with this?

CodePudding user response:

The split function treats its second argument as a regular expression, and | is a reserved symbol (alternation) in regex. Escape it, e.g. split("price", r"\|") or split(col("price"), "\\|"), to split on a literal pipe and get what you need.
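The same behavior can be reproduced with Python's re module (Spark uses Java regex, but the rule is the same here): an unescaped | is an alternation of two empty patterns, so it matches at every position and splits between every character, while the escaped form splits on the literal pipe. A minimal sketch:

```python
import re

price = "10|USD"

# Unescaped "|" is regex alternation between two empty patterns: it
# matches the empty string at every position, so the split lands
# between every character (the literal "|" is never consumed).
print(re.split("|", price))    # ['', '1', '0', '|', 'U', 'S', 'D', '']

# Escaping the pipe makes it a literal character to split on.
print(re.split(r"\|", price))  # ['10', 'USD']
```

This is why getItem(0) on the unescaped split returned only the first character of the price.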
