I'm trying to split the values of a column in a PySpark DataFrame. The column size holds values such as '15ML' and '20GM'. I want to split them so that the output values become '15 ML' and '20 GM'. In other words, I'm trying to separate the numerical value from its unit. Please help me by providing a solution for this issue.
+--------------+-------------------------+
|size          |new_size(after splitting)|
+--------------+-------------------------+
|         100MG|                   100 MG|
|           1EA|                     1 EA|
|         100MG|                   100 MG|
+--------------+-------------------------+
I have inserted the sample data and the final column format I require. Thanks in advance
I tried using the below code, but I was not getting a proper result from this.
`
from pyspark.sql.functions import split
df_f = products_size_df.withColumn("new_size", split(products_size_df.size, "MG"))
`
CodePudding user response:
If you just want to add a space between the value and its unit, you can use regexp_replace like this:
from pyspark.sql.functions import regexp_replace
products_size_df.withColumn("new_size", regexp_replace(products_size_df.size, r'(\d+)', '$1 '))
$1 refers to the first capture group, so the pattern finds the numeric value and the replacement writes it back followed by a space.
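To sanity-check the pattern without a Spark session, the same substitution can be sketched with Python's re module. This is only an illustration under assumptions: add_space is a hypothetical helper name, and Python writes the backreference as \1 where regexp_replace (which uses Java regex syntax) writes $1.

```python
import re

def add_space(size):
    # Insert a space after each run of digits; r"\1" is Python's
    # counterpart of the "$1" backreference used by regexp_replace.
    return re.sub(r"(\d+)", r"\1 ", size)

for value in ["100MG", "1EA", "15ML"]:
    print(value, "->", add_space(value))
```

If the pattern behaves as expected here, the equivalent '(\d+)' and '$1 ' pair should be a reasonable starting point for the Spark call.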
If you need to create new columns for the value and the unit, you can use regexp_extract like this:
from pyspark.sql.functions import regexp_extract
products_size_df.withColumn("new_value", regexp_extract(products_size_df.size, r'(\d*)', 1))
products_size_df.withColumn("new_unit", regexp_extract(products_size_df.size, r'([A-Za-z]+)', 1))
Note that regexp_extract returns an empty string rather than NULL when the group is not found; NULL is returned only if the column value itself is NULL.
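The empty-string behaviour can be mimicked in plain Python to see what each pattern yields for a given input. This is a rough sketch, not Spark itself: extract and split_size are hypothetical helper names, and Python's re engine is assumed to behave like Spark's Java regex for these simple patterns.

```python
import re

def extract(pattern, text, group=1):
    # Mimic regexp_extract: return the group from the first match,
    # or an empty string when the pattern does not match at all.
    match = re.search(pattern, text)
    return match.group(group) if match else ""

def split_size(size):
    # Return a (value, unit) pair, e.g. "100MG" -> ("100", "MG").
    return extract(r"(\d+)", size), extract(r"([A-Za-z]+)", size)

for value in ["100MG", "1EA", "100", "EA"]:
    print(value, "->", split_size(value))
```

Inputs with a missing value or unit produce an empty string in that position, which is the case to watch for when filtering the extracted columns.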
CodePudding user response:
You can use a UDF to split the numeric and alphabetic characters in a string:
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [["100MG"], ["1EA"], ["100MG"]]
df = spark.createDataFrame(data).toDF("size")

def split_func(s):
    # Surround each run of letters with spaces, e.g. "100MG" -> "100 MG "
    return re.sub("[A-Za-z]+", lambda ele: " " + ele[0] + " ", s)

split_udf = udf(split_func)
df.withColumn("splitted", split_udf(col("size"))).show()
+-----+--------+
| size|splitted|
+-----+--------+
|100MG| 100 MG |
|  1EA|   1 EA |
|100MG| 100 MG |
+-----+--------+
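The output above keeps a trailing space after the unit, because the function pads each run of letters on both sides. If that is unwanted, one possible variant inserts a space only at the boundary between a digit and a letter. This is a sketch, and split_clean is a hypothetical name not taken from the answer.

```python
import re

def split_clean(size):
    # Zero-width match at every position preceded by a digit and
    # followed by a letter; a single space is inserted there.
    return re.sub(r"(?<=\d)(?=[A-Za-z])", " ", size)

for value in ["100MG", "1EA"]:
    print(repr(split_clean(value)))
```

It could be wrapped with udf(split_clean) and applied with withColumn in the same way as split_func.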