Need to split a string containing numbers and alphabets into two


I'm trying to split the values of a column in a PySpark DataFrame. The column size has values such as '15ML' and '20GM'. I want them split so that the output values become '15 ML' and '20 GM'; in other words, I'm trying to separate the numerical value from its unit. Please help me by providing a solution for this issue.

+--------------+-------------------------+
|size          |new_size(after splitting)|
+--------------+-------------------------+
|         100MG|                   100 MG|
|           1EA|                     1 EA|
|         100MG|                   100 MG|
+--------------+-------------------------+

I have included the sample data and the final column format I require above. Thanks in advance.

I tried the code below, but it did not give the proper result.

from pyspark.sql.functions import split

df_f = products_size_df.withColumn("new_size", split(products_size_df.size, "MG"))

CodePudding user response:

If you just want to add a space between the value and its unit, you can use regexp_replace like this:

products_size_df.withColumn("new_size", regexp_replace(products_size_df.size, '(\d )', '$1 '))

$1 refers to the first capture group, so you simply match the numeric value and add a space after it.
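
For reference, here is a minimal end-to-end sketch of this approach (the SparkSession setup and the DataFrame construction are assumptions based on the sample data in the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([["100MG"], ["1EA"], ["100MG"]], ["size"])

# insert a space after the leading run of digits
df.withColumn("new_size", regexp_replace(df.size, r'(\d+)', '$1 ')).show()

# +-----+--------+
# | size|new_size|
# +-----+--------+
# |100MG|  100 MG|
# |  1EA|    1 EA|
# |100MG|  100 MG|
# +-----+--------+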


If you need to create separate columns for the value and the unit, you can use regexp_extract like this:

products_size_df.withColumn("new_value", regexp_extract(products_size_df.size, '(\d*)', 1))

products_size_df.withColumn("new_unit", regexp_extract(products_size_df.size, '([A-Za-z] )', 1))

Note that regexp_extract returns an empty string instead of NULL when the group is not found; NULL is returned only when the column value itself is NULL.
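
A quick illustration of that caveat (the input rows below are invented for demonstration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.master("local[*]").getOrCreate()
demo = spark.createDataFrame([("MG",), (None,)], ["size"])

# r'(\d*)' matches an empty string when there are no digits,
# so "MG" yields "" while a NULL input stays NULL
demo.withColumn("new_value", regexp_extract(demo.size, r'(\d*)', 1)).show()

# +----+---------+
# |size|new_value|
# +----+---------+
# |  MG|         |
# |null|     null|
# +----+---------+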

CodePudding user response:

You can use a udf to split the numeric and alphabetic characters in a string:

import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [["100MG"], ["1EA"], ["100MG"]]
df = spark.createDataFrame(data).toDF("size")

# surround each run of letters with spaces, e.g. "100MG" -> "100 MG "
def split_func(s):
    return re.sub("[A-Za-z]+", lambda ele: " " + ele[0] + " ", s)

split_udf = udf(split_func)

df.withColumn("splitted", split_udf(col("size"))).show()

+-----+--------+
| size|splitted|
+-----+--------+
|100MG| 100 MG |
|  1EA|   1 EA |
|100MG| 100 MG |
+-----+--------+
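
Note the trailing space in the output, which the lambda adds after each unit. If you need exactly '100 MG', one option (an addition of mine, not part of the original answer) is to wrap the UDF result in Spark's built-in trim:

from pyspark.sql.functions import trim

df.withColumn("splitted", trim(split_udf(col("size")))).show()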