Separate string by white space in pyspark


I have a column with search queries represented as strings. I want to split every string into separate words.

Let's say I have this data frame:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("/content/drive/MyDrive/my_data.txt")
    



data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

data = data.withColumn("New_Data", F.array_distinct("Query"))

Z = data.drop(data.Query)

+------+-----------------------+
|AnonID|New_Data               |
+------+-----------------------+
|   142|[Big House, Green frog]|
+------+-----------------------+

And I want output like that:

+------+-------------------------+
|AnonID|New_Data                 |
+------+-------------------------+
|   142|[Big, House, Green, frog]|
+------+-------------------------+

I have searched older posts, but I could only find solutions that separate each word into a different column, which is not what I want.

CodePudding user response:

To split each string in the array into separate words, you can use the explode and split functions in Spark: explode the array so each query string becomes its own row, split the strings on whitespace and explode again, then gather the words back into an array with collect_list.

from pyspark.sql.functions import explode, split, collect_list, array_distinct

data = data.withColumn("Word", explode("New_Data"))          # one row per query string
data = data.withColumn("Word", explode(split("Word", " ")))  # one row per word
data = data.groupBy("AnonID").agg(array_distinct(collect_list("Word")).alias("New_Data"))
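
With the sample frame above, this should produce the desired result (note that collect_list does not guarantee word order):

+------+-------------------------+
|AnonID|New_Data                 |
+------+-------------------------+
|   142|[Big, House, Green, frog]|
+------+-------------------------+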

CodePudding user response:

You can do the collect_list first, then use the transform function to split each array element, flatten the result, and finally apply array_distinct. Please check out the code and output below.

import pyspark.sql.functions as F

df = spark.createDataFrame([[142, "Big House"], [142, "Big Green Frog"]], ["AnonID", "Query"])

data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

data.withColumn(
    "Query",
    F.array_distinct(F.flatten(F.transform(data["Query"], lambda x: F.split(x, " "))))
).show(2, False)

+------+-------------------------+
|AnonID|Query                    |
+------+-------------------------+
|142   |[Big, House, Green, Frog]|
+------+-------------------------+
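
Note that pyspark.sql.functions.transform with a Python lambda requires Spark 3.1 or later. On older versions (2.4/3.0) the same higher-order functions are available in SQL, so a sketch of an equivalent expression, assuming the same data frame as above, is to go through F.expr:

# Same logic written as a SQL higher-order function, for Spark versions
# where F.transform is not available in the Python API
data.withColumn(
    "Query",
    F.expr("array_distinct(flatten(transform(Query, x -> split(x, ' '))))")
).show(2, False)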