PySpark, save unique letters in strings of a column-CodePudding

I'm using PySpark, and I want a simple way of doing the next process without it being overcomplicated.

Suppose I have a table that looks like this:

ID	Letters
1	a,b,c,d
2	b,d,b
3	c,y,u

I want to get the unique letters in this dataframe from the column "Letters", this would be: List = [a,b,c,d,y,u].

I tried using the in operator, I don't really know how to iterate through each register, but I don't wanna make a mess because the original plan is for a big dataset.

CodePudding user response：

You can try something like this:

import pyspark.sql.functions as F

data1 = [
    [1, "a,b,c,d"],
    [2, "b,d,b"],
    [3, "c,y,u"],
]

df = spark.createDataFrame(data1).toDF("ID", "Letters")

dfWithDistinctValues = df.select(
    F.array_distinct(
        F.flatten(F.collect_set(F.array_distinct(F.split(df.Letters, ","))))
    ).alias("unique_letters")
)

defaultValues = [
    data[0] for data in dfWithDistinctValues.select("unique_letters").collect()
]

print(defaultValues)

What is happening here:

First i am splitting string by "," with F.split and droping duplicates at row level with F.array_distinct
I am using collect_set to get all distinct arrays into one row which is array of arrays at this stage and it looks like this:

[[b, d], [a, b, c, d], [c, y, u]]
Then i am using flatten to get all values as separate strings:

[b, d, a, b, c, d, c, y, u]
There are still some duplicates which are removed by array_distinct so at the end the output looks like this:

[b, d, a, c, y, u]

CodePudding user response：

Depending on how large your dataset and your arrays are (if they are very large, this might not be the route you want to take), you can use the explode function to easily get what you want:

from pyspark.sql.functions import explode

df = spark.createDataFrame(
    [
        (1, ["a", "b", "c", "d"]),
        (2, ["b", "d", "b"]),
        (3, ["c", "y", "u"])
    ],
    ["ID", "Letters"]
)

# Creating a dataframe with 1 column, "letters", with distinct values per row
uniqueLettersDf = df.select(explode("Letters").alias("letters")).distinct()

# Using list comprehension and the .collect() method to turn our dataframe into a Python list
output = [row['letters'] for row in uniqueLettersDf.collect()]

output
['d', 'c', 'b', 'a', 'y', 'u']