I have a dataframe like this:
id  feature  value
a   aa       0.5
b   ab       0.1
a   ab       0.2
a   cc       0.3
c   ab       0.9
b   bb       1
I have, let's say, a total of 4 unique values in the feature column, and not every id has every feature. I want another dataframe where each id is paired with a vector of its feature values, with 0 wherever a feature is not present.
e.g.:
feature_list = ['aa', 'ab', 'cc', 'bb']
id  feature_vector
a   [0.5, 0.2, 0.3, 0]
b   [0, 0.1, 0, 1]
c   [0, 0.9, 0, 0]
CodePudding user response:
You can achieve the expected result by pivoting and then using array to collect the feature columns into a single array column.
See the code below:
from pyspark.sql import functions as sf

# Pivot feature into one column per distinct value, keep the first
# value per (id, feature) pair, and fill missing features with 0.
sdf_pivoted = sdf \
    .groupby("id") \
    .pivot("feature") \
    .agg(sf.first("value")) \
    .fillna(0.0)
sdf_pivoted.show()
+---+---+---+---+---+
| id| aa| ab| bb| cc|
+---+---+---+---+---+
|  c|0.0|0.9|0.0|0.0|
|  a|0.5|0.2|0.0|0.3|
|  b|0.0|0.1|1.0|0.0|
+---+---+---+---+---+
# All feature columns produced by the pivot
l_cols = [c for c in sdf_pivoted.columns if c != 'id']

sdf_pivoted \
    .select("id", sf.array(*l_cols)) \
    .show()
+---+---------------------+
| id|array(aa, ab, bb, cc)|
+---+---------------------+
|  c| [0.0, 0.9, 0.0, 0.0]|
|  a| [0.5, 0.2, 0.0, 0.3]|
|  b| [0.0, 0.1, 1.0, 0.0]|
+---+---------------------+
You can rename the array column as you need, as shown below.
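For example, aliasing the array expression directly avoids the generated column name (a minimal sketch reusing the same sdf_pivoted and l_cols from above):

# Alias the array expression to get the desired column name
sdf_pivoted \
    .select("id", sf.array(*l_cols).alias("feature_vector")) \
    .show()

Note that pivot produces the columns in sorted order (aa, ab, bb, cc), not the order of feature_list in the question, so reorder l_cols first if the exact ordering matters.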
CodePudding user response:
You can create a map and pull values from it.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('a', 'aa', 0.5),
     ('b', 'ab', 0.1),
     ('a', 'ab', 0.2),
     ('a', 'cc', 0.3),
     ('c', 'ab', 0.9),
     ('b', 'bb', 1.0)],
    ['id', 'feature', 'value'])

feature_list = ['aa', 'ab', 'cc', 'bb']

# Build one feature -> value map per id
df = df.groupBy('id').agg(F.map_from_entries(F.collect_set(F.struct('feature', 'value'))).alias('map'))
# Array of feature names, in the order given by feature_list
df = df.withColumn('arr', F.array([F.lit(x) for x in feature_list]))
# Look up every feature in the map, defaulting to 0 when absent
df = df.select('id', F.expr("transform(arr, x -> coalesce(map[x], 0)) feature_vector"))

df.show()
# +---+--------------------+
# | id|      feature_vector|
# +---+--------------------+
# |  c|[0.0, 0.9, 0.0, 0.0]|
# |  b|[0.0, 0.1, 0.0, 1.0]|
# |  a|[0.5, 0.2, 0.3, 0.0]|
# +---+--------------------+
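Unlike the pivot-based approach, this keeps the features in the order of feature_list ('aa', 'ab', 'cc', 'bb'), matching the expected output in the question. If you then need the vectors on the driver side, e.g. to feed a Python library, a minimal sketch (the vectors name is just illustrative):

# Collect the result into a plain Python dict (illustrative only)
vectors = {row['id']: row['feature_vector'] for row in df.collect()}
# vectors['a'] -> [0.5, 0.2, 0.3, 0.0]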