I have a dataframe like this:
id  feature  value
a   aa       0.5
b   ab       0.1
a   ab       0.2
a   cc       0.3
c   ab       0.9
b   bb       1
I have, let's say, a total of 4 unique values in the feature column, and not every id has every feature. I want another dataframe where each id is paired with a vector of its feature values, with 0 wherever a feature is not present.
e.g.:
feature_list = ['aa', 'ab', 'cc', 'bb']
id  feature_vector
a   [0.5, 0.2, 0.3, 0]
b   [0, 0.1, 0, 1]
c   [0, 0.9, 0, 0]
CodePudding user response:
You can achieve the expected result by pivoting and then using array to collect the feature columns into a single array column.
See the code below:
from pyspark.sql import functions as sf

# Pivot feature into one column per distinct value, keep the first
# value per (id, feature) pair, and fill missing features with 0.
sdf_pivoted = sdf \
    .groupby("id") \
    .pivot("feature") \
    .agg(sf.first("value")) \
    .fillna(0.0)
sdf_pivoted.show()
+---+---+---+---+---+
| id| aa| ab| bb| cc|
+---+---+---+---+---+
|  c|0.0|0.9|0.0|0.0|
|  a|0.5|0.2|0.0|0.3|
|  b|0.0|0.1|1.0|0.0|
+---+---+---+---+---+
# All feature columns produced by the pivot
l_cols = [c for c in sdf_pivoted.columns if c != 'id']

sdf_pivoted \
    .select("id", sf.array(*l_cols)) \
    .show()
+---+---------------------+
| id|array(aa, ab, bb, cc)|
+---+---------------------+
|  c| [0.0, 0.9, 0.0, 0.0]|
|  a| [0.5, 0.2, 0.0, 0.3]|
|  b| [0.0, 0.1, 1.0, 0.0]|
+---+---------------------+
You can rename the array column as you need, as shown below.
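For example, aliasing the array expression directly avoids the generated column name (a minimal sketch reusing the same sdf_pivoted and l_cols from above):

# Alias the array expression to get the desired column name
sdf_pivoted \
    .select("id", sf.array(*l_cols).alias("feature_vector")) \
    .show()

Note that pivot produces the columns in sorted order (aa, ab, bb, cc), not the order of feature_list in the question, so reorder l_cols first if the exact ordering matters.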
CodePudding user response:
You can create a map and pull values from it.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('a', 'aa', 0.5),
     ('b', 'ab', 0.1),
     ('a', 'ab', 0.2),
     ('a', 'cc', 0.3),
     ('c', 'ab', 0.9),
     ('b', 'bb', 1.0)],
    ['id', 'feature', 'value'])

feature_list = ['aa', 'ab', 'cc', 'bb']

# Build one feature -> value map per id
df = df.groupBy('id').agg(F.map_from_entries(F.collect_set(F.struct('feature', 'value'))).alias('map'))
# Array of feature names, in the order given by feature_list
df = df.withColumn('arr', F.array([F.lit(x) for x in feature_list]))
# Look up every feature in the map, defaulting to 0 when absent
df = df.select('id', F.expr("transform(arr, x -> coalesce(map[x], 0)) feature_vector"))

df.show()
# +---+--------------------+
# | id|      feature_vector|
# +---+--------------------+
# |  c|[0.0, 0.9, 0.0, 0.0]|
# |  b|[0.0, 0.1, 0.0, 1.0]|
# |  a|[0.5, 0.2, 0.3, 0.0]|
# +---+--------------------+
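Unlike the pivot-based approach, this keeps the features in the order of feature_list ('aa', 'ab', 'cc', 'bb'), matching the expected output in the question. If you then need the vectors on the driver side, e.g. to feed a Python library, a minimal sketch (the vectors name is just illustrative):

# Collect the result into a plain Python dict (illustrative only)
vectors = {row['id']: row['feature_vector'] for row in df.collect()}
# vectors['a'] -> [0.5, 0.2, 0.3, 0.0]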