I have the following PySpark DataFrame:
+----------------------+
| Paths                |
+----------------------+
|[link1, link2, link3] |
|[link1, link2, link4] |
|[link1, link2, link3] |
|[link1, link2, link4] |
|...                   |
+----------------------+
I want to encode the paths as a categorical variable and add this information to the DataFrame. The result should look something like this:
+----------------------+--------------+
| Paths                | encodedPaths |
+----------------------+--------------+
|[link1, link2, link3] | 1            |
|[link1, link2, link4] | 2            |
|[link1, link2, link3] | 1            |
|[link1, link2, link4] | 2            |
|...                   |              |
+----------------------+--------------+
Looking around, I found this solution:
import pyspark.sql.functions as F

# assign one ID per distinct path, then join the IDs back onto the original rows
indexer = pathsDF.select("Paths").distinct().withColumn("encodedPaths", F.monotonically_increasing_id())
pathsDF = pathsDF.join(indexer, "Paths")
It should work, but the number of distinct paths is not the same in the original and the resulting DataFrames. On top of that, some values in the encoded column are significantly higher than the number of distinct paths, which should not be possible if monotonically_increasing_id incremented linearly, as I assumed it does. Do you have other solutions?
CodePudding user response:
You can use StringIndexer from MLlib after casting the array column to a string:
from pyspark.ml.feature import StringIndexer
import pyspark.sql.functions as F

# StringIndexer expects a string column, so stringify the array first
df2 = df.withColumn("PathsStr", F.col("Paths").cast("string"))
# or: df2 = df.withColumn("PathsStr", F.concat_ws(",", "Paths"))

stringIndexer = StringIndexer(inputCol="PathsStr", outputCol="encodedPaths")

# StringIndexer assigns 0-based labels, so add 1 to start the codes at 1
out = stringIndexer.fit(df2).transform(df2)\
    .withColumn("encodedPaths", F.col("encodedPaths") + 1)\
    .select(*df.columns, "encodedPaths")
out.show(truncate=False)
out.show(truncate=False)
+---------------------+------------+
|Paths                |encodedPaths|
+---------------------+------------+
|[link1, link2, link3]|1.0         |
|[link1, link2, link4]|2.0         |
|[link1, link2, link3]|1.0         |
|[link1, link2, link4]|2.0         |
+---------------------+------------+
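As a quick sanity check, the number of distinct codes should now match the number of distinct paths:

out.select("Paths").distinct().count() == out.select("encodedPaths").distinct().count()

Regarding the original attempt: monotonically_increasing_id guarantees unique, monotonically increasing IDs, but not consecutive ones. The current implementation puts the partition ID in the upper 31 bits of the generated value, so IDs jump by huge amounts between partitions, which is why some codes end up far above the number of distinct paths. The function is also non-deterministic, so if the plan is re-evaluated during the join, the same path can receive different IDs, which would explain the mismatched distinct counts. If you want consecutive integer codes without MLlib, here is a minimal sketch using dense_rank over a global window (be aware that a window without partitionBy pulls all rows onto a single partition, so this is only advisable for modest data sizes):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# equal paths receive equal ranks, and dense_rank leaves no gaps,
# so the codes run 1, 2, 3, ... across the distinct paths
w = Window.orderBy(F.col("Paths").cast("string"))
pathsDF = pathsDF.withColumn("encodedPaths", F.dense_rank().over(w))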