I have the following PySpark dataframe:
| A    |
| ---- |
| 1001 |
| 1110 |
| 1000 |
I want to find all occurrences of `1` in this column and put the positions into a new column, like this:
| A    | Occurrences |
| ---- | ----------- |
| 1001 | 0,3         |
| 1110 | 0,1,2       |
| 1000 | 0           |
I tried the `locate` function as follows:
```python
from pyspark.sql import functions as F

# locate gives the 1-based position of the first match only
dfa_occ = dfa.withColumn('Occurrences', F.locate('1', F.col('A')) - 1)
```
However, this only extracts the first occurrence of `1`, while I need to find all of them!
CodePudding user response:
You could try `posexplode` and then aggregate the result back into a list. `split` in this example splits the string into single characters.
```python
from pyspark.sql import functions as F

df = spark.createDataFrame([('1001',), ('1110',), ('1000',)], ['A'])

df = (df
    # split 'A' into single characters; the lookahead '(?!$)' avoids a
    # trailing empty string, and posexplode emits (pos, col) per character
    .select('*', F.posexplode(F.split('A', '(?!$)')))
    # keep only the rows where the character is '1'
    .filter('col = 1')
    # collect the matching positions back into one list per value of A
    .groupBy('A')
    .agg(F.collect_list('pos').alias('Occurrences'))
)
df.show()
# +----+-----------+
# |   A|Occurrences|
# +----+-----------+
# |1001|     [0, 3]|
# |1000|        [0]|
# |1110|  [0, 1, 2]|
# +----+-----------+
```
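One caveat worth noting: Spark does not guarantee the element order that `collect_list` produces after the shuffle introduced by `groupBy`, so the positions are not promised to come back ascending. A minimal sketch, reusing the aggregated `df` built above, that makes the order explicit:

```python
from pyspark.sql import functions as F

# sort_array guarantees ascending positions regardless of the order
# in which collect_list happened to encounter the rows
df = df.withColumn('Occurrences', F.sort_array('Occurrences'))
```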
CodePudding user response:
Here's a way using the array higher-order functions `transform` and `filter`:
```python
from pyspark.sql import functions as F

result = df.withColumn(
    "Occurrences",
    F.filter(
        # map each character to its index when it equals '1', else null
        F.transform(F.split("A", ""), lambda x, i: F.when(x == 1, i)),
        # then drop the nulls, keeping only the matching positions
        lambda x: x.isNotNull()
    )
)
result.show()
# +----+-----------+
# |   A|Occurrences|
# +----+-----------+
# |1001|     [0, 3]|
# |1110|  [0, 1, 2]|
# |1000|        [0]|
# +----+-----------+
```
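Note that `F.transform` and `F.filter` only exist as Python functions since PySpark 3.1. On older versions you could express the same idea through `F.expr` with the SQL higher-order functions, which have been available since Spark 2.4. A minimal sketch under that assumption, where `df` is the raw dataframe with string column `A`:

```python
from pyspark.sql import functions as F

# same transform/filter logic, written as a single SQL expression;
# the SQL lambda (x, i) -> IF(x = '1', i, NULL) mirrors the Python one
result = df.withColumn(
    "Occurrences",
    F.expr("filter(transform(split(A, ''), (x, i) -> IF(x = '1', i, NULL)), x -> x IS NOT NULL)")
)
```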