I have an array column in PySpark that looks something like this:
Parsed
[/level1/level2, /level2/level3/level4, /level1/level2]
[/level1/level2, /level2/level3]
[/level1/level2, /level1/level2/level3, /level1/level2]
[/level1/level2, /level1/level2/level3, /level1/level2]
I want something like this:
Parsed
[level1, level2, level1]
[level1, level2]
[level1, level1, level1]
Sample row
Parsed
[/president/vicepresident/seniormanager, /president/seniormanager, /manager/teamlead]
Expected output
Parsed
[president, president, manager]
Basically, I want only the first level from each element of the array.
My approach
df.filter(col("Parsed").rlike("/Level/")).show()
CodePudding user response:
You can use pd.DataFrame.explode to expand the rows containing lists of strings, then pd.Series.str.split to extract the first level, and finally group the data back by index using pd.DataFrame.groupby.
import pandas as pd

dict_ = {'levels': [['/level1/level2', '/level2/level3/level4', '/level1/level2'],
                    ['/level1/level2', '/level2/level3'],
                    ['/level1/level2', '/level1/level2/level3', '/level1/level2'],
                    ['/level1/level2', '/level1/level2/level3', '/level1/level2']]}
df = pd.DataFrame(dict_)
df['levels'].explode().str.split('/', expand=True)[1].reset_index().groupby('index')[1].apply(list)
Output, which matches the expected result:
0 [level1, level2, level1]
1 [level1, level2]
2 [level1, level1, level1]
3 [level1, level1, level1]
Name: 1, dtype: object
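If you prefer to stay row-wise, a minimal alternative sketch (same sample data, using a plain apply with a list comprehension) avoids the explode/groupby round trip entirely:

```python
import pandas as pd

df = pd.DataFrame({'levels': [['/level1/level2', '/level2/level3/level4', '/level1/level2'],
                              ['/level1/level2', '/level2/level3'],
                              ['/level1/level2', '/level1/level2/level3', '/level1/level2'],
                              ['/level1/level2', '/level1/level2/level3', '/level1/level2']]})

# Each path starts with '/', so split('/') yields an empty first token;
# index 1 is therefore the first level.
first_levels = df['levels'].apply(lambda paths: [p.split('/')[1] for p in paths])
print(first_levels.tolist())
```

This keeps one list per row throughout, at the cost of a Python-level loop per row.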
CodePudding user response:
Here's an approach in PySpark.
# input data (reconstructed from the table below)
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_ls = [(['/level1/level2', '/level2/level3/level4', '/level1/level2'],),
           (['/level1/level2', '/level2/level3'],),
           (['/level1/level2', '/level1/level2/level3', '/level1/level2'],),
           (['/level1/level2', '/level1/level2/level3', '/level1/level2'],)]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['arr_col'])
# +-------------------------------------------------------+
# |arr_col                                                |
# +-------------------------------------------------------+
# |[/level1/level2, /level2/level3/level4, /level1/level2]|
# |[/level1/level2, /level2/level3]                       |
# |[/level1/level2, /level1/level2/level3, /level1/level2]|
# |[/level1/level2, /level1/level2/level3, /level1/level2]|
# +-------------------------------------------------------+
data_sdf. \
    withColumn('rn', func.row_number().over(wd.orderBy(func.lit(1)))). \
    withColumn('arr_col_exploded', func.explode('arr_col')). \
    withColumn('arr_col_exp_split', func.split('arr_col_exploded', '/')). \
    withColumn('first_element', func.col('arr_col_exp_split')[1]). \
    groupBy('rn', 'arr_col'). \
    agg(func.collect_list('first_element').alias('first_element_arr')). \
    show(truncate=False)
# +---+-------------------------------------------------------+------------------------+
# |rn |arr_col                                                |first_element_arr       |
# +---+-------------------------------------------------------+------------------------+
# |1  |[/level1/level2, /level2/level3/level4, /level1/level2]|[level1, level2, level1]|
# |2  |[/level1/level2, /level2/level3]                       |[level1, level2]        |
# |3  |[/level1/level2, /level1/level2/level3, /level1/level2]|[level1, level1, level1]|
# |4  |[/level1/level2, /level1/level2/level3, /level1/level2]|[level1, level1, level1]|
# +---+-------------------------------------------------------+------------------------+
The rn column helps with grouping in case there are duplicate input arrays. The idea is to explode the input array and then split the exploded elements, which creates an array of the tokens that were delimited by '/'. Once split, we pull out the second element (which is actually the first level), because the first will be an empty string (due to the leading '/'). Finally, collect_list builds an array of the first elements.
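A quick plain-Python check (outside Spark) illustrates why the answer indexes [1] rather than [0]; Python's str.split mirrors the behavior of func.split on a leading delimiter:

```python
# Splitting a path that starts with '/' produces an empty first token,
# so the first level sits at index 1, not index 0.
path = '/level1/level2'
tokens = path.split('/')
print(tokens)     # ['', 'level1', 'level2']
print(tokens[1])  # 'level1'
```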