Home > Mobile >  Nested Row Logic - Pyspark Dataframe
Nested Row Logic - Pyspark Dataframe

Time:12-05

I have a dataframe like this. I want to get "mert" ones in Names column and the Index of the Status column to the Name column is Active. I transform this from json but I can't filter. how can I do it?

 --- ----------------------------------------- ------------------ 
|ID |Names                                    | Status           |
 --- ----------------------------------------- ------------------ 
|1  |[[[aaaa, mert], [cccc, Doe]]]            | [Active, Active] |
|2  |[[[aaa, Michael], [ggg, ]]]              | [Active, Active] |
|3  |[[[cccc, mert], [gg, Merk  ]]]           | [Suspend, Active]|
|3  |[[[dddd, Angela], [fggg, Merl]]]         | [Active, Suspend]|
 --- ----------------------------------------- ------------------ 

CodePudding user response:

In PySpark, you can filter the dataframe by using the filter method. You can use this method to keep only the rows where the "Name" column contains "mert" and the corresponding index in the "Status" column is "Active". Here is an example of how you can do this:

from pyspark.sql.functions import col

# Filter the dataframe to only keep rows where "Name" contains "mert" and "Status" is "Active"
df_filtered = df.filter((col("Names").like("%mert%")) & (col("Status") == "Active"))

CodePudding user response:

It is not clear if your data type is arrays or string. From the problem context it looks like array.

If it is array, then:

  • remove outer layers of arrays with explode() (twice)
  • zip "Names" & "Status" using arrays_zip() (so it can be refered by same index)
  • filter record if array contains required value
df = spark.createDataFrame(data=[[1,[[["aaaa","mert"],["cccc","Doe"]]],["Active","Active"]],[2,[[["aaa","Michael"],["ggg",""]]],["Active","Active"]],[3,[[["cccc","mert"],["gg","Merk  "]]],["Suspend","Active"]],[4,[[["dddd","Angela"],["fggg","Merl"]]],["Active","Suspend"]]], schema=["ID","Names","Status"])

df = df.withColumn("Names2", F.explode("Names")) \
       .withColumn("Names2", F.explode("Names2")) \
       .withColumn("Names_Status", F.arrays_zip("Names2", "Status")) \
       .filter((F.array_contains(F.col("Names_Status").getField("Names2"), "mert")) \
               & (F.array_contains(F.col("Names_Status").getField("Status"), "Active"))) \
       .drop("Names2", "Names_Status")

[Out]:
 --- ------------------------------ ----------------- 
|ID |Names                         |Status           |
 --- ------------------------------ ----------------- 
|1  |[[[aaaa, mert], [cccc, Doe]]] |[Active, Active] |
|3  |[[[cccc, mert], [gg, Merk  ]]]|[Suspend, Active]|
 --- ------------------------------ ----------------- 
  • Related