PySpark dynamic like operations

I have a PySpark dataframe like this:

data = [{"ID": 1, "Value": 3478},
        {"ID": 2, "Value": 10},
        {"ID": 3, "Value": 3323},
        {"ID": 1, "Value": 2300},
        {"ID": 2, "Value": 40},
        {"ID": 3, "Value": 93},
        {"ID": 1, "Value": 500},
        {"ID": 2, "Value": 50},
        {"ID": 3, "Value": 73}]

df = spark.createDataFrame(data)
df.show(20, False)
# +---+-----+
# |ID |Value|
# +---+-----+
# |1  |3478 |
# |2  |10   |
# |3  |3323 |
# |1  |2300 |
# |2  |40   |
# |3  |93   |
# |1  |500  |
# |2  |50   |
# |3  |73   |
# +---+-----+

I have another Spark dataframe like this:

final_data = [{"ID": 1, "Value": 1234563478},
              {"ID": 2, "Value": 2134510},
              {"ID": 3, "Value": 789033323}]

final_df = spark.createDataFrame(final_data)
final_df.show(5, False)
# +---+----------+
# |ID |Value     |
# +---+----------+
# |1  |1234563478|
# |2  |2134510   |
# |3  |789033323 |
# +---+----------+

Now my requirement is to filter final_df based on the values in df.

from pyspark.sql import functions as F

final_df.filter(F.col("ID") == "1").where(F.col("Value").like('478%') | F.col("ID").like('#00%') | F.col("ID").like('P0%')).show(5, False)
# +---+----------+
# |ID |Value     |
# +---+----------+
# |1  |1234563478|
# +---+----------+

Here I have manually passed the df values into the like conditions. Is there a dynamic way to pull the values from df when filtering final_df? Today I only have 3 values matching ID == 1; tomorrow I might have 10. How can I build the like conditions dynamically on a PySpark dataframe?

CodePudding user response:

Not sure what you really expect. Let's try using lists and f-strings if you are after declaring variables and having them picked up dynamically. If this isn't what you are after, let me know.

s = '3478'
s1 = '2300'
s2 = '500'
lst = ['1']
from pyspark.sql import functions as F

final_df.filter(F.col("ID").isin(lst)).where(F.col("Value").like(f'%{s}%') | F.col("Value").like(f'%{s1}%') | F.col("Value").like(f'%{s2}%')).show(5, False)

CodePudding user response:

You may need some Python to build a filter with many dynamic conditions. I assume your df is relatively small, so I use collect.

# Collect the match values per ID from the (small) df
filter_list = df.groupBy('ID').agg(F.collect_set('Value').alias('vals')).collect()

# Build one SQL predicate per ID: ID=x AND (Value LIKE '%v1%' OR Value LIKE '%v2%' ...)
row_filter = []
for r in filter_list:
    s = f"ID={r['ID']} AND "
    s += '(' + ' OR '.join([f"Value LIKE '%{v}%'" for v in r['vals']]) + ')'
    row_filter.append(s)

# OR the per-ID predicates together and apply them to final_df
final_df.filter(f"({') OR ('.join(row_filter)})")