Using PySpark in DAE Databricks - How do I filter a comma-separated field to find any one of a number of codes?


At the moment I'm doing this in a PySpark query:

df_xyz = df_abc.filter((df_abc.field1.like('%X85%')) | (df_abc.field1.like('%X86%')) | (df_abc.field1.like('%X87%')))

Which is fine when I'm only filtering for 3 ICD10 codes.

But what I want to do is:
Take all codes from X85 - X99
Take all codes from Y0 - Y99

(I think) something like:

ICD10_codes = ((df_abc.field1.between %X85% - %X9%),(df_abc.field1.like%Y%))

df_xyz = df_abc.filter (df_abc.field1 in ICD10_codes)

Essentially: I want to assign a load of ICD10 codes to a variable and then use that variable in my filter query, because I have to replicate the same filter across several different tables.

CodePudding user response:

You can build the like conditions dynamically by looping through the list ICD10_codes:

from pyspark.sql import functions as F
df = spark.createDataFrame([('_X85',), ('X',)], ['field1'])

ICD10_codes = ['X85', 'X86', 'X87']

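# Start from a False literal, then OR in one like() condition per code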
conditions = F.lit(False)
for x in ICD10_codes:
    conditions |= F.col('field1').like(f'%{x}%')

df.filter(conditions).show()
# +------+
# |field1|
# +------+
# |  _X85|
# +------+
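
To cover the full X85-X99 and Y00-Y99 ranges from the question, you don't have to type every code out. A minimal sketch that generates the list first (assuming each code is a letter followed by two digits):

# X85-X99 plus Y00-Y99, built programmatically
ICD10_codes = [f'X{i}' for i in range(85, 100)] + [f'Y{i:02d}' for i in range(100)]

The loop above then builds the combined condition from this longer list in exactly the same way.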

You can also use rlike:

from pyspark.sql import functions as F
df = spark.createDataFrame([('_X85',), ('X',)], ['field1'])

ICD10_codes = ['X85', 'X86', 'X87']

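# '|'.join builds the regex 'X85|X86|X87', which matches any code as a substring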
df.filter(F.col('field1').rlike('|'.join(ICD10_codes))).show()
# +------+
# |field1|
# +------+
# |  _X85|
# +------+
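
Since the question also mentions reusing the same filter across several tables, one option (a sketch, assuming every table uses the column name field1, which may differ in your data) is to wrap the condition in a small helper function:

from functools import reduce
from pyspark.sql import functions as F

# Hypothetical helper: OR together one like() condition per code, then filter
def filter_icd10(df, codes, col_name='field1'):
    cond = reduce(lambda acc, c: acc | F.col(col_name).like(f'%{c}%'), codes, F.lit(False))
    return df.filter(cond)

df_xyz = filter_icd10(df_abc, ICD10_codes)  # df_abc as in the question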