At the moment I'm doing this in a PySpark query:
df_xyz = df_abc.filter((df_abc.field1.like('%X85%')) | (df_abc.field1.like('%X86%')) | (df_abc.field1.like('%X87%')))
Which is fine when I'm filtering for 3 ICD10 codes.
But what I want to do is:
Take all codes from X85 - X99
Take all codes from Y0 - Y99
(I think) something like:
ICD10_codes = ((df_abc.field1.between %X85% - %X9%),(df_abc.field1.like%Y%))
df_xyz = df_abc.filter (df_abc.field1 in ICD10_codes)
Essentially: I want to assign a load of ICD10 codes to a variable and then use that variable in my filter query, because I have to replicate the same filter over different table names more than 2 or 3 times.
CodePudding user response:
You can create like conditions dynamically by looping through the list ICD10_codes:
from pyspark.sql import functions as F
df = spark.createDataFrame([('_X85',), ('X',)], ['field1'])
ICD10_codes = ['X85', 'X86', 'X87']
conditions = F.lit(False)  # start from an always-false condition
for x in ICD10_codes:
    conditions |= F.col('field1').like(f'%{x}%')  # OR in one LIKE per code
df.filter(conditions).show()
# +------+
# |field1|
# +------+
# |  _X85|
# +------+
You can also use rlike:
from pyspark.sql import functions as F
df = spark.createDataFrame([('_X85',), ('X',)], ['field1'])
ICD10_codes = ['X85', 'X86', 'X87']
# '|'.join gives the pattern 'X85|X86|X87', which matches any code anywhere in the string
df.filter(F.col('field1').rlike('|'.join(ICD10_codes))).show()
# +------+
# |field1|
# +------+
# |  _X85|
# +------+