Home > Net >  Find the value in intervals arrays in pyspark
Find the value in intervals arrays in pyspark

Time:01-31

I have the follow df:

state   interval      limits       cep_ori
PA     >=5, <=10      >10    [{111, 222}, {333, 444}]
SP     >=6, <=10      <8     [{333, 444}, {555, 666}]

Follow the Schema:

root
 |-- state: string (nullable = true)
 |-- interval: string (nullable = true)
 |-- limits: string (nullable = true)
 |-- cep_ori: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- inicial: integer (nullable = true)
 |    |    |-- final: integer (nullable = true)

I have another df (df2):

seller_id  cep
12222      114
33332      43
33344      338

I need to inner join if the field 'cep' contains in any interval (inicial, final) of field 'cep_ori'.

So, the df_final I need:

state   interval      limits       cep_ori              seller_id
PA     >=5, <=10      >10    [{111, 222}, {333, 444}]     12222
PA     >=5, <=10      >10    [{111, 222}, {333, 444}]     33344
SP     >=6, <=10      <8     [{333, 444}, {555, 666}]     33344

I tried:

df.filter(sf.size(sf.array_intersect(df.cep_ori.inicial, sf.array(df2.cep)) != 0))
.filter(sf.size(sf.array_intersect(df.cep_ori.final,sf.array(df2.cep)) != 0))

But, don't worked. Can someone help me?

CodePudding user response:

You can convert cep_ori column into array which includes all integer values in mentioned range by using transform and sequence functions. Then join with second dataframe and filter the values where cep exists in the range specified in cep_ori column.

df1 = # first dataframe
df1=df1.withColumn("cep_ori_array", expr("flatten(transform(cep_ori, x->sequence(x.inicial,x.final)))"))
df2=spark.createDataFrame([(12222,114),(33332,43),(33344,338)],['seller_id', 'cep'])
df2.join(df1).where(array_contains("cep_ori_array", col("cep").cast("bigint"))==True).drop("cep_ori_array").show(truncate=False)

 --------- --- ------------------------ --------- ------ ----- 
|seller_id|cep|cep_ori                 |interval |limits|state|
 --------- --- ------------------------ --------- ------ ----- 
|12222    |114|[{222, 111}, {444, 333}]|>=5, <=10|>10   |PA   |
|33344    |338|[{222, 111}, {444, 333}]|>=5, <=10|>10   |PA   |
|33344    |338|[{444, 333}, {666, 555}]|>=6, <=10|<8    |SP   |
 --------- --- ------------------------ --------- ------ ----- 
  • Related