I have the following df:
state  interval   limits  cep_ori
PA     >=5, <=10  >10     [{111, 222}, {333, 444}]
SP     >=6, <=10  <8      [{333, 444}, {555, 666}]
Here is the schema:
root
|-- state: string (nullable = true)
|-- interval: string (nullable = true)
|-- limits: string (nullable = true)
|-- cep_ori: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- inicial: integer (nullable = true)
| | |-- final: integer (nullable = true)
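For reference, this df can be reproduced with something like the sketch below (the values are just the two sample rows above; the DDL schema string ignores the exact nullability flags):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# rough sketch: rebuild df from the sample rows shown above
df = spark.createDataFrame(
    [
        ("PA", ">=5, <=10", ">10", [(111, 222), (333, 444)]),
        ("SP", ">=6, <=10", "<8",  [(333, 444), (555, 666)]),
    ],
    "state string, interval string, limits string, cep_ori array<struct<inicial:int, final:int>>",
)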
I have another df (df2):
seller_id  cep
12222      114
33332      43
33344      338
I need to inner join where the field 'cep' falls within any interval (inicial, final) of the field 'cep_ori'.
So, the df_final I need:
state  interval   limits  cep_ori                   seller_id
PA     >=5, <=10  >10     [{111, 222}, {333, 444}]  12222
PA     >=5, <=10  >10     [{111, 222}, {333, 444}]  33344
SP     >=6, <=10  <8      [{333, 444}, {555, 666}]  33344
I tried:
df.filter(sf.size(sf.array_intersect(df.cep_ori.inicial, sf.array(df2.cep)) != 0))
  .filter(sf.size(sf.array_intersect(df.cep_ori.final, sf.array(df2.cep)) != 0))
But it didn't work. Can someone help me?
CodePudding user response:
You can convert the cep_ori column into an array that contains every integer value in the given ranges by using the transform and sequence functions.
Then join with the second dataframe and keep only the rows where cep exists in one of the ranges specified in the cep_ori column.
from pyspark.sql.functions import expr, col, array_contains

df1 = ...  # first dataframe (the one with the cep_ori column)
# expand every {inicial, final} struct into the full list of integers it covers
df1 = df1.withColumn("cep_ori_array", expr("flatten(transform(cep_ori, x -> sequence(x.inicial, x.final)))"))
df2 = spark.createDataFrame([(12222, 114), (33332, 43), (33344, 338)], ['seller_id', 'cep'])
# join, keep rows where cep is contained in the expanded ranges, then drop the helper column
df2.join(df1).where(array_contains("cep_ori_array", col("cep").cast("bigint"))).drop("cep_ori_array").show(truncate=False)
+---------+---+------------------------+---------+------+-----+
|seller_id|cep|cep_ori                 |interval |limits|state|
+---------+---+------------------------+---------+------+-----+
|12222    |114|[{222, 111}, {444, 333}]|>=5, <=10|>10   |PA   |
|33344    |338|[{222, 111}, {444, 333}]|>=5, <=10|>10   |PA   |
|33344    |338|[{444, 333}, {666, 555}]|>=6, <=10|<8    |SP   |
+---------+---+------------------------+---------+------+-----+
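If the ranges in cep_ori are wide, materialising every value with sequence can get expensive. A possible alternative, sketched below assuming Spark 2.4+ (where the exists higher-order function is available), is to keep cep_ori as-is and test the range condition directly:

from pyspark.sql.functions import expr

# sketch: cross join, then keep rows where cep falls inside at least one
# {inicial, final} struct of cep_ori, without expanding the ranges
df_final = (
    df2.crossJoin(df1)
       .where(expr("exists(cep_ori, x -> cep >= x.inicial AND cep <= x.final)"))
)
df_final.show(truncate=False)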