Home > other >  Check whether a data frame comma separated column value is present in a list and based on that creat
Check whether a data frame comma separated column value is present in a list and based on that creat

Time:08-09

I have below data frame -

Id  value
A   1,2,3
B   3,4
C   5,6

I want to check if the "value" are present in below list -

B=[1,3]

and based on that, I want to create below resultant data frame -

Id  value_inscope   value_outofscope
A      1,3                 2
B      3                   4
C                        5,6

I have tried filtering or looping through the data frame. Can anyone please suggest a step forward?

CodePudding user response:

You can construct an array from list B, and then use the array_intersect and array_except functions to find the intersection and difference of the two arrays respectively.

df = df.select(
    'id',
    F.array_join(
        F.array_intersect(F.split('value', ','), F.array(*[F.lit(e).cast('string') for e in B])),
        ','
    ).alias('value_inscope'),
    F.array_join(
        F.array_except(F.split('value', ','), F.array(*[F.lit(e).cast('string') for e in B])),
        ','
    ).alias('value_outofscope')
)
df.show(truncate=False)

#  --- ------------- ---------------- 
# |id |value_inscope|value_outofscope|
#  --- ------------- ---------------- 
# |A  |1,3          |2               |
# |B  |3            |4               |
# |C  |             |5,6             |
#  --- ------------- ---------------- 

CodePudding user response:

Sounds like you want to perform a union and a [symmetric] difference between two sets.

I've emboldened the word sets there because you will have to convert your list objects to set objects before you can perform the comparison.

Here's some code to illustrate against two scalar values:

input_list = [1, 3]
comparison_list = [1, 2, 3]

values_in_scope = set(input_list) & set(comparison_list)
values_out_of_scope = set(input_list) ^ set(comparison_list)

print(f"values_in_scope: {values_in_scope}")
print(f"values_out_of_scope : {values_out_of_scope }")

This outputs:

values_in_scope: {1, 3}
values_out_of_scope : {2}

Additional note: the outputs have {} instead of [] as they are set objects. You can use the list() function to convert them back to a list should you wish e.g. list(values_in_scope)

  • Related