How to check if I have two values in a set of values of a table in Databricks PySpark


Is there any way to change a column based on the presence of two values in a set of values from a Databricks PySpark DataFrame?

Example:

df = spark.createDataFrame(
    [
        ('E1', 'A1', ''),
        ('E2', 'A2', ''),
        ('F1', 'A3', ''),
        ('F2', 'B1', ''),
        ('F3', 'B2', ''),
        ('G1', 'B3', ''),
        ('G2', 'C1', ''),
        ('G3', 'C2', ''),
        ('G4', 'C3', ''),
        ('H1', 'C4', ''),
        ('H2', 'D1', ''),
    ],
    ['old_comp_id', 'db_id', 'comment'],
)

We check for the presence of the values E1 (in old_comp_id) and C1 (in db_id), mark each matching row with a comment, and the expected result should be:

df = spark.createDataFrame(
    [
        ('E1', 'A1', 'mark'),
        ('E2', 'A2', ''),
        ('F1', 'A3', ''),
        ('F2', 'B1', ''),
        ('F3', 'B2', ''),
        ('G1', 'B3', ''),
        ('G2', 'C1', 'mark'),
        ('G3', 'C2', ''),
        ('G4', 'C3', ''),
        ('H1', 'C4', ''),
        ('H2', 'D1', ''),
    ],
    ['old_comp_id', 'db_id', 'comment'],
)

To be able to use multiple workers in Databricks, I think the solution should only use the PySpark framework and not convert to Pandas at any point.
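
As an aside, the presence check itself (do both values occur anywhere in the table?) can also stay entirely in Spark. A minimal sketch, assuming the example DataFrame above is bound to df; the names n_e1, n_c1 and both_present are illustrative:

from pyspark.sql import functions as F

# Count, in a single distributed aggregation, how many rows carry each value.
counts = df.agg(
    F.count(F.when(F.col("old_comp_id") == "E1", 1)).alias("n_e1"),
    F.count(F.when(F.col("db_id") == "C1", 1)).alias("n_c1"),
).first()

# Both values are present when each count is non-zero.
both_present = counts["n_e1"] > 0 and counts["n_c1"] > 0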

CodePudding user response:

That's a simple case/when:

from pyspark.sql import functions as F

# Overwrite the comment column: mark rows where either condition matches,
# otherwise keep an empty string.
df = df.withColumn(
    "comment",
    F.when(F.col("old_comp_id") == "E1", "mark")
    .when(F.col("db_id") == "C1", "mark")
    .otherwise(""),
)
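
If the values to match live in larger sets rather than two single literals, the same pattern extends with isin. A minimal sketch, assuming hypothetical lists old_ids and db_ids holding the values to look for:

from pyspark.sql import functions as F

# Hypothetical sets of values to flag; replace with the actual sets.
old_ids = ["E1"]
db_ids = ["C1"]

df = df.withColumn(
    "comment",
    F.when(
        F.col("old_comp_id").isin(old_ids) | F.col("db_id").isin(db_ids),
        "mark",
    ).otherwise(""),
)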