Is there any way to change a column based on the presence of two values in a set of values from a Databricks PySpark DataFrame?
Example:
df = spark.createDataFrame(
    [
        ('E1', 'A1', ''),
        ('E2', 'A2', ''),
        ('F1', 'A3', ''),
        ('F2', 'B1', ''),
        ('F3', 'B2', ''),
        ('G1', 'B3', ''),
        ('G2', 'C1', ''),
        ('G3', 'C2', ''),
        ('G4', 'C3', ''),
        ('H1', 'C4', ''),
        ('H2', 'D1', ''),
    ],
    ['old_comp_id', 'db_id', 'comment'],
)
We check for the presence of the values E1 and C1, and we mark those rows with a comment in both cases. The expected result should be:
df = spark.createDataFrame(
    [
        ('E1', 'A1', 'mark'),
        ('E2', 'A2', ''),
        ('F1', 'A3', ''),
        ('F2', 'B1', ''),
        ('F3', 'B2', ''),
        ('G1', 'B3', ''),
        ('G2', 'C1', 'mark'),
        ('G3', 'C2', ''),
        ('G4', 'C3', ''),
        ('H1', 'C4', ''),
        ('H2', 'D1', ''),
    ],
    ['old_comp_id', 'db_id', 'comment'],
)
To be able to use multiple workers in Databricks, I think the solution should stay within the PySpark framework and not convert to pandas at any point.
CodePudding user response:
That's a simple case/when:
from pyspark.sql import functions as F

df = df.withColumn(
    "comment",
    # mark the row when old_comp_id is E1, or when db_id is C1
    F.when(F.col("old_comp_id") == "E1", "mark")
    .when(F.col("db_id") == "C1", "mark")
    .otherwise(""),  # leave the comment empty for all other rows
)