Home > Software engineering >  How to change specific value in a row, using dictionary and conditions?
How to change specific value in a row, using dictionary and conditions?

Time:12-22

I would like to change value in one specific row based on value from other columns in the same row. I tried with method .withColumn() but it doesn't work as I wanted. What I want to get - set flag 1 when the object has date in dataframe just like in the given dictionary:

objects_ = {'x': [datetime.date(2010, 1, 1), datetime.date(2012, 1, 9), datetime.date(2012, 11, 1)], 'y': [datetime.date(2010, 5, 23), datetime.date(2002, 4, 3)]}
date flag obj
2010-01-01 0 x
2010-02-03 0 x
2010-02-04 0 x
2010-05-23 0 y
2010-10-13 0 y

to

date flag obj
2010-01-01 1 x
2010-02-03 0 x
2010-02-04 0 x
2010-05-23 1 y
2010-10-13 0 y

Dates are datetime.date type and dataframe is pyspark's.

Flag for date 2010-05-23 would be 1 only for object y (not for x too) as its listed for y key in dictionary.

I've made something like this, but I failed. Perhaps, there is some method to avoid the loops?

for x in x_list:
    for y in x_dict[x]:
        df = df.withColumn("flag", when(df["date"] == y, 1).otherwise(df["flag"]))

CodePudding user response:

EDIT: This is the solution for a Pandas dataframe. Missed that OP was using pyspark.

I don't know if this is the proper way to do this, but in cases with relatively few matching conditions, this works well. I don't think the dictionary is providing much value as you are using it as key1:[keyA,keyB] rather than key1A:value, key1B:value. If you only have a single value to match to many keys, it's much easier to use .isin(), reduce your matching dictionary to a list with single entries and match to a new column with your 2 keys also merged into a single entry.

match_list = [str(v) str(key) for key, value in objects_.items() for v in value]
DB = [Your dataframe]
DB["Merge"] = DB["Date"].astype(str) DB["obj"]
DB.loc[DB["Merge"].isin(match_list)==True,"flag"] = 1
DB.drop(columns=["Merge"],inplace=True)

Hope this helps.

CodePudding user response:

you can create case whens for each of the keys and then pass them to a coalesce.

here's the approach

mapper = [func.when((func.col('obj') == k) & func.col('date').isin(v), 1) for k, v in objects_.items()]
# [Column<'CASE WHEN ((obj = x) AND (date IN (DATE '2010-01-01', DATE '2012-01-09', DATE '2012-11-01'))) THEN 1 ELSE obj END'>,
#  Column<'CASE WHEN ((obj = y) AND (date IN (DATE '2010-05-23', DATE '2002-04-03'))) THEN 1 ELSE obj END'>]

data_sdf. \
    withColumn('flag_new', func.coalesce(*mapper, 'flag')). \
    show()

#  ---------- ---- --- -------- 
# |      date|flag|obj|flag_new|
#  ---------- ---- --- -------- 
# |2010-01-01|   0|  x|       1|
# |2010-02-03|   0|  x|       0|
# |2010-02-04|   0|  x|       0|
# |2010-05-23|   0|  y|       1|
# |2010-10-13|   0|  y|       0|
#  ---------- ---- --- -------- 
  • Related