I would like to change value in one specific row based on value from other columns in the same row. I tried with method .withColumn()
but it doesn't work as I wanted. What I want to get - set flag 1
when the object has date in dataframe just like in the given dictionary:
objects_ = {'x': [datetime.date(2010, 1, 1), datetime.date(2012, 1, 9), datetime.date(2012, 11, 1)], 'y': [datetime.date(2010, 5, 23), datetime.date(2002, 4, 3)]}
date | flag | obj |
---|---|---|
2010-01-01 | 0 | x |
2010-02-03 | 0 | x |
2010-02-04 | 0 | x |
2010-05-23 | 0 | y |
2010-10-13 | 0 | y |
to
date | flag | obj |
---|---|---|
2010-01-01 | 1 | x |
2010-02-03 | 0 | x |
2010-02-04 | 0 | x |
2010-05-23 | 1 | y |
2010-10-13 | 0 | y |
Dates are datetime.date type and dataframe is pyspark's.
Flag for date 2010-05-23
would be 1
only for object y
(not for x
too) as its listed for y
key in dictionary.
I've made something like this, but I failed. Perhaps, there is some method to avoid the loops?
for x in x_list:
for y in x_dict[x]:
df = df.withColumn("flag", when(df["date"] == y, 1).otherwise(df["flag"]))
CodePudding user response:
EDIT: This is the solution for a Pandas dataframe. Missed that OP was using pyspark.
I don't know if this is the proper way to do this, but in cases with relatively few matching conditions, this works well. I don't think the dictionary is providing much value as you are using it as key1:[keyA,keyB]
rather than key1A:value, key1B:value
. If you only have a single value to match to many keys, it's much easier to use .isin()
, reduce your matching dictionary to a list with single entries and match to a new column with your 2 keys also merged into a single entry.
match_list = [str(v) str(key) for key, value in objects_.items() for v in value]
DB = [Your dataframe]
DB["Merge"] = DB["Date"].astype(str) DB["obj"]
DB.loc[DB["Merge"].isin(match_list)==True,"flag"] = 1
DB.drop(columns=["Merge"],inplace=True)
Hope this helps.
CodePudding user response:
you can create case whens for each of the keys and then pass them to a coalesce
.
here's the approach
mapper = [func.when((func.col('obj') == k) & func.col('date').isin(v), 1) for k, v in objects_.items()]
# [Column<'CASE WHEN ((obj = x) AND (date IN (DATE '2010-01-01', DATE '2012-01-09', DATE '2012-11-01'))) THEN 1 ELSE obj END'>,
# Column<'CASE WHEN ((obj = y) AND (date IN (DATE '2010-05-23', DATE '2002-04-03'))) THEN 1 ELSE obj END'>]
data_sdf. \
withColumn('flag_new', func.coalesce(*mapper, 'flag')). \
show()
# ---------- ---- --- --------
# | date|flag|obj|flag_new|
# ---------- ---- --- --------
# |2010-01-01| 0| x| 1|
# |2010-02-03| 0| x| 0|
# |2010-02-04| 0| x| 0|
# |2010-05-23| 0| y| 1|
# |2010-10-13| 0| y| 0|
# ---------- ---- --- --------