Home > Net >  Pandas / Pyspark Add Column to show duplicated values over condition
Pandas / Pyspark Add Column to show duplicated values over condition

Time:07-28

If like to add a column in my dataframe to mark if a condition is meet or not but I do not how to tackle the problem.

Lets suppose we have Students with the same name, over the dataframe, but I want to mark them down only if they have a grade over 5, and ignore the ones with grade under 5.

Like this :

Before:

https://i.stack.imgur.com/mpI1L.png

After:

https://i.stack.imgur.com/qPOEp.png

I can use Pandas or Pyspark as libraries.

CodePudding user response:

First, you calculate if a name surname has a mark over 5. This can be done by calculating the maximum mark for that name surname. Using that info, flag the records based on the condition.

data_sdf. \
    withColumn('name_max_mark', func.max('mark').over(wd.partitionBy('name', 'surname'))). \
    withColumn('dup_name', (func.col('name_max_mark') >= 5)). \
    show()

#  ---- ------- ---- ------------- -------- 
# |name|surname|mark|name_max_mark|dup_name|
#  ---- ------- ---- ------------- -------- 
# |Mark|    Doe|   3|            3|   false|
# |Mark|    Doe|   1|            3|   false|
# | Tom|    Doe|   6|            6|    true|
# | Tom|    Doe|   2|            6|    true|
#  ---- ------- ---- ------------- -------- 

select() only the columns you require, i.e. drop name_max_mark while storing to a variable.

CodePudding user response:

Input:

df = pd.DataFrame({
    "Name": ["Tom", "Mark", "Tom", "Mark", "Jim", "Jim"],
    "Surname": ["Doe"] * 6,
    "Mark": [6, 3, 2, 1, 2, 3]
})

Flag rows where values are greater than 5

df.loc[:, "Duplicated_Name"]  = (df.Mark > 5)

If one row is greater than 5, mark other Duplicated_Name to True

df.loc[df.Name.isin(df.loc[df["Duplicated_Name"] ==True, "Name"]), "Duplicated_Name"] = True

Result

   Name Surname  Mark   Duplicated_Name
0   Tom     Doe     6   True
1  Mark     Doe     3  False
2   Tom     Doe     2   True
3  Mark     Doe     1  False
4   Jim     Doe     2  False
5   Jim     Doe     3  False
  • Related