If like to add a column in my dataframe to mark if a condition is meet or not but I do not how to tackle the problem.
Lets suppose we have Students with the same name, over the dataframe, but I want to mark them down only if they have a grade over 5, and ignore the ones with grade under 5.
Like this :
Before:
https://i.stack.imgur.com/mpI1L.png
After:
https://i.stack.imgur.com/qPOEp.png
I can use Pandas or Pyspark as libraries.
CodePudding user response:
First, you calculate if a name surname
has a mark over 5. This can be done by calculating the maximum mark for that name surname
. Using that info, flag the records based on the condition.
data_sdf. \
withColumn('name_max_mark', func.max('mark').over(wd.partitionBy('name', 'surname'))). \
withColumn('dup_name', (func.col('name_max_mark') >= 5)). \
show()
# ---- ------- ---- ------------- --------
# |name|surname|mark|name_max_mark|dup_name|
# ---- ------- ---- ------------- --------
# |Mark| Doe| 3| 3| false|
# |Mark| Doe| 1| 3| false|
# | Tom| Doe| 6| 6| true|
# | Tom| Doe| 2| 6| true|
# ---- ------- ---- ------------- --------
select()
only the columns you require, i.e. drop name_max_mark
while storing to a variable.
CodePudding user response:
Input:
df = pd.DataFrame({
"Name": ["Tom", "Mark", "Tom", "Mark", "Jim", "Jim"],
"Surname": ["Doe"] * 6,
"Mark": [6, 3, 2, 1, 2, 3]
})
Flag rows where values are greater than 5
df.loc[:, "Duplicated_Name"] = (df.Mark > 5)
If one row is greater than 5, mark other Duplicated_Name to True
df.loc[df.Name.isin(df.loc[df["Duplicated_Name"] ==True, "Name"]), "Duplicated_Name"] = True
Result
Name Surname Mark Duplicated_Name
0 Tom Doe 6 True
1 Mark Doe 3 False
2 Tom Doe 2 True
3 Mark Doe 1 False
4 Jim Doe 2 False
5 Jim Doe 3 False