Create a duplicate fields that counts duplicate rows-CodePudding

I have the following dataframe:

I would like to create a column called "fl_dup" that shows the value '0' in case the row is unique or when it occurs for the first time. On the contrary, it should show the value '1' when the row is duplicated and occurs the second time onwards. Ideally the fl_dup column would look like this:

    A  B  FL_DUP      
0   1  1  0
1   1  2  0
2   1  1  1
3   1  1  1
4   2  2  0

I tried with this code, but unfortunately sometimes the cast doesn't work and returns 'null' values. I also can't get the '0' value for duplicate rows that appear for the first time.

  df2 = df.join(
    df.groupBy(df.columns).agg((f.count("*")>1).cast("int").alias("FL_DUP")),
    on=df.columns,
    how="left"
  )

CodePudding user response：

The following way, since you have no column for ordering rows), the order may be lost:

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [(1, 1),
     (1, 2),
     (1, 1),
     (1, 1), 
     (2, 2)],
    ['A', 'B']
)

w = W.partitionBy('A', 'B').orderBy('A')
df = df.withColumn('fl_dup', F.when(F.row_number().over(w) == 1, 0).otherwise(1))

df.show()
#  --- --- ------ 
# |  A|  B|fl_dup|
#  --- --- ------ 
# |  1|  1|     0|
# |  1|  1|     1|
# |  1|  1|     1|
# |  1|  2|     0|
# |  2|  2|     0|
#  --- --- ------

CodePudding user response：

This should do what you are asking for:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1],[1,2],[1,1],[1,1],[2,2]], columns=("A", "B"))
df['FL_DUP'] = np.where(df.duplicated(['A', 'B'], keep='first'), 1, 0)

Outputs:

   A  B  FL_DUP
0  1  1       0
1  1  2       0
2  1  1       1
3  1  1       1
4  2  2       0

See pandas.DataFrame.duplicated and numpy.where for more info.

CodePudding user response：

Create a column with all the values zero and add the column into the data frame
update the value of the columns with duplicate rows to 1

In[0]:
df.insert(2,"fl_dup", list(np.zeros(df.shape[0], dtype = int)), True)
df.loc[df.duplicated(), 'fl_dup'] = '1'
df

out[1]:

    A   B   fl_dup
0   1   1   0
1   1   2   0
2   1   1   1
3   1   1   1
4   2   2   0