Is it possible to remove rows when a value in the Block column occurs at least twice with different values in the ID column?
My data looks like this:
ID | Block |
---|---|
1 | A |
1 | C |
1 | C |
3 | A |
3 | B |
In the above case, the value A in the Block column occurs twice, with the values 1 and 3 in the ID column, so both of those rows are removed.
The expected output should be:
ID | Block |
---|---|
1 | C |
1 | C |
3 | B |
I tried to use dropDuplicates after a groupBy, but I don't know how to filter with this type of condition. It seems I would need the set of ID values for each Block value to check against.
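For reference, a minimal sketch of that set-per-Block idea (assuming the sample frame above is loaded as df; this illustrates the approach, it is not tested code): count the distinct IDs per Block, keep the Blocks that have exactly one, and join back.
from pyspark.sql import functions as F

# Blocks associated with exactly one distinct ID
single_id_blocks = (df.groupBy('Block')
                      .agg(F.countDistinct('ID').alias('n_ids'))
                      .filter('n_ids = 1')
                      .drop('n_ids'))

# keep only the rows whose Block passed the check
df.join(single_id_blocks, on='Block', how='inner').show()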
CodePudding user response:
One way to do it is using window functions. The first one (lag) marks a row if its ID differs from the previous row's within the same Block. The second (sum) spreads that mark to every row of the Block. Lastly, the marked rows and the helper (_flag) column are dropped.
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[(1, 'A'),
(1, 'C'),
(1, 'C'),
(3, 'A'),
(3, 'B')],
['ID', 'Block'])
Script:
w1 = W.partitionBy('Block').orderBy('ID')
w2 = W.partitionBy('Block')
# 1 when the ID changed compared to the previous row in the same Block
grp = F.when(F.lag('ID').over(w1) != F.col('ID'), 1).otherwise(0)
# keep a Block only if none of its rows were marked
df = df.withColumn('_flag', F.sum(grp).over(w2) == 0) \
    .filter('_flag').drop('_flag')
df.show()
# +---+-----+
# | ID|Block|
# +---+-----+
# |  3|    B|
# |  1|    C|
# |  1|    C|
# +---+-----+
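As a side note on the design, the same keep-flag can be written without lag: a Block contains a single distinct ID exactly when its minimum and maximum ID agree. A sketch of that variant, reusing the imports and df from above (the min/max formulation is a substitution of mine, not part of the answer):
w = W.partitionBy('Block')
# min == max over the Block partition means all IDs in it are equal
df.withColumn('_flag', F.min('ID').over(w) == F.max('ID').over(w)) \
    .filter('_flag').drop('_flag').show()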
CodePudding user response:
Use window functions: get ranks per group of Blocks and throw away any rows that rank higher than 1. Code below:
from pyspark.sql.functions import col, rank, row_number
from pyspark.sql.window import Window

(df
 .withColumn('index', row_number().over(Window.partitionBy().orderBy('ID', 'Block')))  # index to restore the original order after the computations
 .withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID')))  # rank per Block
 .orderBy('index')
 .where(col('BlockRank') == 1)
 .drop('index', 'BlockRank')
).show()
+---+-----+
| ID|Block|
+---+-----+
|  1|    A|
|  1|    C|
|  1|    C|
|  3|    B|
+---+-----+
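One design note: rank() gives tied rows the same rank, which is why both (1, C) rows keep rank 1 and survive the filter; swapping in row_number() would keep only a single row per Block.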