Scenario: If a number has two records, one with reason del and one with reason undel, then only the del record should be selected. If a number has only one record, whether del or undel, that record should also be selected.
Example: I have a dataframe with 2 columns as below:
number | reason |
---|---|
1234 | del |
1234 | undel |
4567 | del |
6789 | undel |
2423 | del |
2423 | undel |
My expected output is as below:
number | reason | tofilter |
---|---|---|
1234 | del | 1 |
1234 | undel | 0 |
4567 | del | 1 |
6789 | undel | 1 |
2423 | del | 1 |
2423 | undel | 0 |
Here I want to keep only those records where tofilter is 1.
CodePudding user response:
You can generate your expected output from the dataframe you provided with the following, assuming that a duplicated number has at most two rows (one del and one undel, with del first).
import pandas as pd

data = {'number': ['1234', '1234', '4567', '6789', '2423', '2423'],
        'reason': ['del', 'undel', 'del', 'undel', 'del', 'undel']}
df = pd.DataFrame(data, columns=['number', 'reason'])

# Count how many rows each number has
unique_vals = df['number'].value_counts().to_dict()

# Default every row to selected (1), then deselect the second row of each pair
df['to_filter'] = 1
for nbr, count in unique_vals.items():
    if count == 2:
        # Relies on the del row appearing before the undel row, as in the sample
        indexes = df[df['number'] == nbr].index.to_list()
        df.loc[indexes[1], 'to_filter'] = 0
which outputs:
  number reason  to_filter
0   1234    del          1
1   1234  undel          0
2   4567    del          1
3   6789  undel          1
4   2423    del          1
5   2423  undel          0
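If you prefer to avoid the explicit loop, here is a vectorized sketch; it relies only on 'del' sorting alphabetically before 'undel', so it does not depend on the input row order:

import pandas as pd

data = {'number': ['1234', '1234', '4567', '6789', '2423', '2423'],
        'reason': ['del', 'undel', 'del', 'undel', 'del', 'undel']}
df = pd.DataFrame(data, columns=['number', 'reason'])

# Sort so each number's del row comes first, flag each group's first row,
# then sort_index() realigns the flags with the original row order
df['to_filter'] = (
    df.sort_values(['number', 'reason'])
      .groupby('number')
      .cumcount()
      .eq(0)
      .astype(int)
      .sort_index()
)

# Keep only the selected rows
result = df[df['to_filter'] == 1]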
CodePudding user response:
Here is my try with window functions.
It will only work for the described scenario, where each number has at most two records, one with del and one with undel. If there are duplicates, it needs to be adjusted.
from pyspark.sql import Window
import pyspark.sql.functions as F
x = [
(1234, "del"),
(1234, "undel"),
(4567, "del"),
(6789, "undel"),
(2423, "del"),
(2423, "undel"),
]
df = spark.createDataFrame(x, schema=["number", "reason"])
# Within each number, order by reason so the del row (alphabetically first) gets row_number 1
window = Window.partitionBy("number").orderBy(F.col("reason").asc())
dfWithRowNumber = df.withColumn("row_number", F.row_number().over(window))
# Flag the first row of each group with 1 and any later row with 0
dfWithToFilterColumn = dfWithRowNumber.withColumn(
    "toFilter", F.when(F.col("row_number") == F.lit(1), F.lit(1)).otherwise(F.lit(0))
).drop("row_number")
dfWithToFilterColumn.show()
The output is:
+------+------+--------+
|number|reason|toFilter|
+------+------+--------+
|  1234|   del|       1|
|  1234| undel|       0|
|  2423|   del|       1|
|  2423| undel|       0|
|  4567|   del|       1|
|  6789| undel|       1|
+------+------+--------+
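To then keep only the rows the question asks for, you can filter on the new column afterwards, for example:

dfWithToFilterColumn.where(F.col("toFilter") == 1).show()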
CodePudding user response:
A window function is your friend. If you want to filter, do the following:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, monotonically_increasing_id, when

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # number the reasons within each group
 .where(col('x') == 1)  # keep only the first reason in each group
 .drop('x')             # drop the helper column
).show()
+------+------+
|number|reason|
+------+------+
|  1234|   del|
|  2423|   del|
|  4567|   del|
|  6789| undel|
+------+------+
If you want to display zeros and ones instead, do the following:
(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # number the reasons within each group
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))  # turn the rank into a 1/0 flag
).show()
+------+------+---+
|number|reason|  x|
+------+------+---+
|  1234|   del|  1|
|  1234| undel|  0|
|  2423|   del|  1|
|  2423| undel|  0|
|  4567|   del|  1|
|  6789| undel|  1|
+------+------+---+
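One caveat: monotonically_increasing_id() just mirrors the current physical row order, so the del row only ranks first because it happens to precede the undel row in the sample data. A sketch of a deterministic variant, ordering the window by reason itself (this works because 'del' sorts alphabetically before 'undel'):

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(col('reason').asc())))  # del ranks 1, undel ranks 2
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))
).show()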