How to filter a dataframe based on a specific condition?


Scenario: if a number has two records, one with reason del and one with undel, only the del record should be selected. If a number has only one record, whether del or undel, it should also be selected.

Example: I have a dataframe with two columns, as below:

number  reason
1234    del
1234    undel
4567    del
6789    undel
2423    del
2423    undel

My expected output is as below:

number  reason  tofilter
1234    del     1
1234    undel   0
4567    del     1
6789    undel   1
2423    del     1
2423    undel   0

I then want to keep only those records where tofilter is 1.

CodePudding user response:

You can generate your expected output from the dataframe you provided with the following, assuming a duplicated number has at most two records.

import pandas as pd

data = {'number': ['1234', '1234', '4567', '6789', '2423', '2423'],
        'reason': ['del', 'undel', 'del', 'undel', 'del', 'undel']}
df = pd.DataFrame(data, columns=['number', 'reason'])

# Count how many records each number has
unique_vals = df.number.value_counts().to_dict()

# Create the flag column up front so every branch below can assign to it
df['to_filter'] = 0

for nbr, count in unique_vals.items():
    # Row labels of every record belonging to this number
    indexes = df.index[df['number'] == nbr].to_list()
    if count == 2:
        # Assumes the del record comes before the undel record
        df.loc[indexes[0], 'to_filter'] = 1
        df.loc[indexes[1], 'to_filter'] = 0
    elif count == 1:
        df.loc[indexes[0], 'to_filter'] = 1

which outputs:

  number reason  to_filter
0   1234    del          1
1   1234  undel          0
2   4567    del          1
3   6789  undel          1
4   2423    del          1
5   2423  undel          0
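
To actually keep only the flagged records, a final boolean-mask selection (standard pandas, not part of the loop above) finishes the job:

# Keep only the rows flagged with 1
filtered = df[df['to_filter'] == 1]
print(filtered)

For what it's worth, the same flag can also be built without an explicit loop; a minimal vectorized sketch of the same rule, assuming at most one del record per number:

# 1 where the reason is del, or where the number occurs only once
single = df['number'].map(df['number'].value_counts()) == 1
df['to_filter'] = ((df['reason'] == 'del') | single).astype(int)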

CodePudding user response:

Here is my attempt with window functions.

It only works for the described scenario, where each number has at most two records, one with del and one with undel. If there are duplicate reasons, it needs to be adjusted.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

x = [
    (1234, "del"),
    (1234, "undel"),
    (4567, "del"),
    (6789, "undel"),
    (2423, "del"),
    (2423, "undel"),
]
df = spark.createDataFrame(x, schema=["number", "reason"])

# del sorts before undel alphabetically, so the del record gets row_number 1
window = Window.partitionBy("number").orderBy(F.col("reason").asc())
dfWithRowNumber = df.withColumn("row_number", F.row_number().over(window))
dfWithToFilterColumn = dfWithRowNumber.withColumn(
    "toFilter", F.when(F.col("row_number") == F.lit(1), F.lit(1)).otherwise(F.lit(0))
).drop("row_number")


dfWithToFilterColumn.show()

The output is:

+------+------+--------+
|number|reason|toFilter|
+------+------+--------+
|  1234|   del|       1|
|  1234| undel|       0|
|  2423|   del|       1|
|  2423| undel|       0|
|  4567|   del|       1|
|  6789| undel|       1|
+------+------+--------+
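
If the goal is to end up with only the flagged rows, a plain filter on the new column (a standard DataFrame operation, not shown above) completes the task:

# Keep only the rows where the flag is 1
dfWithToFilterColumn.filter(F.col("toFilter") == 1).show()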

CodePudding user response:

Window functions are your friend. If you want to filter, do the following:

from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, monotonically_increasing_id, when

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # Number the reasons within each group
 .where(col('x') == 1)  # Keep only the first reason in each group
 .drop('x')  # Drop the helper column
).show()

+------+------+
|number|reason|
+------+------+
|  1234|   del|
|  2423|   del|
|  4567|   del|
|  6789| undel|
+------+------+

If you want to display zeros and ones, do the following:

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # Number the reasons within each group
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))  # Flag the first reason with 1, the rest with 0
).show()

+------+------+---+
|number|reason|  x|
+------+------+---+
|  1234|   del|  1|
|  1234| undel|  0|
|  2423|   del|  1|
|  2423| undel|  0|
|  4567|   del|  1|
|  6789| undel|  1|
+------+------+---+
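
One caveat: ordering the window by monotonically_increasing_id relies on the physical row order, so del only wins here because it happens to come first in each pair. Ordering by the reason column instead (as the previous answer does) makes the choice deterministic; a minimal variant of the same snippet:

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(col('reason'))))  # del sorts before undel
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))
).show()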