Scenario: If a number has two records, one with reason del and one with reason undel, then only the del record should be selected. If a number has only one record, whether del or undel, that record should also be selected.
Example: I have a dataframe with 2 columns as below:
number | reason |
---|---|
1234 | del |
1234 | undel |
4567 | del |
6789 | undel |
2423 | del |
2423 | undel |
My expected output is as below:
number | reason | tofilter |
---|---|---|
1234 | del | 1 |
1234 | undel | 0 |
4567 | del | 1 |
6789 | undel | 1 |
2423 | del | 1 |
2423 | undel | 0 |
Here I want to keep only those records where tofilter is 1.
CodePudding user response:
You can generate your expected output from the dataframe you provided with the following, assuming that a duplicated number has at most two rows (one del and one undel, with del first).
import pandas as pd

data = {'number': ['1234', '1234', '4567', '6789', '2423', '2423'],
        'reason': ['del', 'undel', 'del', 'undel', 'del', 'undel']}
df = pd.DataFrame(data, columns=['number', 'reason'])

# Count how many rows each number has
unique_vals = df['number'].value_counts().to_dict()

# Default every row to selected (1), then deselect the second row of each pair
df['to_filter'] = 1
for nbr, count in unique_vals.items():
    if count == 2:
        # Relies on the del row appearing before the undel row, as in the sample
        indexes = df[df['number'] == nbr].index.to_list()
        df.loc[indexes[1], 'to_filter'] = 0
which outputs:
  number reason  to_filter
0   1234    del          1
1   1234  undel          0
2   4567    del          1
3   6789  undel          1
4   2423    del          1
5   2423  undel          0
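If you prefer to avoid the explicit loop, here is a vectorized sketch; it relies only on 'del' sorting alphabetically before 'undel', so it does not depend on the input row order:

import pandas as pd

data = {'number': ['1234', '1234', '4567', '6789', '2423', '2423'],
        'reason': ['del', 'undel', 'del', 'undel', 'del', 'undel']}
df = pd.DataFrame(data, columns=['number', 'reason'])

# Sort so each number's del row comes first, flag each group's first row,
# then sort_index() realigns the flags with the original row order
df['to_filter'] = (
    df.sort_values(['number', 'reason'])
      .groupby('number')
      .cumcount()
      .eq(0)
      .astype(int)
      .sort_index()
)

# Keep only the selected rows
result = df[df['to_filter'] == 1]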
CodePudding user response:
Here is my try with window functions.
It will only work for the described scenario, where each number has at most two records, one with del and one with undel. If there are duplicates, it needs to be adjusted.
from pyspark.sql import Window
import pyspark.sql.functions as F
x = [
(1234, "del"),
(1234, "undel"),
(4567, "del"),
(6789, "undel"),
(2423, "del"),
(2423, "undel"),
]
df = spark.createDataFrame(x, schema=["number", "reason"])
# Within each number, order by reason so the del row (alphabetically first) gets row_number 1
window = Window.partitionBy("number").orderBy(F.col("reason").asc())
dfWithRowNumber = df.withColumn("row_number", F.row_number().over(window))
# Flag the first row of each group with 1 and any later row with 0
dfWithToFilterColumn = dfWithRowNumber.withColumn(
    "toFilter", F.when(F.col("row_number") == F.lit(1), F.lit(1)).otherwise(F.lit(0))
).drop("row_number")
dfWithToFilterColumn.show()
The output is:
+------+------+--------+
|number|reason|toFilter|
+------+------+--------+
|  1234|   del|       1|
|  1234| undel|       0|
|  2423|   del|       1|
|  2423| undel|       0|
|  4567|   del|       1|
|  6789| undel|       1|
+------+------+--------+
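To then keep only the rows the question asks for, you can filter on the new column afterwards, for example:

dfWithToFilterColumn.where(F.col("toFilter") == 1).show()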
CodePudding user response:
A window function is your friend. If you want to filter, do the following:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, monotonically_increasing_id, when

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # number the reasons within each group
 .where(col('x') == 1)  # keep only the first reason in each group
 .drop('x')             # drop the helper column
).show()
+------+------+
|number|reason|
+------+------+
|  1234|   del|
|  2423|   del|
|  4567|   del|
|  6789| undel|
+------+------+
If you want to display zeros and ones instead, do the following:
(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(monotonically_increasing_id())))  # number the reasons within each group
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))  # turn the rank into a 1/0 flag
).show()
+------+------+---+
|number|reason|  x|
+------+------+---+
|  1234|   del|  1|
|  1234| undel|  0|
|  2423|   del|  1|
|  2423| undel|  0|
|  4567|   del|  1|
|  6789| undel|  1|
+------+------+---+
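One caveat: monotonically_increasing_id() just mirrors the current physical row order, so the del row only ranks first because it happens to precede the undel row in the sample data. A sketch of a deterministic variant, ordering the window by reason itself (this works because 'del' sorts alphabetically before 'undel'):

(df.withColumn('x', dense_rank().over(Window.partitionBy('number').orderBy(col('reason').asc())))  # del ranks 1, undel ranks 2
 .withColumn('x', when(col('x') == 1, 1).otherwise(0))
).show()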