How to filter based on group condition

Time: 07-01

The following DataFrame should be filtered based on the flag column. If a group (defined by columns id and cod) has no row with a value other than None, keep just one unique row; otherwise, remove the rows whose flag value is None.

import pandas as pd

data = [['1', 10, 'A'],
        ['1', 10, 'A'],
        ['1', 10, None],
        ['1', 15, 'A'],
        ['1', 15, None],
        ['2', 11, 'A'],
        ['2', 12, 'B'],
        ['2', 12, 'B'],
        ['2', 13, None],
        ['3', 14, None],
        ['3', 14, None],
        ['3', 15, None]]

df = pd.DataFrame(data, columns=['id', 'cod','flag'])

How could I obtain the following DataFrame from the one above using PySpark?

data = [['1', 10, 'A'],
        ['1', 15, 'A'],
        ['2', 11, 'A'],
        ['2', 12, 'B'],
        ['2', 13, None],
        ['3', 14, None],
        ['3', 15, None]]
df = pd.DataFrame(data, columns=['id', 'cod','flag'])

CodePudding user response:

One way is this:

from pyspark.sql import functions as F

df = df.groupBy('id', 'cod').agg(F.max('flag').alias('flag'))

You can't simply drop duplicates based on columns id and cod, as there is no guarantee that the row you keep would always be the one whose flag is not null.

CodePudding user response:

This could be a solution in plain pandas:

new_df = pd.DataFrame(columns=df.columns)

for index, row in df.iterrows():
    # skip exact duplicate rows
    if row.values.tolist() not in new_df.values.tolist():

        # skip a None-flag row if its (id, cod) pair is already present
        if row["flag"] is None and row.values.tolist()[:2] in new_df[list(new_df.columns)[:2]].values.tolist():
            continue

        # prepend the row, then shift the index up by one
        new_df.loc[-1] = row.values.tolist()
        new_df.index += 1

At the end you can just add:

df = new_df.copy(deep=True)
del new_df
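For comparison, the row-by-row loop above can be replaced by a vectorized pandas sketch (my own variant, not from the original answer): drop exact duplicates first, then drop None-flag rows only in groups that also contain a non-null flag.

```python
import pandas as pd

data = [['1', 10, 'A'], ['1', 10, 'A'], ['1', 10, None],
        ['1', 15, 'A'], ['1', 15, None],
        ['2', 11, 'A'], ['2', 12, 'B'], ['2', 12, 'B'], ['2', 13, None],
        ['3', 14, None], ['3', 14, None], ['3', 15, None]]
df = pd.DataFrame(data, columns=['id', 'cod', 'flag'])

dedup = df.drop_duplicates()
# for each (id, cod) group: does any non-null flag exist?
has_flag = dedup.groupby(['id', 'cod'])['flag'].transform(lambda s: s.notna().any())
# keep rows with a flag, plus None rows from groups that never had one
out = dedup[dedup['flag'].notna() | ~has_flag].reset_index(drop=True)
print(out)
```

Unlike the loop, this does not depend on non-null rows appearing before None rows within a group.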

CodePudding user response:

With PySpark, adapted from this answer (Spark):

window = Window.partitionBy(['id', 'cod']).orderBy(col('flag').desc())
out = (df.withColumn('row',  row_number().over(window))
         .filter(col('row') == 1).drop('row'))
out.show()

# Output
+---+---+----+
| id|cod|flag|
+---+---+----+
|  2| 11|   A|
|  2| 12|   B|
|  1| 15|   A|
|  3| 14|null|
|  2| 13|null|
|  3| 15|null|
|  1| 10|   A|
+---+---+----+

Setup

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

data = [['1', 10, 'A'],
        ['1', 10, 'A'],
        ['1', 10, None],
        ['1', 15, 'A'],
        ['1', 15, None],
        ['2', 11, 'A'],
        ['2', 12, 'B'],
        ['2', 12, 'B'],
        ['2', 13, None],
        ['3', 14, None],
        ['3', 14, None],
        ['3', 15, None]]

columns = ['id', 'cod', 'flag']
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data=data, schema=columns)
df.show()

# Output
+---+---+----+
| id|cod|flag|
+---+---+----+
|  1| 10|   A|
|  1| 10|   A|
|  1| 10|null|
|  1| 15|   A|
|  1| 15|null|
|  2| 11|   A|
|  2| 12|   B|
|  2| 12|   B|
|  2| 13|null|
|  3| 14|null|
|  3| 14|null|
|  3| 15|null|
+---+---+----+
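The window trick can also be sanity-checked in pandas — a sketch of the same idea (sort so non-null flags come first within each group, then keep the first row per (id, cod)), mirroring rather than replacing the PySpark answer:

```python
import pandas as pd

data = [['1', 10, 'A'], ['1', 10, 'A'], ['1', 10, None],
        ['1', 15, 'A'], ['1', 15, None],
        ['2', 11, 'A'], ['2', 12, 'B'], ['2', 12, 'B'], ['2', 13, None],
        ['3', 14, None], ['3', 14, None], ['3', 15, None]]
df = pd.DataFrame(data, columns=['id', 'cod', 'flag'])

# descending sort with na_position='last' plays the role of
# orderBy(col('flag').desc()): non-null flags rank first per group
out = (df.sort_values('flag', ascending=False, na_position='last')
         .drop_duplicates(['id', 'cod'])   # keep='first' ~ row_number() == 1
         .sort_values(['id', 'cod'])
         .reset_index(drop=True))
print(out)
```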