The following DataFrame should be filtered based on the flag column. If a group (grouping by columns id and cod) has no row with a flag value other than None, keep just one unique row; otherwise, remove the rows whose flag value is None.
import pandas as pd
data = [['1', 10, 'A'],
['1', 10, 'A'],
['1', 10, None],
['1', 15, 'A'],
['1', 15, None],
['2', 11, 'A'],
['2', 12, 'B'],
['2', 12, 'B'],
['2', 13, None],
['3', 14, None],
['3', 14, None],
['3', 15, None]]
df = pd.DataFrame(data, columns=['id', 'cod','flag'])
How could I obtain the following DataFrame from the one above using PySpark?
data = [['1', 10, 'A'],
['1', 15, 'A'],
['2', 11, 'A'],
['2', 12, 'B'],
['2', 13, None],
['3', 14, None],
['3', 15, None]]
df = pd.DataFrame(data, columns=['id', 'cod','flag'])
CodePudding user response:
One way is this:
from pyspark.sql import functions as F

df = df.groupBy('id', 'cod').agg(F.max('flag').alias('flag'))
You can't just drop duplicates based on columns id and cod, as there's no guarantee the row you keep would have a non-null flag. max(), by contrast, ignores nulls, so each group resolves to a non-null flag whenever one exists.
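A minimal runnable sketch of this approach on the question's data (the SparkSession setup here is an assumption; adapt it to your environment):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [['1', 10, 'A'], ['1', 10, 'A'], ['1', 10, None],
        ['1', 15, 'A'], ['1', 15, None],
        ['2', 11, 'A'], ['2', 12, 'B'], ['2', 12, 'B'], ['2', 13, None],
        ['3', 14, None], ['3', 14, None], ['3', 15, None]]
df = spark.createDataFrame(data, schema=['id', 'cod', 'flag'])

# max() skips nulls, so each (id, cod) group keeps a non-null flag
# when one exists and falls back to null otherwise
out = df.groupBy('id', 'cod').agg(F.max('flag').alias('flag'))
out.show()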
CodePudding user response:
This could be a solution:
new_df = pd.DataFrame(columns=df.columns)
for index, row in df.iterrows():
    # skip exact duplicates that are already in new_df
    if row.values.tolist() not in new_df.values.tolist():
        # skip a None-flag row if its (id, cod) pair is already present
        if row["flag"] is None and row.values.tolist()[:2] in new_df[list(new_df.columns)[:2]].values.tolist():
            continue
        new_df.loc[-1] = row.values.tolist()
        new_df.index = new_df.index + 1
At the end you could just add:
df = new_df.copy(deep=True)
del new_df
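For comparison, a vectorized pandas sketch of the same idea (assuming flag is an object column, where pandas treats None as missing and GroupBy.max skips it):

# Collapse each (id, cod) group to one row; max() skips missing
# values, so all-None groups come back as missing (NaN)
out = df.groupby(['id', 'cod'], as_index=False)['flag'].max()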
CodePudding user response:
With PySpark. Adapted from this answer (Spark). The window orders each (id, cod) partition by flag descending; Spark places nulls last under descending order, so the first row of a group has a non-null flag whenever one exists.
window = Window.partitionBy(['id', 'cod']).orderBy(col('flag').desc())
out = (df.withColumn('row', row_number().over(window))
       .filter(col('row') == 1).drop('row'))
out.show()
# Output
+---+---+----+
| id|cod|flag|
+---+---+----+
|  2| 11|   A|
|  2| 12|   B|
|  1| 15|   A|
|  3| 14|null|
|  2| 13|null|
|  3| 15|null|
|  1| 10|   A|
+---+---+----+
Setup
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
data = [['1', 10, 'A'],
['1', 10, 'A'],
['1', 10, None],
['1', 15, 'A'],
['1', 15, None],
['2', 11, 'A'],
['2', 12, 'B'],
['2', 12, 'B'],
['2', 13, None],
['3', 14, None],
['3', 14, None],
['3', 15, None]]
columns = ['id', 'cod', 'flag']
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data=data, schema=columns)
df.show()
# Output
+---+---+----+
| id|cod|flag|
+---+---+----+
|  1| 10|   A|
|  1| 10|   A|
|  1| 10|null|
|  1| 15|   A|
|  1| 15|null|
|  2| 11|   A|
|  2| 12|   B|
|  2| 12|   B|
|  2| 13|null|
|  3| 14|null|
|  3| 14|null|
|  3| 15|null|
+---+---+----+
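To make the null handling explicit rather than relying on the default ordering, the window can be written with desc_nulls_last (available since Spark 2.4):

# Equivalent window, spelling out that nulls sort after non-null flags
window = Window.partitionBy('id', 'cod').orderBy(col('flag').desc_nulls_last())
out = (df.withColumn('row', row_number().over(window))
       .filter(col('row') == 1).drop('row'))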