I have a dataframe with this structure:
Col1     Col2  Col3   Col4    Col5  Col6
Apple    1.0
Apple    1.0   Fruit  Green   0.99  OK
Apple                 Green   0.99
Apple    0.99
Banana   2.0   Fruit  Yellow  1.29
Banana   2.0
Banana   2.0   Fruit
Coconut  2.2
Coconut        Fruit
Coconut  2.2   Fruit  Brown         OK
I need to keep the rows with the most values filled in.
The dataframe I need from this example:
Col1     Col2  Col3   Col4    Col5  Col6
Apple    1.0   Fruit  Green   0.99  OK
Banana   2.0   Fruit  Yellow  1.29
Coconut  2.2   Fruit  Brown         OK
I need to apply this logic to a lot of dataframes (PySpark).
Edit: the non-informed values are not nulls, just empty strings (no non-whitespace characters).
CodePudding user response:
For each cell, compute a boolean indicating whether it is filled. Sum these per row, then keep the rows with the maximum sum within each Col1 group.
Here is the code in pandas, where df is your input dataframe:
# blanks are empty strings, so test for non-empty rather than notna()
nb_col_filled = df.ne('').sum(axis=1)
# within each Col1 group, keep the row(s) with the highest count
result_df = df[nb_col_filled == nb_col_filled.groupby(df['Col1']).transform('max')]
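Since the question asks for PySpark, here is a minimal sketch of the same row-count idea, assuming the column names from the example and that blanks are empty strings; a window over Col1 keeps every row that ties for the highest count within its group.
from pyspark.sql import functions as F, Window

# count the non-blank cells in each row (blanks are '' rather than null)
n_filled = sum((F.trim(F.col(c)) != '').cast('int') for c in df.columns)

w = Window.partitionBy('Col1')
result = (df
    .withColumn('n_filled', n_filled)
    .withColumn('max_filled', F.max('n_filled').over(w))
    .filter(F.col('n_filled') == F.col('max_filled'))
    .drop('n_filled', 'max_filled')
)
Unlike a per-column max, this keeps whole original rows, so values are never mixed across rows of the same group.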
CodePudding user response:
Just use max to get the "max" value of each column within each Col1 group. An empty string sorts before any non-empty string, so max picks a filled value whenever one exists.
sample data
df = spark.createDataFrame([
    ('Apple',   '1.0',  '',      '',       '',     ''),
    ('Apple',   '1.0',  'Fruit', 'Green',  '0.99', 'OK'),
    ('Apple',   '',     '',      'Green',  '0.99', ''),
    ('Apple',   '0.99', '',      '',       '',     ''),
    ('Banana',  '2.0',  'Fruit', 'Yellow', '1.29', ''),
    ('Banana',  '2.0',  '',      '',       '',     ''),
    ('Banana',  '2.0',  'Fruit', '',       '',     ''),
    ('Coconut', '2.2',  '',      '',       '',     ''),
    ('Coconut', '',     'Fruit', '',       '',     ''),
    ('Coconut', '2.2',  'Fruit', 'Brown',  '',     'OK'),
], ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'])
Applying max
from pyspark.sql import functions as F

(df
    .groupBy('Col1')
    .agg(
        F.max('Col2').alias('Col2'),
        F.max('Col3').alias('Col3'),
        F.max('Col4').alias('Col4'),
        F.max('Col5').alias('Col5'),
        F.max('Col6').alias('Col6'),
    )
    .show()
)
+-------+----+-----+------+----+----+
|   Col1|Col2| Col3|  Col4|Col5|Col6|
+-------+----+-----+------+----+----+
|  Apple| 1.0|Fruit| Green|0.99|  OK|
| Banana| 2.0|Fruit|Yellow|1.29|    |
|Coconut| 2.2|Fruit| Brown|    |  OK|
+-------+----+-----+------+----+----+
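Since the question mentions applying the logic to a lot of dataframes, the aggregation list can also be built from df.columns instead of being spelled out; keep_most_informed below is a hypothetical helper name, and the key column is assumed to default to Col1.
from pyspark.sql import functions as F

def keep_most_informed(df, key='Col1'):
    # '' sorts before any non-empty string, so max prefers filled values
    aggs = [F.max(c).alias(c) for c in df.columns if c != key]
    return df.groupBy(key).agg(*aggs)

result = keep_most_informed(df)  # reuse the same helper on every dataframe
Note that max aggregates each column independently, so the result can combine values taken from different rows of the same group; for this sample it matches the expected output.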