I have a dataframe with this structure:
Col1     Col2  Col3   Col4    Col5  Col6
Apple    1.0
Apple    1.0   Fruit  Green   0.99  OK
Apple                 Green   0.99
Apple    0.99
Banana   2.0   Fruit  Yellow  1.29
Banana   2.0
Banana   2.0   Fruit
Coconut  2.2
Coconut        Fruit
Coconut  2.2   Fruit  Brown         OK
I need to keep the rows with the most values filled in.
The dataframe I need from this example:
Col1     Col2  Col3   Col4    Col5  Col6
Apple    1.0   Fruit  Green   0.99  OK
Banana   2.0   Fruit  Yellow  1.29
Coconut  2.2   Fruit  Brown         OK
I need to apply this logic to a lot of dataframes (PySpark).
Edit: the non-informed values are not nulls, just empty strings (no non-whitespace characters).
CodePudding user response:
For each cell, compute a boolean indicating whether it is filled. Sum these per row, then keep the rows with the maximum sum within each Col1 group.
Here is the code in pandas, where df is your input dataframe:
# blanks are empty strings, so test for non-empty rather than notna()
nb_col_filled = df.ne('').sum(axis=1)
# within each Col1 group, keep the row(s) with the highest count
result_df = df[nb_col_filled == nb_col_filled.groupby(df['Col1']).transform('max')]
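Since the question asks for PySpark, here is a minimal sketch of the same row-count idea, assuming the column names from the example and that blanks are empty strings; a window over Col1 keeps every row that ties for the highest count within its group.
from pyspark.sql import functions as F, Window

# count the non-blank cells in each row (blanks are '' rather than null)
n_filled = sum((F.trim(F.col(c)) != '').cast('int') for c in df.columns)

w = Window.partitionBy('Col1')
result = (df
    .withColumn('n_filled', n_filled)
    .withColumn('max_filled', F.max('n_filled').over(w))
    .filter(F.col('n_filled') == F.col('max_filled'))
    .drop('n_filled', 'max_filled')
)
Unlike a per-column max, this keeps whole original rows, so values are never mixed across rows of the same group.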
CodePudding user response:
Just use max to get the "max" value of each column within each Col1 group. An empty string sorts before any non-empty string, so max picks a filled value whenever one exists.
sample data
df = spark.createDataFrame([
    ('Apple',   '1.0',  '',      '',       '',     ''),
    ('Apple',   '1.0',  'Fruit', 'Green',  '0.99', 'OK'),
    ('Apple',   '',     '',      'Green',  '0.99', ''),
    ('Apple',   '0.99', '',      '',       '',     ''),
    ('Banana',  '2.0',  'Fruit', 'Yellow', '1.29', ''),
    ('Banana',  '2.0',  '',      '',       '',     ''),
    ('Banana',  '2.0',  'Fruit', '',       '',     ''),
    ('Coconut', '2.2',  '',      '',       '',     ''),
    ('Coconut', '',     'Fruit', '',       '',     ''),
    ('Coconut', '2.2',  'Fruit', 'Brown',  '',     'OK'),
], ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'])
Applying max
from pyspark.sql import functions as F

(df
    .groupBy('Col1')
    .agg(
        F.max('Col2').alias('Col2'),
        F.max('Col3').alias('Col3'),
        F.max('Col4').alias('Col4'),
        F.max('Col5').alias('Col5'),
        F.max('Col6').alias('Col6'),
    )
    .show()
)
+-------+----+-----+------+----+----+
|   Col1|Col2| Col3|  Col4|Col5|Col6|
+-------+----+-----+------+----+----+
|  Apple| 1.0|Fruit| Green|0.99|  OK|
| Banana| 2.0|Fruit|Yellow|1.29|    |
|Coconut| 2.2|Fruit| Brown|    |  OK|
+-------+----+-----+------+----+----+
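Since the question mentions applying the logic to a lot of dataframes, the aggregation list can also be built from df.columns instead of being spelled out; keep_most_informed below is a hypothetical helper name, and the key column is assumed to default to Col1.
from pyspark.sql import functions as F

def keep_most_informed(df, key='Col1'):
    # '' sorts before any non-empty string, so max prefers filled values
    aggs = [F.max(c).alias(c) for c in df.columns if c != key]
    return df.groupBy(key).agg(*aggs)

result = keep_most_informed(df)  # reuse the same helper on every dataframe
Note that max aggregates each column independently, so the result can combine values taken from different rows of the same group; for this sample it matches the expected output.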