Filter rows with minimum and maximum count-CodePudding

This is what the dataframe looks like:

 --- ----------------------------------------- ----- 
|eco|eco_name                                 |count|
 --- ----------------------------------------- ----- 
|B63|Sicilian, Richter-Rauzer Attack          |5    |
|D86|Grunfeld, Exchange                       |3    |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd     |5    |
|A44|Old Benoni Defense                       |3    |
|C46|Three Knights                            |1    |
|C08|French, Tarrasch, Open, 4.ed ed          |13   |
|E59|Nimzo-Indian, 4.e3, Main line            |2    |
|A20|English                                  |2    |
|B20|Sicilian                                 |4    |
|B37|Sicilian, Accelerated Fianchetto         |2    |
|A33|English, Symmetrical                     |8    |
|C77|Ruy Lopez                                |8    |
|B43|Sicilian, Kan, 5.Nc3                     |10   |
|A04|Reti Opening                             |6    |
|A59|Benko Gambit                             |1    |
|A54|Old Indian, Ukrainian Variation, 4.Nf3   |3    |
|D30|Queen's Gambit Declined                  |19   |
|C01|French, Exchange                         |3    |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1    |
|E74|King's Indian, Averbakh, 6...c5          |2    |
 --- ----------------------------------------- -----

Schema:

root
 |-- eco: string (nullable = true)
 |-- eco_name: string (nullable = true)
 |-- count: long (nullable = false)

I want to filter it so that only two rows with minimum and maximum counts remain.

The output dataframe should look something like:

 --- ----------------------------------------- -------------------- 
|eco|eco_name                                 |number_of_occurences|
 --- ----------------------------------------- -------------------- 
|D30|Queen's Gambit Declined                  |19                  |
|C46|Three Knights                            |1                   |
 --- ----------------------------------------- --------------------

I'm a beginner, I'm really sorry if this is a stupid question.

CodePudding user response：

No need to apologize since this is the place to learn! One of the solutions is to use the Window and rank to find the min/max row:

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)

df.show(10, False)
 ---- ---- 
|col1|col2|
 ---- ---- 
|a   |1   |
|b   |1   |
|c   |2   |
|d   |3   |
 ---- ----

Just use filtering to find the min/max count row after the ranking:

df\
    .withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
    .withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
    .filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
    .show(100, False)
 ---- ---- ------- ------- 
|col1|col2|min_row|max_row|
 ---- ---- ------- ------- 
|d   |3   |4      |1      |
|a   |1   |1      |3      |
|b   |1   |1      |3      |
 ---- ---- ------- -------

Please note that if the min/max row count are the same, they will be both filtered out.

CodePudding user response：

You can use row_number function twice to order records by count, ascending and descending.

SELECT eco, eco_name, count
  FROM (SELECT *,
               row_number() over (order by count asc) as rna,
               row_number() over (order by count desc) as rnd
          FROM df)
 WHERE rna = 1 or rnd = 1;

Note there's a tie for count = 1. If you care about it add a secondary sort to control which record is selected or maybe use rank instead to select all.