Filter rows with minimum and maximum count


This is what the dataframe looks like:

+---+-----------------------------------------+-----+
|eco|eco_name                                 |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack          |5    |
|D86|Grunfeld, Exchange                       |3    |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd     |5    |
|A44|Old Benoni Defense                       |3    |
|C46|Three Knights                            |1    |
|C08|French, Tarrasch, Open, 4.ed ed          |13   |
|E59|Nimzo-Indian, 4.e3, Main line            |2    |
|A20|English                                  |2    |
|B20|Sicilian                                 |4    |
|B37|Sicilian, Accelerated Fianchetto         |2    |
|A33|English, Symmetrical                     |8    |
|C77|Ruy Lopez                                |8    |
|B43|Sicilian, Kan, 5.Nc3                     |10   |
|A04|Reti Opening                             |6    |
|A59|Benko Gambit                             |1    |
|A54|Old Indian, Ukrainian Variation, 4.Nf3   |3    |
|D30|Queen's Gambit Declined                  |19   |
|C01|French, Exchange                         |3    |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1    |
|E74|King's Indian, Averbakh, 6...c5          |2    |
+---+-----------------------------------------+-----+

Schema:

root
 |-- eco: string (nullable = true)
 |-- eco_name: string (nullable = true)
 |-- count: long (nullable = false)

I want to filter it so that only the two rows with the minimum and maximum counts remain.

The output dataframe should look something like:

+---+-----------------------------------------+---------------------+
|eco|eco_name                                 |number_of_occurrences|
+---+-----------------------------------------+---------------------+
|D30|Queen's Gambit Declined                  |19                   |
|C46|Three Knights                            |1                    |
+---+-----------------------------------------+---------------------+

I'm a beginner, so I'm really sorry if this is a stupid question.

CodePudding user response:

No need to apologize since this is the place to learn! One solution is to use a Window with rank to find the min/max rows:

from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)

df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a   |1   |
|b   |1   |
|c   |2   |
|d   |3   |
+----+----+

After the ranking, just filter to keep the rows that rank first in either direction:

df\
    .withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
    .withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
    .filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
    .show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d   |3   |4      |1      |
|a   |1   |1      |3      |
|b   |1   |1      |3      |
+----+----+-------+-------+

Please note that rank assigns tied rows the same rank, so if several rows share the minimum or maximum count, all of them will pass the filter (as with a and b above).
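As an alternative to window ranking, a sketch that computes the two extremes with a single aggregation and then filters on them (this assumes a local SparkSession and reuses the toy `col1`/`col2` data from above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.master('local[*]').getOrCreate()

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)

# One aggregation pass to get the global min and max of col2
bounds = df.agg(
    func.min('col2').alias('lo'),
    func.max('col2').alias('hi')
).first()

# Keep only the rows whose value equals either extreme;
# like rank, this keeps every row tied at the min or max
result = df.filter(
    (func.col('col2') == bounds['lo']) | (func.col('col2') == bounds['hi'])
)
result.show()

This avoids the single-partition shuffle that an unpartitioned Window triggers, at the cost of an extra aggregation job.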

CodePudding user response:

You can use the row_number function twice to order records by count, ascending and descending.

SELECT eco, eco_name, count
  FROM (SELECT *,
               row_number() over (order by count asc) as rna,
               row_number() over (order by count desc) as rnd
          FROM df)
 WHERE rna = 1 or rnd = 1;

Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied rows.
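The secondary-sort idea above can be sketched in Spark SQL against the same toy data; the `col1` tiebreaker is just for illustration (any deterministic column works):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
).createOrReplaceTempView('df')

# col1 breaks the tie at col2 = 1, so exactly one min row is returned
result = spark.sql("""
    SELECT col1, col2
      FROM (SELECT *,
                   row_number() OVER (ORDER BY col2 ASC,  col1 ASC) AS rna,
                   row_number() OVER (ORDER BY col2 DESC, col1 ASC) AS rnd
              FROM df)
     WHERE rna = 1 OR rnd = 1
""")
result.show()

With the tiebreaker, row a (not b) is chosen as the minimum; swap `col1 ASC` for `col1 DESC` to prefer the other tied row.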
