This is what the dataframe looks like:
--- ----------------------------------------- -----
|eco|eco_name |count|
--- ----------------------------------------- -----
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
--- ----------------------------------------- -----
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only two rows with minimum and maximum counts remain.
The output dataframe should look something like:
--- ----------------------------------------- --------------------
|eco|eco_name |number_of_occurences|
--- ----------------------------------------- --------------------
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
--- ----------------------------------------- --------------------
I'm a beginner, I'm really sorry if this is a stupid question.
CodePudding user response:
No need to apologize since this is the place to learn! One of the solutions is to use the Window
and rank
to find the min/max row:
df = spark.createDataFrame(
[('a', 1), ('b', 1), ('c', 2), ('d', 3)],
schema=['col1', 'col2']
)
df.show(10, False)
---- ----
|col1|col2|
---- ----
|a |1 |
|b |1 |
|c |2 |
|d |3 |
---- ----
Just use filtering to find the min/max count row after the ranking:
df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
---- ---- ------- -------
|col1|col2|min_row|max_row|
---- ---- ------- -------
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
---- ---- ------- -------
Please note that if the min/max row count are the same, they will be both filtered out.
CodePudding user response:
You can use row_number function twice to order records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1
. If you care about it add a secondary sort to control which record is selected or maybe use rank instead to select all.