Spark dataframe slice


I have a table with (millions of) entries along the lines of the following example, read into a Spark DataFrame (sdf):

Id C1 C2
xx1 c118 c219
xx1 c113 c218
xx1 c118 c214
acb c121 c201
e3d c181 c221
e3d c132 c252
abq c141 c290
... ... ...
vy1 c13023 C23021

I'd like to get a smaller subset of these Ids for further processing. I identify the unique set of Ids in the table using sdf_id = sdf.select("Id").dropDuplicates().

What is an efficient way from here to filter the data (C1 and C2) related to, say, 100 randomly selected Ids?

CodePudding user response:

Since you already have the DataFrame of unique ids, you can sample it down to your desired fraction and filter based on that.

There are other ways to sample random ids, which can be found here.

Sampling

# sample() takes a fraction, not a count; to draw roughly 100 Ids, use
# 100 / (number of unique Ids) as the fraction (0.01 here assumes ~10,000 unique Ids)
sdf_id = [r['Id'] for r in sdf.select("Id").dropDuplicates().sample(0.01).collect()]
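Because sample() expects a fraction of rows rather than a target count, the fraction has to be derived from the number of distinct Ids. A minimal sketch of that arithmetic (the counts here are assumptions, standing in for the result of a count() on the distinct Ids):

```python
# Assumed counts for illustration: sample() expects a fraction of rows,
# so the target count must be converted relative to the distinct-Id count.
n_unique_ids = 10_000        # hypothetical number of distinct Ids in the table
target = 100                 # number of Ids we want in the subset
fraction = target / n_unique_ids
print(fraction)              # 0.01
```

Note that sample() draws each row independently with this probability, so the result is approximately 100 Ids, not exactly 100.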

Filter

from pyspark.sql import functions as F

sdf_filtered = sdf.filter(F.col('Id').isin(sdf_id))

CodePudding user response:

There are several ways to achieve what you want.

My sample data
df = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])
The initial step is getting the sample of IDs that you want:
ids = df.select('id').distinct().sample(0.2) # 0.2 is 20%, you can adjust this

+---+
| id|
+---+
|  1|
+---+
Approach #1: using inner join

Since you have two dataframes, you can just perform a single inner join to get all records from df for each id in ids. Note that F.broadcast is there to boost performance, because ids is supposed to be small enough to broadcast; feel free to take it out if you want. Performance-wise, this approach is preferred.

df.join(F.broadcast(ids), on=['id'], how='inner').show()

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
Approach #2: using isin

You can't simply pass the result of ids.collect() to isin, because collect() returns a list of Row objects; you have to loop through them to extract the exact column you want (id in this case).

df.where(F.col('id').isin([r['id'] for r in ids.collect()])).show()

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
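The Row-extraction step in the comprehension can be seen in isolation: a pyspark Row supports mapping-style access like r['id'], so plain dicts stand in for Rows in this self-contained sketch (no Spark session needed):

```python
# Plain dicts stand in for pyspark Row objects, which also support r['id'] access.
collected = [{'id': 1}]                   # roughly what ids.collect() returns here
id_values = [r['id'] for r in collected]  # pull out just the column values
print(id_values)                          # [1]
```

The resulting plain list is what isin() expects; passing the raw Row objects instead would silently match nothing.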