Spark SQL: Select 10 randomly chosen groups of value in column of table

I have a table in a database. It has a column "id", and each id appears in multiple rows. I want to randomly pick 10 ids and then load all the rows with those ids. The following works and gives me a table with 10 random ids. That's fine.

distinct_ids = spark.sql(
    f"""SELECT DISTINCT id
    FROM {database_name}.{orig_table_cl_name}
    ORDER BY RAND() LIMIT 10"""
)
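
(A quick way to inspect the sample, using standard PySpark:

distinct_ids.show()   # displays the 10 sampled ids

)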

But now I can't figure out how to do an inner join with the original table to get all the rows containing those ids... I tried:

distinct_ids = spark.sql(
f""" (SELECT DISTINCT vehicle_id 
FROM {database_name}.{orig_table_name} 
ORDER BY RAND() LIMIT 10) 
AS table
INNER JOIN table
ON {database_name}.{orig_table_name}.id = table.id""")

which gives me the following error:

ParseException: 
mismatched input 'AS' expecting {<EOF>, ';'}(line 1, pos 99)

== SQL ==
 (SELECT DISTINCT vehicle_id FROM pnds12v_dev_core.t_cycle_log_car_v2_0  ORDER BY RAND() LIMIT 10) AS table

I tried a bunch of other argument orders, but I figure I'm missing a general understanding of how this works... and since I thought semicolons aren't used with spark.sql, the error message doesn't give me any hints.

Can someone explain to me how to do such a query?

CodePudding user response:

Use a common table expression (CTE) to define a temporary result set. (Avoid naming the CTE table, since TABLE is a SQL keyword; sampled is used below.)

distinct_ids = spark.sql(
    f"""WITH sampled AS (
        SELECT DISTINCT vehicle_id AS id
        FROM {database_name}.{orig_table_name}
        ORDER BY RAND() LIMIT 10
    )
    SELECT *
    FROM sampled
    INNER JOIN {database_name}.{orig_table_name}
    ON {database_name}.{orig_table_name}.id = sampled.id"""
)
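
For reference, roughly the same thing can be written with the DataFrame API instead of SQL. This is a sketch assuming the join column is named id, as in the question's first snippet:

from pyspark.sql import functions as F

orig = spark.table(f"{database_name}.{orig_table_name}")

# Sample 10 distinct ids, then keep every row whose id was sampled.
sampled_ids = orig.select("id").distinct().orderBy(F.rand()).limit(10)
result = orig.join(sampled_ids, on="id", how="inner")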

CodePudding user response:

The semicolon error more or less means that the AS keyword was not expected at that position, and that a semicolon (i.e., the end of the query) would have been valid in its place.
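
A minimal illustration (with a hypothetical table name my_table): a parenthesized query with an alias is only valid as a derived table inside a FROM clause, not as a standalone statement.

# Fails with the same ParseException: an aliased derived table cannot stand alone.
# spark.sql("(SELECT DISTINCT id FROM my_table LIMIT 10) AS t")

# Parses fine: the derived table sits inside a FROM clause.
spark.sql("SELECT * FROM (SELECT DISTINCT id FROM my_table LIMIT 10) AS t")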

One way to solve your issue is to "reverse" the SQL query and join the original table with a subquery that selects the 10 random ids:

spark.sql(
    f"""SELECT {database_name}.{orig_table_name}.*
    FROM {database_name}.{orig_table_name}
    INNER JOIN (
        SELECT DISTINCT id
        FROM {database_name}.{orig_table_name}
        ORDER BY RAND() LIMIT 10
    ) t
    ON {database_name}.{orig_table_name}.id = t.id"""
)
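
As a quick sanity check (assuming you assign the query above to a DataFrame, say df), the join should leave at most 10 distinct ids:

df.select("id").distinct().count()   # expect a value <= 10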