Select Distinct record from hive map type column using Spark SQL

I have a Hive table with a column of type map<string,string>, and I get errors when running the Spark SQL queries below:

df = spark.sql("""select distinct name, details from table_name""")

AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column details is map<string,string>;

df = spark.sql("""select name, details 
                  from table_name
                  group by name, details""")

AnalysisException: expression table_name.details cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.;
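
The DataFrame API shortcuts appear to hit the same analyzer check, so these fail with the same errors (a sketch against the same table):

df = spark.table("table_name")
df.distinct()                            # same AnalysisException: set operation on a map column
df.dropDuplicates(["name", "details"])   # same problem: the deduplication keys include a map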

Table:

Column_name         datatype
----------------------------------------
name                string
details             map<string,string>

CodePudding user response:

You can first use the ROW_NUMBER() window function to enumerate rows within partitions and then keep only the rows where the row number is 1. Since maps are not orderable, partition by the sorted keys and sorted values of the map so that rows with equal maps land in the same partition.

Example input:

df = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
df.createOrReplaceTempView("table_name")
df.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

Extracting only distinct records:

df_row_num = spark.sql("""
    WITH cte_row_num AS (
        SELECT name
              ,details 
              ,ROW_NUMBER() OVER (
                  PARTITION BY name
                              ,sort_array(map_keys(details))
                              ,sort_array(map_values(details))
                  ORDER BY name) as row_num
        FROM table_name)
    SELECT name
          ,details 
    FROM cte_row_num
    WHERE row_num = 1
""")

df_row_num.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+

CodePudding user response:

It seems you can convert the map column to an array of structs using map_entries, deduplicate that with distinct, and then convert it back to a map column with map_from_entries.

Here's a working example:

data_sdf.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

data_sdf.createOrReplaceTempView('data_tbl')

spark.sql('''
    select name, map_from_entries(details_entries) as details
    from (
        select distinct name, sort_array(map_entries(details)) as details_entries
        from data_tbl)
    '''). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
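
The same idea can be sketched with the DataFrame API as well, assuming Spark 3.x, where map_entries and map_from_entries are exposed in pyspark.sql.functions:

from pyspark.sql import functions as F

# map -> sorted array of structs (orderable), deduplicate, rebuild the map
deduped_sdf = (data_sdf
    .select('name', F.sort_array(F.map_entries('details')).alias('details_entries'))
    .distinct()
    .select('name', F.map_from_entries('details_entries').alias('details')))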