I have a hive table with column type MAP, getting errors while running the below Spark SQL queries:
df = spark.sql("""select distinct name, details from table_name""");
AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column details is map<string,string>;
df = spark.sql("""select name, details
from table_name
group by name, details""");
AnalysisException: expression table_name.details cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.;
Table:
Column_name datatype
----------------------------------------
name string
details map<string,string>
CodePudding user response:
You can first use ROW_NUMBER()
window function to enumerate rows inside partitions and then select only rows where the ROW_NUMBER
results in 1
.
Example input:
df = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
df.createOrReplaceTempView("table_name")
df.show()
# ---- --------
# |name| details|
# ---- --------
# | n|{m -> 1}|
# | n|{m -> 1}|
# ---- --------
Extracting only distinct records:
df_row_num = spark.sql("""
WITH cte_row_num AS (
SELECT name
,details
,ROW_NUMBER() OVER (
PARTITION BY name
,sort_array(map_keys(details))
,sort_array(map_values(details))
ORDER BY name) as row_num
FROM table_name)
SELECT name
,details
FROM cte_row_num
WHERE row_num = 1
""")
df_row_num.show()
# ---- --------
# |name| details|
# ---- --------
# | n|{m -> 1}|
# ---- --------
CodePudding user response:
It seems you can convert the map column to array of structs using map_entries
and then take a distinct. Then convert it back to a map column.
Here's a working example
data_sdf.show()
# ---- --------
# |name| details|
# ---- --------
# | n|{m -> 1}|
# | n|{m -> 1}|
# ---- --------
data_sdf.createOrReplaceTempView('data_tbl')
spark.sql('''
select name, map_from_entries(details_entries) as details
from (
select distinct name, sort_array(map_entries(details)) as details_entries
from data_tbl)
'''). \
show()
# ---- --------
# |name| details|
# ---- --------
# | n|{m -> 1}|
# ---- --------