Home > Software design >  Access dictionary value using Spark dataframe column as key
Access dictionary value using Spark dataframe column as key

Time:10-28

I have a dictionary below called sample.

sample = {
     10:1
     20:2
     30:3
}

I have a dataframe column called age for example and the column contains 10. I want to use this dataframe column as the key lookup to give me the value 1 if the key is 10 for example.

    df.withColumn("NewColumnname", dict.get(col("age")))

unhashable type: 'Column'

is coming up but not sure what way to use the column in the dataframe as a key lookup.

CodePudding user response:

withColumn's second argument is a Column object, that's why you got that error. If all you want is to do that lookup, then the best way is to do an inner join. I'd recommend avoiding UDF as much as possible.

converting your original dict to a "lookup" dataframe

sample = {10:1, 20:2, 30:3}

dict_df = spark.createDataFrame([(k, v) for k,v in sample.items()], schema = ['k', 'v'])

#  --- --- 
# |  k|  v|
#  --- --- 
# | 10|  1|
# | 20|  2|
# | 30|  3|
#  --- --- 

this is sample dataframe

df = spark.sparkContext.parallelize([(10,), (20,), (30,)]).toDF(['col'])

#  --- 
# |col|
#  --- 
# | 10|
# | 20|
# | 30|
#  --- 

perform inner join to get the lookup value

df.join(dict_df, on=[dict_df.k == df.col])

#  --- --- --- 
# |col|  k|  v|
#  --- --- --- 
# | 10| 10|  1|
# | 30| 30|  3|
# | 20| 20|  2|
#  --- --- --- 

CodePudding user response:

You are trying to look up the entire column in the dict, and this doesn't work, because a. columns are unhashable and b. there is never going to be a key in the dict which contains the entire column.

Clearly what you want to do is to look up each individual value in the column in the dict. Thre are two ways to do this, the right one depending on what you want to do with the data. Either make your dict into a dataframe and do it in pandas, or iterate the column and lookup. The latter is probably what you want here and would like something like this:

df["NewColumnname"] = df["age"].apply(lambda x: sample[x])

EDIT on second thoughts that lambda is a bit useless, and this is a usecase for .get:

df["new"] = df["age"].apply(sample.get)

Note that whilst dicts have a .get method, it's more pythonic to use [] access if you're just looking up a value which you know is there.

Returning

This modifies df, so we can simply add:

return df

and get df out of a function. If you need to work on a copy, make one first and then add the column in the same way.

Demonstration

import pandas as pd

df = pd.DataFrame([[7],[8]], columns=["col"])
d = {7: "seven", 8: "eight"}
df["new"] = df["col"].apply(d.get)
print(df)
  • Related