Create a list of values from an array of maps in PySpark

Time:12-23

I have a table like this

company_id  | an_array_of_maps
--------------------------------------------------------------
234         | [{"a": "a2", "b": "b2"}, {"a": "a4", "b": "b2"}]
123         | [{"a": "a1", "b": "b1"}, {"a": "a1", "b": "b1"}]
678         | [{"b": "b5", "c": "c5"}, {"b": Null, "c": "c5"}]

and I want to get a table like this (the value of the "a" key in each map):

company_id  | array_of_as
--------------------------------------------------------------
234         | ["a2", "a4"]
123         | ["a1", "a1"]
678         | [Null, Null]

I tried this:

df.withColumn("array_of_as", F.expr("filter(an_array_of_maps, x -> x.a)")).show()

but I get the following error:

AnalysisException: cannot resolve 'filter(`an_array_of_maps`, lambdafunction(namedlambdavariable()['a'], namedlambdavariable()))' due to data type mismatch: argument 2 requires boolean type, however, 'lambdafunction(namedlambdavariable()['a'], namedlambdavariable())' is of string type.;

CodePudding user response:

Got it - filter is the wrong function. It should be:

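from pyspark.sql import functions as F

# transform applies the lambda to every array element; here it pulls out each map's "a" value
(
    df
    .withColumn("array_of_as", F.expr("transform(an_array_of_maps, x -> x.a)"))
    .show()
)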

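I was not filtering anything. I was transforming the list of maps into a list of values taken from those maps, hence transform. filter expects a lambda that returns a boolean, which is why x -> x.a (a string) caused the data type mismatch.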
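To make the difference concrete, here is a rough plain-Python analogy that needs no Spark session. It is only a sketch: transform_like and filter_like are hypothetical helper names made up for this illustration, not Spark functions, and the rows dict just copies the sample table from the question. A missing key becomes None here; how Spark itself treats a missing map key can depend on its version and ANSI settings.

```python
# Plain-Python analogy, not Spark code: rows mirror the question's sample table.
rows = {
    234: [{"a": "a2", "b": "b2"}, {"a": "a4", "b": "b2"}],
    123: [{"a": "a1", "b": "b1"}, {"a": "a1", "b": "b1"}],
    678: [{"b": "b5", "c": "c5"}, {"b": None, "c": "c5"}],
}


def transform_like(array, fn):
    """Hypothetical helper: apply fn to every element, keeping the array length."""
    return [fn(x) for x in array]


def filter_like(array, predicate):
    """Hypothetical helper: keep only the elements for which predicate is true."""
    return [x for x in array if predicate(x)]


# transform-style: one output value per map (None when the "a" key is absent)
array_of_as = {
    cid: transform_like(maps, lambda x: x.get("a")) for cid, maps in rows.items()
}

# filter-style: the lambda must be a yes/no test, and whole maps come back
maps_with_a = {
    cid: filter_like(maps, lambda x: x.get("a") is not None)
    for cid, maps in rows.items()
}

for cid in rows:
    print(cid, array_of_as[cid], maps_with_a[cid])
```

So a filter-style call can only drop maps from the array, while a transform-style call is what turns each map into a single value.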
