Convert String to Map in Spark

I have the data below in a CSV file with delimiter |. I want to convert the string in the PersonalInfo column to a map so that I can extract the required information.

When I try to convert this CSV to Parquet, casting the PersonalInfo string to a map with cast, I get a datatype mismatch error.

The data is below for your reference. Your help is much appreciated.

Empcode EmpName PersonalInfo
1       abc     """email"":""[email protected]"",""Location"":""India"",""Gender"":""Male"""
2       xyz     """email"":""[email protected]"",""Location"":""US"""
3       pqr     """email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""
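For reference, the file is read roughly like this (a minimal sketch; the file name emp.csv and the header option are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-map").getOrCreate()

// PersonalInfo arrives as a plain string column
val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("emp.csv")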

Thanks

CodePudding user response:

One simple way is to use the str_to_map function after you get rid of the double quotes in the PersonalInfo column:

import org.apache.spark.sql.functions.expr

val df1 = df.withColumn(
  "PersonalInfo",
  // Strip every double quote, then let str_to_map parse the resulting
  // "key:value,key:value" string (its default delimiters are , and :)
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)

df1.show(false)

//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo                                                                  |
//+-------+-------+------------------------------------------------------------------------------+
//|1      |abc    |{email -> [email protected], Location -> India, Gender -> Male}                   |
//|2      |xyz    |{email -> [email protected], Location -> US}                                      |
//|3      |pqr    |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+
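Individual values can then be pulled out of the map by key, for example (a short sketch; a missing key, such as Gender for Empcode 2, simply yields null):

import org.apache.spark.sql.functions.col

df1.select(
  col("Empcode"),
  col("PersonalInfo")("email").as("email"),    // map lookup by key
  col("PersonalInfo")("Gender").as("Gender")   // null where the key is absent
).show(false)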

CodePudding user response:

If you want to create a map from the PersonalInfo column, from Spark 3.0 onward you can proceed as follows:

  • Split the string on "","" using the split function
  • For each element of the resulting array, build a sub-array by splitting on "":"" with split
  • Remove all remaining "" from the sub-array elements using regexp_replace
  • Build a map entry from each key/value pair using the struct function
  • Use map_from_entries to build the map from the array of entries (a quick trace of these steps on row 1 is sketched right after this list)
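Before translating this into DataFrame operations, the splitting logic can be sanity-checked on row 1's raw value in plain Scala (a sketch only; the "" quotes are escaped as \"\" in the literal):

val raw = "\"\"email\"\":\"\"[email protected]\"\",\"\"Location\"\":\"\"India\"\",\"\"Gender\"\":\"\"Male\"\""

// Step 1: one element per key/value pair
val entries = raw.split("\"\",\"\"")
// Array(""email"":""[email protected], Location"":""India, Gender"":""Male"")

// Steps 2-3: split each pair on "":"" and strip the leftover ""
val pairs = entries.map { e =>
  val kv = e.split("\"\":\"\"")
  (kv(0).replace("\"\"", ""), kv(1).replace("\"\"", ""))
}
// Array((email,[email protected]), (Location,India), (Gender,Male))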

Complete code is as follows:

import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}

val result = data.withColumn("PersonalInfo",
  map_from_entries(
    transform(
      // One array element per key/value pair
      split(col("PersonalInfo"), "\"\",\"\""),
      // Split each pair on "":"", strip leftover "", and wrap it as a (key, value) struct
      item => struct(
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)
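The snippet assumes a data DataFrame holding the raw rows. For a self-contained run it can be built like this (a sketch mirroring the input table below; the literal "" quotes are escaped as \"\" in the Scala strings):

import spark.implicits._  // requires an active SparkSession named spark

val data = Seq(
  (1, "abc", "\"\"email\"\":\"\"[email protected]\"\",\"\"Location\"\":\"\"India\"\",\"\"Gender\"\":\"\"Male\"\""),
  (2, "xyz", "\"\"email\"\":\"\"[email protected]\"\",\"\"Location\"\":\"\"US\"\""),
  (3, "pqr", "\"\"email\"\":\"\"[email protected]\"\",\"\"Gender\"\":\"\"Female\"\",\"\"Location\"\":\"\"Europe\"\",\"\"Mobile\"\":\"\"1234\"\"")
).toDF("Empcode", "EmpName", "PersonalInfo")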

With the following input dataframe:

+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                                 |
+-------+-------+---------------------------------------------------------------------------------------------+
|1      |abc    |""email"":""[email protected]"",""Location"":""India"",""Gender"":""Male""                       |
|2      |xyz    |""email"":""[email protected]"",""Location"":""US""                                              |
|3      |pqr    |""email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+

You get the following result dataframe:

+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                  |
+-------+-------+------------------------------------------------------------------------------+
|1      |abc    |{email -> [email protected], Location -> India, Gender -> Male}                   |
|2      |xyz    |{email -> [email protected], Location -> US}                                      |
|3      |pqr    |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+