Below data in csv file with delimeter |
, I want to convert string to Map for PersonalInfo
column data so that I can extract required information.
I try to convert below csv to parquet format with String
to Map
using Cast I got datatype mismatch error.
Below is data for your ref. Your help is much appreciated.
Empcode EmpName PersonalInfo
1 abc """email"":""[email protected]"",""Location"":""India"",""Gender"":""Male"""
2 xyz """email"":""[email protected]"",""Location"":""US"""
3 pqr """email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""
Thanks
CodePudding user response:
One simple way is to use str_to_map
function after you get rid of the double quotes from PersonalInfo
column:
val df1 = df.withColumn(
"PersonalInfo",
expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)
df1.show(false)
// ------- ------- ------------------------------------------------------------------------------
//|Empcode|EmpName|PersonalInfo |
// ------- ------- ------------------------------------------------------------------------------
//|1 |abc |{email -> [email protected], Location -> India, Gender -> Male} |
//|2 |xyz |{email -> [email protected], Location -> US} |
//|3 |pqr |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
// ------- ------- ------------------------------------------------------------------------------
CodePudding user response:
If you want to create a map from PersonalInfo
column, from Spark 3.0 you can proceed as follows:
- Split your string according to
"",""
usingsplit
function - For each element of your obtained string array, create sub-arrays according to
"":""
usingsplit
function - Remove all
""
from elements of sub-arrays usingregexp_replace
function - Build map entries using
struct
function - Use
map_from_entries
to build map from your array of entries
Complete code is as follows:
import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}
val result = data.withColumn("PersonalInfo",
map_from_entries(
transform(
split(col("PersonalInfo"), "\"\",\"\""),
item => struct(
regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
)
)
)
)
With the following input_dataframe
:
------- ------- ---------------------------------------------------------------------------------------------
|Empcode|EmpName|PersonalInfo |
------- ------- ---------------------------------------------------------------------------------------------
|1 |abc |""email"":""[email protected]"",""Location"":""India"",""Gender"":""Male"" |
|2 |xyz |""email"":""[email protected]"",""Location"":""US"" |
|3 |pqr |""email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
------- ------- ---------------------------------------------------------------------------------------------
You get the following result
dataframe:
------- ------- ------------------------------------------------------------------------------
|Empcode|EmpName|PersonalInfo |
------- ------- ------------------------------------------------------------------------------
|1 |abc |{email -> [email protected], Location -> India, Gender -> Male} |
|2 |xyz |{email -> [email protected], Location -> US} |
|3 |pqr |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
------- ------- ------------------------------------------------------------------------------