I have a dataframe that contains a field item, which is a string holding an array of items:
[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"},{"item":"72LC5SMF"},{"item":"8N6DW3VD"},{"item":"045QHTU4"},{"item":"0UL4MMSI"}]
root
|-- item: string (nullable = true)
I would like to get item as an array of strings. Can someone let me know if there is an easy way to do this with the built-in from_json?
root
|-- item: array (nullable = true)
So that I will only have
["76CJMX4Y", "7PWZVWCG", "967NBPMS", "72LC5SMF", "8N6DW3VD", "045QHTU4", "0UL4MMSI"]
Thanks
CodePudding user response:
Use Spark's built-in function from_json to parse the string into an array of structs, then use the higher-order function transform to extract item from each struct.
Example:
//from_json parses the string into an array of structs; transform then extracts item from each element
import org.apache.spark.sql.functions._
df.selectExpr("""transform(from_json(item,'array<struct<item:string>>'),x->x.item) as item""").show(10,false)
//+----------------------------------------------------------------------+
//|item                                                                  |
//+----------------------------------------------------------------------+
//|[76CJMX4Y, 7PWZVWCG, 967NBPMS, 72LC5SMF, 8N6DW3VD, 045QHTU4, 0UL4MMSI]|
//+----------------------------------------------------------------------+
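To see what that expression produces without spinning up Spark, here is a plain-Scala illustration: a simple regex stands in for from_json + transform and pulls out each item value. The regex is only for demonstration; in an actual job, use the from_json expression above.

```scala
// Plain-Scala sketch (no Spark): mimics what
// from_json(item, 'array<struct<item:string>>') followed by
// transform(..., x -> x.item) produces for this input.
val raw = """[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"},{"item":"72LC5SMF"},{"item":"8N6DW3VD"},{"item":"045QHTU4"},{"item":"0UL4MMSI"}]"""
val itemPattern = """"item":"([^"]+)"""".r
val items = itemPattern.findAllMatchIn(raw).map(_.group(1)).toList
println(items) // List(76CJMX4Y, 7PWZVWCG, 967NBPMS, 72LC5SMF, 8N6DW3VD, 045QHTU4, 0UL4MMSI)
```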
CodePudding user response:
You could use split() on :, then sort the values with sort_array() (so that the values you're not interested in are either at the top or the bottom), then filter using slice().
For your reference: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html (even if it's the Java version, it's a concise summary of the available functions).
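A rough plain-Scala sketch of the split() idea (without the sort_array/slice step, which this particular input doesn't really need): after splitting on :, the value fragments all start with a quote, so filtering and trimming recovers the codes. This is brittle compared with from_json and is shown only to illustrate the approach.

```scala
// Hedged sketch of the split-on-':' idea in plain Scala (no Spark).
// Splitting leaves fragments like "76CJMX4Y"},{"item" ; the value
// fragments start with a double quote, the key fragments do not.
val raw = """[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"}]"""
val items = raw.split(":").toList
  .filter(_.startsWith("\""))              // keep only the value fragments
  .map(f => f.drop(1).takeWhile(_ != '"')) // text between the quotes
println(items) // List(76CJMX4Y, 7PWZVWCG, 967NBPMS)
```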