I have a dataframe that contains a field item, which is a string holding an array of items:
[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"},{"item":"72LC5SMF"},{"item":"8N6DW3VD"},{"item":"045QHTU4"},{"item":"0UL4MMSI"}]
root
|-- item: string (nullable = true)
I would like to get item as an array of strings. Can someone let me know if there is an easy way to do this with the built-in from_json?
root
|-- item: array (nullable = true)
So that I will only have
["76CJMX4Y", "7PWZVWCG", "967NBPMS", "72LC5SMF", "8N6DW3VD", "045QHTU4", "0UL4MMSI"]
Thanks
CodePudding user response:
Use Spark's built-in function from_json to parse the string into an array of structs, then use the higher-order function transform to extract item from each struct.
Example:
//from_json parses the string into an array of structs; transform then extracts item from each element
import org.apache.spark.sql.functions._
df.selectExpr("""transform(from_json(item,'array<struct<item:string>>'),x->x.item) as item""").show(10,false)
//+----------------------------------------------------------------------+
//|item                                                                  |
//+----------------------------------------------------------------------+
//|[76CJMX4Y, 7PWZVWCG, 967NBPMS, 72LC5SMF, 8N6DW3VD, 045QHTU4, 0UL4MMSI]|
//+----------------------------------------------------------------------+
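To see what that expression produces without spinning up Spark, here is a plain-Scala illustration: a simple regex stands in for from_json + transform and pulls out each item value. The regex is only for demonstration; in an actual job, use the from_json expression above.

```scala
// Plain-Scala sketch (no Spark): mimics what
// from_json(item, 'array<struct<item:string>>') followed by
// transform(..., x -> x.item) produces for this input.
val raw = """[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"},{"item":"72LC5SMF"},{"item":"8N6DW3VD"},{"item":"045QHTU4"},{"item":"0UL4MMSI"}]"""
val itemPattern = """"item":"([^"]+)"""".r
val items = itemPattern.findAllMatchIn(raw).map(_.group(1)).toList
println(items) // List(76CJMX4Y, 7PWZVWCG, 967NBPMS, 72LC5SMF, 8N6DW3VD, 045QHTU4, 0UL4MMSI)
```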
CodePudding user response:
You could use split() on :, then sort the values with sort_array() (so that the values you're not interested in are either at the top or the bottom), then filter using slice().
For your reference: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html (even if it's the Java version, it's a concise summary of the available functions).
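A rough plain-Scala sketch of the split() idea (without the sort_array/slice step, which this particular input doesn't really need): after splitting on :, the value fragments all start with a quote, so filtering and trimming recovers the codes. This is brittle compared with from_json and is shown only to illustrate the approach.

```scala
// Hedged sketch of the split-on-':' idea in plain Scala (no Spark).
// Splitting leaves fragments like "76CJMX4Y"},{"item" ; the value
// fragments start with a double quote, the key fragments do not.
val raw = """[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"}]"""
val items = raw.split(":").toList
  .filter(_.startsWith("\""))              // keep only the value fragments
  .map(f => f.drop(1).takeWhile(_ != '"')) // text between the quotes
println(items) // List(76CJMX4Y, 7PWZVWCG, 967NBPMS)
```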