I know that you can use the provided helper functions to retrieve subfields from message segments, for example:
val nameDf = df.select(segment_field("PID", 4).alias("name"))
Is there a way I can extract the entire message segment ("PID") instead of having to put in an index for each subfield?
PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA
CodePudding user response:
segment_field, as implemented here, is a helper method that extracts a single field value given its index:
/**
 * Extracts a field from a message segment.
 *
 * @param segment The ID of the segment to extract.
 * @param field The index of the field to extract.
 * @param segmentColumn The name of the column containing message segments.
 *                      Defaults to "segments".
 * @return Yields a new column containing the field of a message segment.
 *
 * @note If there are multiple segments with the same ID, this function will
 *       select the field from one of the segments. Order is undefined.
 */
def segment_field(segment: String,
                  field: Int,
                  segmentColumn: Column = col("segments")): Column = {
  filter(segmentColumn, s => s("id") === lit(segment))
    .getItem(0)
    .getField("fields")
    .getItem(field)
}
However, if you want all of the field values for a segment, regardless of index, you can replicate this behaviour by dropping the final getItem(field) call:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def segment_fields(segment: String,
                   segmentColumn: Column = col("segments")): Column = {
  filter(segmentColumn, s => s("id") === lit(segment))
    .getItem(0)
    .getField("fields")
}
and use it as follows:
val nameDf = df.select(segment_fields("PID").alias("values"))
The resulting column holds every field of the PID segment, so you can then extract or transform the data as desired.
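As a rough illustration of that last step, here is a minimal sketch of pulling individual values back out of the returned column. It assumes the "segments" column is an array of structs with an "id" string and a "fields" array of strings. The Segment and Message case classes, sampleDf, and patientDf are hypothetical names made up for this example (they stand in for a DataFrame read by the library), and it assumes Spark 3.0 or later for the filter function that takes a lambda.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical stand-ins for the parsed message schema (an assumption):
// a "segments" column holding structs with an "id" and an array of "fields".
case class Segment(id: String, fields: Seq[String])
case class Message(segments: Seq[Segment])

object SegmentFieldsExample {

  // Same helper as in the answer: returns the whole "fields" array of a segment.
  def segment_fields(segment: String,
                     segmentColumn: Column = col("segments")): Column = {
    filter(segmentColumn, s => s("id") === lit(segment))
      .getItem(0)
      .getField("fields")
  }

  // Builds a one-row DataFrame from the sample PID line in the question.
  // This only mimics the parsed layout; it is not how the library parses messages.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val pid = "PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||" +
      "140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA"
    val parts = pid.split("\\|", -1) // -1 keeps empty fields
    Seq(Message(Seq(Segment(parts.head, parts.tail.toSeq)))).toDF()
  }

  // Extracts a few values from the array returned by segment_fields.
  def patientDf(df: DataFrame): DataFrame = {
    val name = split(col("values").getItem(4), "\\^") // components are ^-separated
    df.select(segment_fields("PID").alias("values"))
      .select(
        col("values").getItem(2).alias("patient_id"),
        name.getItem(0).alias("family_name"),
        name.getItem(1).alias("given_name"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("segment-fields-example")
      .getOrCreate()
    patientDf(sampleDf(spark)).show(false)
    spark.stop()
  }
}
```

Selecting the array once and indexing into it afterwards keeps the segment lookup in a single place, which may be easier to maintain than calling segment_field separately for every index.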