I need to derive two new fields, beaver_id and llama_id, from the schema structure below. There is some if/else logic that needs to be applied to an array of structs. The desired end result is a CSV output. What is the best approach for this?
Schema:
root
 |-- Animal: struct (nullable = true)
 |    |-- Species: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- unique_id: string (nullable = true)
Pseudo Code:
If name == "Beaver"
then get unique_id and put in dataframe column "beaver_id"
else
null in column "beaver_id"
If name == "Llama"
then get unique_id and put in dataframe column "llama_id"
else
null in column "llama_id"
If array of names does not contain "Llama" or "Beaver"
then null for both "beaver_id" and "llama_id"
Currently: I am using the select DataFrame function to pull elements out of the (Parquet) input to create a CSV output. I am extracting many other elements with this approach besides the ones in this question.
val select_df = raw_df.select(
  col("Animal.Species").getField("name"),
  col("Animal.Species").getField("color")
)
Example Input (in JSON), Actual input is parquet:
{
  "Animal": {
    "Species": [
      {
        "name": "Beaver",
        "color": "red",
        "unique_id": "1001"
      },
      {
        "name": "Llama",
        "color": "blue",
        "unique_id": "2222"
      }
    ]
  }
}
Expected CSV output:
beaver_id, llama_id
1001, 2222
CodePudding user response:
When you have to work with a large number of checks, or with complex ones, it is usually best to use a strategy pattern: it simplifies the task and scales well when you want to add more checks later.
In that spirit, it seems more convenient to create a context that determines what kind of "Species" the record you are verifying is, and then apply the logic you need; in this case, extracting the required value and assigning it to the field determined by the context.
If I understood your question correctly, I think this would be the best approach for this task; let me know if it fits your case and if you have any questions about it.
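For concreteness, here is a minimal sketch of that idea in Spark, where a species-to-column mapping plays the role of the context (the name speciesColumns and the filter-based extraction are illustrative assumptions, not a fixed implementation):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{element_at, expr}

// Hypothetical context: maps each species we care about to its output column.
// Adding another check is just one more entry in this sequence.
val speciesColumns = Seq(
  "Beaver" -> "beaver_id",
  "Llama"  -> "llama_id"
)

val select_df: DataFrame = raw_df.select(
  speciesColumns.map { case (species, outCol) =>
    // Keep the first array element whose name matches, then read its unique_id.
    element_at(expr(s"filter(Animal.Species, x -> x.name = '$species')"), 1)
      .getField("unique_id")
      .as(outCol)
  }: _*
)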
CodePudding user response:
You can use the filter function on the Animal.Species array column like this:
val select_df = raw_df.select(
element_at(expr("filter(Animal.Species, x -> x.name = 'Beaver')"), 1)
.getField("unique_id")
.as("beaver_id"),
element_at(expr("filter(Animal.Species, x -> x.name = 'Llama')"), 1)
.getField("unique_id")
.as("llama_id")
)
select_df.show
//+---------+--------+
//|beaver_id|llama_id|
//+---------+--------+
//|     1001|    2222|
//+---------+--------+
The logic is quite simple: we filter the array to find the inner struct whose name is Beaver or Llama, and take its unique_id. If no such struct is found, null is returned.
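Since the desired end result is a CSV file, you can then write select_df out with Spark's CSV writer (the output path below is just a placeholder):
select_df
  .write
  .option("header", "true")
  .csv("/path/to/output")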
Note that since Spark 3, you can also use the filter function within the DataFrame API.
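As a sketch of that variant (assuming Spark 3.x, where org.apache.spark.sql.functions.filter accepts a predicate over a Column):
import org.apache.spark.sql.functions.{col, element_at, filter}

val select_df = raw_df.select(
  // filter takes a Column => Column predicate since Spark 3.0
  element_at(filter(col("Animal.Species"), x => x("name") === "Beaver"), 1)
    .getField("unique_id")
    .as("beaver_id"),
  element_at(filter(col("Animal.Species"), x => x("name") === "Llama"), 1)
    .getField("unique_id")
    .as("llama_id")
)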