I am having a scenario where I will be receiving data in csv files and there I need to generate some columns with the existing one. Example:
Col_1 Col_2 Col_3 Col_4
abc 1 No 123
xyz 2 Yes 123
def 1 Yes 345
Expected:
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6
abc 1 No 123 1 1
xyz 2 Yes 123 0 0
def 1 Yes 345 0 0
Col_5 Condition : if Col_1 = 'abc' then 1 else 0 end Col_6 Condition : max(Col_5) over (Col_2)
I know we can perform transformations in Druid when we loading the file in it, I tried simpler condition which is working fine for me, but I am Pretty doubt to perform aggregate and other transformation like Col_6 here.
Also we need to perform aggregate on different files data which we going to receive, Assume we get 2 file today and we loaded the data to Druid table, Tomorrow again we got some 3 files which is having data for same (ID) which is Col_2 here then we need to do aggregation based on all the records we have, Example : Col_6 generation here...
Shall this will be possible in Druid?
CodePudding user response:
Col_5 Condition : if Col_1 = 'abc' then 1 else 0
You can use the following
df = df.withColumn('Col_5', f.when((f.col('Col_1') == 'abc'), 1).otherwise(0))
Col_6 Condition : max(Col_5) over (Col_2)
You can apply window operation
windowSpec = Window.partitionBy("Col_2").orderBy("Col_5").desc()
df_max = df.withColumn("row_number", row_number().over(windowSpec)).filter(
f.col("row_number") == 1
)
Now remove duplicates for each Col_2
and then join the df_max
with your main df.
The above code snippet is in python, but spark API is the same so you can use it with minimal changes.
CodePudding user response:
The first type,if Col_1 = 'abc' then 1 else 0
, would not be hard. Eg, see this article with similar examples.
The second, aggregating over one of the columns, doesn't sound possible. We can aggregate over all the dimensions taken together (like a primary key), but not over one single dimension, afaik.
CodePudding user response:
Take a look at https://druid.apache.org/docs/latest/misc/math-expr.html which contains many transform expressions you can use.
In particular, I tested your use case with the wikipedia demo data by creating the following expressions:
{
"type": "expression",
"name": "isNB",
"expression": "case_simple(\"namespace\", 'Main',1,0)"
},
{
"type": "expression",
"expression": "greatest( case_simple(\"IsNew\", True, 1, 0), case_simple(\"namespace\", 'Main',1,0)",
"name": "combined_calc"
}
One thing to note is that transform expressions cannot refer to other transform expressions, so calculations need to all be done from the raw input fields.