Can we able to do Transformation in Druid-CodePudding

I am having a scenario where I will be receiving data in csv files and there I need to generate some columns with the existing one. Example:

Col_1   Col_2    Col_3   Col_4
abc     1        No      123
xyz     2        Yes     123
def     1        Yes     345

Expected:

Col_1    Col_2    Col_3   Col_4   Col_5   Col_6
abc      1        No      123     1       1
xyz      2        Yes     123     0       0
def      1        Yes     345     0       0

Col_5 Condition : if Col_1 = 'abc' then 1 else 0 end Col_6 Condition : max(Col_5) over (Col_2)

I know we can perform transformations in Druid when we loading the file in it, I tried simpler condition which is working fine for me, but I am Pretty doubt to perform aggregate and other transformation like Col_6 here.

Also we need to perform aggregate on different files data which we going to receive, Assume we get 2 file today and we loaded the data to Druid table, Tomorrow again we got some 3 files which is having data for same (ID) which is Col_2 here then we need to do aggregation based on all the records we have, Example : Col_6 generation here...

Shall this will be possible in Druid?

CodePudding user response：

Col_5 Condition : if Col_1 = 'abc' then 1 else 0

You can use the following

df = df.withColumn('Col_5', f.when((f.col('Col_1') == 'abc'), 1).otherwise(0))

Col_6 Condition : max(Col_5) over (Col_2)

You can apply window operation

windowSpec = Window.partitionBy("Col_2").orderBy("Col_5").desc()

df_max = df.withColumn("row_number", row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)

Now remove duplicates for each Col_2 and then join the df_max with your main df.

The above code snippet is in python, but spark API is the same so you can use it with minimal changes.

CodePudding user response：

The first type,if Col_1 = 'abc' then 1 else 0, would not be hard. Eg, see this article with similar examples.

The second, aggregating over one of the columns, doesn't sound possible. We can aggregate over all the dimensions taken together (like a primary key), but not over one single dimension, afaik.

CodePudding user response：

Take a look at https://druid.apache.org/docs/latest/misc/math-expr.html which contains many transform expressions you can use.

In particular, I tested your use case with the wikipedia demo data by creating the following expressions:

    {
            "type": "expression",
            "name": "isNB",
            "expression": "case_simple(\"namespace\", 'Main',1,0)"
          },

          {
            "type": "expression",
            "expression": "greatest( case_simple(\"IsNew\", True, 1, 0), case_simple(\"namespace\", 'Main',1,0)",
            "name": "combined_calc"
          }

One thing to note is that transform expressions cannot refer to other transform expressions, so calculations need to all be done from the raw input fields.