How to group unassociated content


I have a Hive table that records user behavior, like this:

userid  behavior  timestamp   url
1       view      1650022601  url1
1       click     1650022602  url2
1       click     1650022614  url3
1       view      1650022617  url4
1       click     1650022622  url5
1       view      1650022626  url7
2       view      1650022628  url8
2       view      1650022631  url9

About 400GB is added to the table every day.

I want to order the rows by timestamp ascending, then put every row from one 'view' up to (but not including) the next 'view' into the same group. In the table above, the first 3 lines belong to the same group. Then I want to subtract the timestamps within each group, like 1650022614 - 1650022601, as the view time.

How can I do this?

I tried the lag and lead functions, and also Scala like this:

        import org.apache.spark.rdd.RDD

        // driver-side counter: each task gets its own copy of this var,
        // so the numbering is only consistent within a single partition
        var partition = 0

        val pairRDD: RDD[(Int, String)] = record.map(x => {
            if (StringUtil.isDateString(x.split("\\s+")(0))) {
                partition = partition + 1   // a new group starts here
                (partition, x)
            } else {
                (partition, x)
            }
        })

or Java like this:

        // accumulator used as a running group id; note that reading
        // part.value() inside executor code is not guaranteed to be accurate,
        // and coalesce(1) forces everything onto a single task
        LongAccumulator part = spark.sparkContext().longAccumulator("part");

        JavaPairRDD<Long, Row> pairRDD = spark.sql(sql).coalesce(1).javaRDD()
                .mapToPair((PairFunction<Row, Long, Row>) row -> {
                    if ("pageview".equals(row.getAs("event"))) {
                        part.add(1L);
                    }
                    return new Tuple2<>(part.value(), row);
                });

But when the dataset is very large, this code is just too slow.

save me plz

CodePudding user response:

If you use DataFrames, you can build the partition column by summing, over a window, a column whose value is 1 when a new partition starts and 0 when it does not.

You can transform an RDD into a DataFrame with the sparkSession.createDataFrame() method, as explained in this answer.
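
For example, here is a minimal sketch of that conversion, assuming the raw lines are already parsed into a case class (the Event class, the local master setting, and the sample rows reused from the question are illustrative placeholders, not taken from the original post):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative row type matching the columns shown in the question
case class Event(userid: Int, behavior: String, timestamp: Long, url: String)

val spark: SparkSession = SparkSession.builder()
  .master("local[*]")   // local master just for this sketch
  .getOrCreate()

// A tiny RDD built from the sample rows in the question, standing in for
// the real parsed RDD
val events = spark.sparkContext.parallelize(Seq(
  Event(1, "view", 1650022601L, "url1"),
  Event(1, "click", 1650022602L, "url2"),
  Event(1, "click", 1650022614L, "url3")
))

// createDataFrame infers the schema from the case class fields
val df: DataFrame = spark.createDataFrame(events)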

Back to your problem. In your case, a new partition starts every time the column behavior is equal to "view". So we can start with this condition:

import org.apache.spark.sql.functions.col

val df1 = df.withColumn("is_view", (col("behavior") === "view").cast("integer"))

You get the following DataFrame:

+------+--------+----------+----+-------+
|userid|behavior|timestamp |url |is_view|
+------+--------+----------+----+-------+
|1     |view    |1650022601|url1|1      |
|1     |click   |1650022602|url2|0      |
|1     |click   |1650022614|url3|0      |
|1     |view    |1650022617|url4|1      |
|1     |click   |1650022622|url5|0      |
|1     |view    |1650022626|url7|1      |
|2     |view    |1650022628|url8|1      |
|2     |view    |1650022631|url9|1      |
+------+--------+----------+----+-------+

Then you sum the is_view column over a window partitioned by userid and ordered by timestamp:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df2 = df1.withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))

This gives you the following DataFrame:

+------+--------+----------+----+-------+---------+
|userid|behavior|timestamp |url |is_view|partition|
+------+--------+----------+----+-------+---------+
|1     |view    |1650022601|url1|1      |1        |
|1     |click   |1650022602|url2|0      |1        |
|1     |click   |1650022614|url3|0      |1        |
|1     |view    |1650022617|url4|1      |2        |
|1     |click   |1650022622|url5|0      |2        |
|1     |view    |1650022626|url7|1      |3        |
|2     |view    |1650022628|url8|1      |1        |
|2     |view    |1650022631|url9|1      |2        |
+------+--------+----------+----+-------+---------+

Then, you just have to aggregate per userid and partition:

import org.apache.spark.sql.functions.{max, min}

val result = df2.groupBy("userid", "partition")
  .agg((max("timestamp") - min("timestamp")).as("duration"))

And you get the following results:

+------+---------+--------+
|userid|partition|duration|
+------+---------+--------+
|1     |1        |13      |
|1     |2        |5       |
|1     |3        |0       |
|2     |1        |0       |
|2     |2        |0       |
+------+---------+--------+

The complete Scala code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, sum}

val result = df
  .withColumn("is_view", (col("behavior") === "view").cast("integer"))
  .withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))
  .groupBy("userid", "partition")
  .agg((max("timestamp") - min("timestamp")).as("duration"))
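
And since the source is a Hive table, df can also be read from it directly instead of going through an RDD. A minimal sketch, assuming a Hive-enabled SparkSession and a table named user_behavior (the table name is a placeholder, not from the original post):

import org.apache.spark.sql.SparkSession

// Hive-enabled session so that spark.table can see the Hive metastore
val spark = SparkSession.builder()
  .appName("view-duration")   // arbitrary app name
  .enableHiveSupport()
  .getOrCreate()

// user_behavior is a placeholder for the real table name; it is expected
// to have the userid, behavior, timestamp and url columns shown above
val df = spark.table("user_behavior")

Because the window is partitioned by userid, the computation stays distributed: only the rows of a single user need to end up together, which avoids the coalesce(1) and accumulator bottlenecks from the question.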