Pyspark: Efficient way for searching correct and plausible measurement data and taking them as a new DataFrame


I have a huge amount of data with a column containing multiple products and two signals, "ID" and "Value". The signal "ID" increases from 1 to 40 several times a day, and for each ID I have a value. In this example the data comes in at 3 Hz, and I only write down the IDs 1-3 to keep the table from getting too long.

Product Date ID Value
A 20.10.2021T20:01:01 1 32
A 20.10.2021T20:01:01 1 32
A 20.10.2021T20:01:01 1 32
A 20.10.2021T20:01:03 3 52
A 20.10.2021T20:01:03 3 52
A 20.10.2021T20:01:04 1 32
A 20.10.2021T20:01:04 1 32
A 20.10.2021T20:01:04 1 32
A 20.10.2021T20:01:05 2 40
A 20.10.2021T20:01:05 2 40
A 20.10.2021T20:01:06 3 52
B 20.10.2021T16:12:06 1 20
B 20.10.2021T16:12:06 1 20
B 20.10.2021T16:12:06 1 20
B 20.10.2021T16:12:07 2 68
B 20.10.2021T16:12:07 2 68
B 20.10.2021T16:12:07 2 68
B 20.10.2021T16:12:08 3 41
B 20.10.2021T16:12:08 3 41
B 20.10.2021T16:12:08 3 41
B 20.12.2021T16:12:09 1 20
B 20.10.2021T16:12:11 3 41
B 20.10.2021T16:12:11 3 41

The problem is that in some parts of the data the signal is not recorded, or I get garbage entries. Therefore I only want, per day, the first block of data that seems plausible. It should look something like this:

Product Date ID Value
A 20.10.2021T20:01:04 1 32
A 20.10.2021T20:01:05 2 40
A 20.10.2021T20:01:06 3 52
B 20.10.2021T16:12:06 1 20
B 20.10.2021T16:12:07 2 68
B 20.10.2021T16:12:08 3 41

Given the size of the data, I'm also looking for an efficient way to solve this problem.
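
For anyone who wants to try the answers below, here is a minimal sketch that rebuilds the sample above as a PySpark DataFrame (the rows are abbreviated from the table, and the dates are kept as plain strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# abbreviated sample rows from the table above
rows = [
    ("A", "20.10.2021T20:01:01", 1, 32),
    ("A", "20.10.2021T20:01:03", 3, 52),
    ("A", "20.10.2021T20:01:04", 1, 32),
    ("A", "20.10.2021T20:01:05", 2, 40),
    ("A", "20.10.2021T20:01:06", 3, 52),
    ("B", "20.10.2021T16:12:06", 1, 20),
    ("B", "20.10.2021T16:12:07", 2, 68),
    ("B", "20.10.2021T16:12:08", 3, 41),
]
df = spark.createDataFrame(rows, ["Product", "Date", "ID", "Value"])
df.show(truncate=False)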

CodePudding user response:

You can try the Spark functions groupby and then first:

from pyspark.sql import functions as F

# keep the first product/date that appears for every (ID, Value) combination
df.groupby("ID", "Value").agg(F.first("product"), F.first("Date")).show(10, False)

Output:

+---+-----+--------------+-------------------+
|ID |Value|first(product)|first(Date)        |
+---+-----+--------------+-------------------+
|2  |40   |A             |20.10.2021T20:01:05|
|2  |68   |B             |20.10.2021T16:12:07|
|3  |52   |A             |20.10.2021T20:01:03|
|1  |20   |B             |20.10.2021T16:12:06|
|1  |32   |A             |20.10.2021T20:01:01|
|3  |41   |B             |20.10.2021T16:12:08|
+---+-----+--------------+-------------------+
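
Note that this groups only by "ID" and "Value". If the result should additionally be limited to one row per product and calendar day, as the question describes, the grouping could be extended with a derived day column, for example (a sketch; the to_date format string is an assumption based on the dd.MM.yyyy'T'HH:mm:ss timestamps in the sample, and column names follow the question's table):

from pyspark.sql import functions as F

# derive the calendar day from the timestamp string, then keep the first
# Date/Value that appears per product, day and ID
result = (
    df.withColumn("Day", F.to_date("Date", "dd.MM.yyyy'T'HH:mm:ss"))
      .groupby("Product", "Day", "ID")
      .agg(F.first("Date").alias("Date"), F.first("Value").alias("Value"))
)
result.show(10, False)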

CodePudding user response:

In order to make your question more understandable for future readers, I have put together a small replica, i.e. a minimal reproducible example, of your dataframe with a total of 60 entries for the columns "Products", "Date", "ID" and "Values". This way you can see, for example, the behaviour of groupby('ID', 'Products', 'Date').agg(F.first('Values')) and of other functions or ordering algorithms that you can apply to the id and value columns. I hope it can help you.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd
import random
import datetime

spark = SparkSession.builder.getOrCreate()

# values are just random numbers between 1 and 99
value = [random.randint(1, random.randint(1, 99)) for _ in range(60)]

# data recorded for 60 entries, i.e. one minute at 1 Hz
date = [datetime.datetime(2021, 10, 20, 20, 1, i).isoformat() for i in range(60)]

# random product ('A' or 'B') and random ID (1-3) for each entry
product = [['A', 'B'][random.randint(0, 1)] for _ in range(60)]
_id = [random.randint(1, 3) for _ in range(60)]

data = [[product[i], date[i], _id[i], value[i]] for i in range(60)]
columns = ["Products", "Date", "ID", "Values"]

# dataframe
df = spark.createDataFrame(pd.DataFrame(data=data, columns=columns))

df.groupby('ID', 'Products', 'Date').agg(F.first('Values')).show()
+---+--------+-------------------+-------------+
| ID|Products|               Date|first(Values)|
+---+--------+-------------------+-------------+
|  1|       A|2021-10-20T20:01:35|           39|
|  2|       B|2021-10-20T20:01:00|            6|
|  3|       A|2021-10-20T20:01:20|           29|
|  3|       B|2021-10-20T20:01:45|           87|
|  3|       A|2021-10-20T20:01:47|           17|
|  3|       B|2021-10-20T20:01:03|            1|
|  1|       A|2021-10-20T20:01:38|           44|
|  2|       B|2021-10-20T20:01:46|           15|
|  3|       A|2021-10-20T20:01:51|           15|
|  2|       A|2021-10-20T20:01:25|           69|
|  3|       B|2021-10-20T20:01:56|           22|
|  2|       A|2021-10-20T20:01:40|           28|
|  1|       A|2021-10-20T20:01:53|           47|
|  1|       A|2021-10-20T20:01:32|            7|
|  2|       B|2021-10-20T20:01:05|           33|
|  2|       B|2021-10-20T20:01:59|            8|
|  1|       A|2021-10-20T20:01:15|           48|
|  3|       A|2021-10-20T20:01:26|           29|
|  1|       A|2021-10-20T20:01:09|           28|
|  3|       B|2021-10-20T20:01:29|           53|
+---+--------+-------------------+-------------+
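
Keep in mind that F.first() without an explicit sort is not guaranteed to pick the chronologically first row in each group. If a deterministic "earliest entry per group" is wanted, a window function is one option (a sketch using the column names of the replica dataframe above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank the rows inside each (Products, ID) group by Date and keep the earliest
w = Window.partitionBy("Products", "ID").orderBy("Date")
first_rows = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
first_rows.show()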