I have a huge amount of data with a column for multiple products and two signals, "ID" and "Value". The signal "ID" counts up from 1 to 40 several times a day, and each ID has a value. The data arrives at 3 Hz; in this example I only write down IDs 1-3 to keep the table from getting too long.
Product | Date | ID | Value |
---|---|---|---|
A | 20.10.2021T20:01:01 | 1 | 32 |
A | 20.10.2021T20:01:01 | 1 | 32 |
A | 20.10.2021T20:01:01 | 1 | 32 |
A | 20.10.2021T20:01:03 | 3 | 52 |
A | 20.10.2021T20:01:03 | 3 | 52 |
A | 20.10.2021T20:01:04 | 1 | 32 |
A | 20.10.2021T20:01:04 | 1 | 32 |
A | 20.10.2021T20:01:04 | 1 | 32 |
A | 20.10.2021T20:01:05 | 2 | 40 |
A | 20.10.2021T20:01:05 | 2 | 40 |
A | 20.10.2021T20:01:06 | 3 | 52 |
B | 20.10.2021T16:12:06 | 1 | 20 |
B | 20.10.2021T16:12:06 | 1 | 20 |
B | 20.10.2021T16:12:06 | 1 | 20 |
B | 20.10.2021T16:12:07 | 2 | 68 |
B | 20.10.2021T16:12:07 | 2 | 68 |
B | 20.10.2021T16:12:07 | 2 | 68 |
B | 20.10.2021T16:12:08 | 3 | 41 |
B | 20.10.2021T16:12:08 | 3 | 41 |
B | 20.10.2021T16:12:08 | 3 | 41 |
B | 20.12.2021T16:12:09 | 1 | 20 |
B | 20.10.2021T16:12:11 | 3 | 41 |
B | 20.10.2021T16:12:11 | 3 | 41 |
The problem is that in some parts of the data the signal is not recorded, or I get garbage entries. Therefore I only want the first data per day that seems plausible. It should look something like this:
Product | Date | ID | Value |
---|---|---|---|
A | 20.10.2021T20:01:04 | 1 | 32 |
A | 20.10.2021T20:01:05 | 2 | 40 |
A | 20.10.2021T20:01:06 | 3 | 52 |
B | 20.10.2021T16:12:06 | 1 | 20 |
B | 20.10.2021T16:12:07 | 2 | 68 |
B | 20.10.2021T16:12:08 | 3 | 41 |
Given the size of my data, I'm also looking for an efficient way to solve this problem.
CodePudding user response:
You can try the Spark `groupby` function and then `first`:
from pyspark.sql import functions as F

# for every (ID, Value) group, take the first product and date Spark encounters
df.groupby("ID", "Value").agg(F.first("product"), F.first("Date")).show(10, False)
Output:
+---+-----+--------------+-------------------+
|ID |Value|first(product)|first(Date)        |
+---+-----+--------------+-------------------+
|2  |40   |A             |20.10.2021T20:01:05|
|2  |68   |B             |20.10.2021T16:12:07|
|3  |52   |A             |20.10.2021T20:01:03|
|1  |20   |B             |20.10.2021T16:12:06|
|1  |32   |A             |20.10.2021T20:01:01|
|3  |41   |B             |20.10.2021T16:12:08|
+---+-----+--------------+-------------------+
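If you also need the result split per product and per calendar day, as in your expected output, here is a minimal sketch of how the same groupby/first idea could be extended. It assumes the column names `Product`, `Date`, `ID`, `Value` from the question and that `Date` is a string in the pattern `dd.MM.yyyy'T'HH:mm:ss`; adjust the pattern to your real data.

from pyspark.sql import functions as F

# sketch: group per product, per calendar day and per ID,
# then take the first date/value Spark encounters in each group
result = (
    df.withColumn("Day", F.to_date("Date", "dd.MM.yyyy'T'HH:mm:ss"))
      .groupby("Product", "Day", "ID")
      .agg(F.first("Date").alias("Date"), F.first("Value").alias("Value"))
      .orderBy("Product", "ID")
)
result.show(truncate=False)

Note that `first()` without an explicit ordering does not guarantee the chronologically earliest row per group; if that matters, a window ordered by `Date` (see the sketch at the end of the next answer) gives a deterministic result.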
CodePudding user response:
In order to make your question more understandable for future readers, I have put together a small replica, i.e. a minimal reproducible example, of your dataframe with a total of 60 entries for the columns "Products", "Date", "ID" and "Values". This way you can see, for example, the behaviour of `groupby('ID', 'Products', 'Date').agg(F.first('Values'))` and other functions or ordering strategies that you can apply to the `ID` and `Values` columns. I hope it can help you.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd
import random
import datetime

spark = SparkSession.builder.getOrCreate()

# values are just random numbers between 1 and 99
value = [random.randint(1, random.randint(1, 99)) for _ in range(60)]
# data recorded for 60 entries, i.e. one minute
date = [datetime.datetime(2021, 10, 20, 20, 1, i).isoformat() for i in range(60)]
product = [random.choice(['A', 'B']) for _ in range(60)]
_id = [random.randint(1, 3) for _ in range(60)]

data = [[product[i], date[i], _id[i], value[i]] for i in range(60)]
columns = ["Products", "Date", "ID", "Values"]

# dataframe
df = spark.createDataFrame(pd.DataFrame(data=data, columns=columns))
>>> df.groupby('ID', 'Products', 'Date').agg(F.first('Values')).show()
+---+--------+-------------------+-------------+
| ID|Products|               Date|first(Values)|
+---+--------+-------------------+-------------+
|  1|       A|2021-10-20T20:01:35|           39|
|  2|       B|2021-10-20T20:01:00|            6|
|  3|       A|2021-10-20T20:01:20|           29|
|  3|       B|2021-10-20T20:01:45|           87|
|  3|       A|2021-10-20T20:01:47|           17|
|  3|       B|2021-10-20T20:01:03|            1|
|  1|       A|2021-10-20T20:01:38|           44|
|  2|       B|2021-10-20T20:01:46|           15|
|  3|       A|2021-10-20T20:01:51|           15|
|  2|       A|2021-10-20T20:01:25|           69|
|  3|       B|2021-10-20T20:01:56|           22|
|  2|       A|2021-10-20T20:01:40|           28|
|  1|       A|2021-10-20T20:01:53|           47|
|  1|       A|2021-10-20T20:01:32|            7|
|  2|       B|2021-10-20T20:01:05|           33|
|  2|       B|2021-10-20T20:01:59|            8|
|  1|       A|2021-10-20T20:01:15|           48|
|  3|       A|2021-10-20T20:01:26|           29|
|  1|       A|2021-10-20T20:01:09|           28|
|  3|       B|2021-10-20T20:01:29|           53|
+---+--------+-------------------+-------------+
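If the goal is the chronologically first row per product and ID rather than whatever row `first()` happens to pick, a window ordered by `Date` is a common alternative. A minimal sketch, assuming the column names `Products`, `ID` and `Date` from the replica above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# sketch: number the rows of every (Products, ID) group by ascending Date
# and keep only the first one, which is deterministic unlike first()
w = Window.partitionBy("Products", "ID").orderBy("Date")

earliest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
earliest.show(truncate=False)

This needs a single shuffle per window, which stays reasonably efficient on large data, and you can add the calendar day to `partitionBy` if you need one row per product, ID and day.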