I tried to use this line of code :
import pandasql as ps
df2["transaction per 30 minutes"] = ps.sqldf("SELECT COUNT(*) FROM df OVER (PARTITION BY 'Card Number' ORDER BY 'Date' RANGE BETWEEN '30 minutes' PRECEDING AND '0 minutes' FOLLOWING)", locals())
df2["avg_transaction_30"] = ps.sqldf("SELECT AVG(Amount) FROM df OVER (PARTITION BY 'Card Number' ORDER BY Date RANGE BETWEEN '30 minutes' PRECEDING AND '0 minutes' FOLLOWING)")
NB : delete 'locals()' or replace by 'globals()' don't solve the problem.
But I have this error :
PandaSQLException: (sqlite3.OperationalError) near "(": syntax error
[SQL: SELECT COUNT(*) FROM df2 OVER (PARTITION BY 'Card Number' ORDER BY date RANGE BETWEEN '30 minutes' PRECEDING AND '0 minutes' FOLLOWING)]
(Background on this error at: )
http://sqlalche.me/e/14/e3q8 (Same error with Average)
I load my dataframe like this :
df2 = pd.read_csv('df.csv')
My file is in the same directory as my Jupiter notebook file. I follow the link, but I still don't understand what's wrong.
And this line work :
ps.sqldf("SELECT * FROM df2")
So I guess is not a problem with my pandasql
My initial goal with this code is :
I want to know how many transactions a card number do for every 30 min interval, and also calculate in average how many transactions a client do per 30 minutes.
For example, I want to have something like (order by Card Number) :
Card Number | interval | #transactions_30min | AVG_transactions_30min |
---|---|---|---|
536518******2108 | 2015-05-11 00:00:01 - 00:30:00 | 2 | 5 |
536518******2108 | 2015-05-11 00:30:01 - 01:00:00 | 8 | 5 |
536518******2197 | 2015-05-11 00:00:01 - 00:30:00 | 6 | 8 |
536518******2197 | 2015-05-11 00:30:01 - 01:00:00 | 10 | 8 |
My database look like this :
id | Card Number | Date | Amount | Fraud |
---|---|---|---|---|
0 | 536518******2108 | 2015-05-01 00:01:54 | 36.54 | No |
1 | 536518******2108 | 2015-05-01 00:03:46 | 36.54 | No |
2 | 453211******1239 | 2015-05-01 00:08:50 | 69.00 | No |
3 | 548827******1705 | 2015-05-01 00:27:00 | 193.43 | No |
4 | 531681******9778 | 2015-05-01 01:32:46 | 132.00 | No |
5 | 515117******4107 | 2015-05-01 02:10:26 | 161.00 | No |
6 | 432032******9111 | 2015-05-01 08:09:15 | 110.00 | No |
7 | 544540******7141 | 2015-05-01 08:30:16 | 159.50 | No |
8 | 554906******0358 | 2015-05-01 09:13:51 | 126.50 | Yes |
9 | 554906******0358 | 2015-05-01 09:15:28 | 126.50 | Yes |
So is it possible to use pandasql in my case ? If yes, how can I correct my request to make it work ?
CodePudding user response:
If you don't mind using pandas for all calculations, here is one approach:
df['Date'] = pd.to_datetime(df['Date'])
g = df.groupby(["Card Number", pd.Grouper(key='Date', freq='30min')], sort=False)
df_out = g['Amount'].agg(['count', 'mean']).add_prefix('transactions30min_').reset_index()
Result:
print(df_out)
Card Number Date transactions30min_count transactions30min_mean
0 536518******2108 2015-05-01 00:00:00 2 36.54
1 453211******1239 2015-05-01 00:00:00 1 69.00
2 548827******1705 2015-05-01 00:00:00 1 193.43
3 531681******9778 2015-05-01 01:30:00 1 132.00
4 515117******4107 2015-05-01 02:00:00 1 161.00
5 432032******9111 2015-05-01 08:00:00 1 110.00
6 544540******7141 2015-05-01 08:30:00 1 159.50
7 554906******0358 2015-05-01 09:00:00 2 126.50