I try to figure out an efficient way for a big dataset to deal with the following: The data contains multiple rows per day with specified codes (string) and ratings as columns. I try to create a new dataset with columns for all the strings in this list; string=['239', '345', '346'] and the new dataset should contain the mean value of the rating on each day. So that I get a time series of means of the specified numbers.
This would be a simple example dataset:
df1 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
'Code':['P:346 K,329 28', 'N2:345 P239', 'P:346 K2', 'E32 345', 'Q2_325', 'P;235 K345', '2W345', 'Pq-245 3460239'],
'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})
I try to achieve something similar to that table, but so far I wasn't able to get it done efficiently.
strings = ['239', '345', '346']
df2 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-02', '2021-01-03'],
'239':[8.5, 'NA', '5'],
'345':[8, 4, 'NA'],
'346':[7, 'NA', 5],})
Thank you very much for your help:)
CodePudding user response:
IIUC you can extract
the strings in the code
column and then pivot
:
print (df1.assign(Code=df1["Code"].str.extractall(f"({'|'.join(strings)})").groupby(level=0).agg(tuple))
.explode("Code")
.pivot_table(index="Date", columns="Code", values="Ratings", aggfunc="mean"))
Code 239 345 346
Date
2021-01-01 8.5 8.0 7.0
2021-01-02 NaN 4.0 NaN
2021-01-03 5.0 NaN 5.0