Home > Enterprise >  Create new dataset with columns of specific strings with daily average
Create new dataset with columns of specific strings with daily average

Time:10-23

I try to figure out an efficient way for a big dataset to deal with the following: The data contains multiple rows per day with specified codes (string) and ratings as columns. I try to create a new dataset with columns for all the strings in this list; string=['239', '345', '346'] and the new dataset should contain the mean value of the rating on each day. So that I get a time series of means of the specified numbers.

This would be a simple example dataset:

df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Code':['P:346 K,329 28', 'N2:345 P239', 'P:346 K2', 'E32 345', 'Q2_325', 'P;235 K345', '2W345', 'Pq-245 3460239'], 
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})  

I try to achieve something similar to that table, but so far I wasn't able to get it done efficiently.

strings = ['239', '345', '346']
df2 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03'],
    '239':[8.5, 'NA', '5'], 
    '345':[8, 4, 'NA'],
    '346':[7, 'NA', 5],})

Thank you very much for your help:)

CodePudding user response:

IIUC you can extract the strings in the code column and then pivot:

print (df1.assign(Code=df1["Code"].str.extractall(f"({'|'.join(strings)})").groupby(level=0).agg(tuple))
          .explode("Code")
          .pivot_table(index="Date", columns="Code", values="Ratings", aggfunc="mean"))

Code        239  345  346
Date                     
2021-01-01  8.5  8.0  7.0
2021-01-02  NaN  4.0  NaN
2021-01-03  5.0  NaN  5.0
  • Related