Home > Software engineering >  insert first row to each group in pandas dataframe
insert first row to each group in pandas dataframe

Time:12-13

I have a large csv file containing the historic prices of stocks. This is a small sample of it:

data = pd.DataFrame({'sym': {0: 'msft', 1: 'msft', 2: 'tsla', 3: 'tsla', 4: 'bac', 5: 'bac'}, 'date': {0: '12/7/2021', 1: '12/6/2021', 2: '12/7/2021', 3: '12/6/2021', 4: '12/7/2021', 5: '12/6/2021'}, 'high': {0: 11, 1: 13, 2: 898, 3: 900, 4: 12, 5: 13}})

Now on each day there will be an update for this data and I want to append them to the data above. The updates look like this:

update = pd.DataFrame({'sym': {0: 'msft', 1: 'tsla', 2: 'bac'}, 'date': {0: '12/8/2021', 1: '12/8/2021', 2: '12/8/2021'}, 'high': {0: 16, 1: 1000, 2: 14}})

What I want is the dataframe below:

result = pd.DataFrame({'sym': {0: 'msft', 1: 'msft', 2: 'msft', 3: 'tsla', 4: 'tsla', 5: 'tsla', 6: 'bac', 7: 'bac', 8: 'bac'}, 'date': {0: '12/8/2021', 1: '12/7/2021', 2: '12/6/2021', 3: '12/8/2021', 4: '12/7/2021', 5: '12/6/2021', 6: '12/8/2021', 7: '12/7/2021', 8: '12/6/2021'}, 'high': {0: 16, 1: 11, 2: 13, 3: 1000, 4: 898, 5: 900, 6: 14, 7: 12, 8: 13}})

My current approach is using this code:

data = data.append(update)
data = data.sort_values(by=['sym', 'date'])

By tweaking the above approach I can achieve what I want but since I have million rows in my database, I was wondering if there is a faster way other than using sort_values.

CodePudding user response:

result=pd.merge_ordered(data,update,on=['date','high'],left_by='sym',fill_method='ffill').drop(['sym_x','sym_y'],axis=1)

CodePudding user response:

IIUC, you want to keep the order of sym as it appears in data while sorting data in descending order. You can do that by converting sym-column to category and setting its category order by the order it appears in data. Then simply sort_values by ['sym','date']:

sorter = data['sym'].drop_duplicates()
out = data.append(update)
out['sym'] = out['sym'].astype("category").cat.set_categories(sorter)
out = out.sort_values(by=['sym','date'], ascending=['sym',False]).reset_index(drop=True)

Output:

    sym       date  high
0  msft  12/8/2021    16
1  msft  12/7/2021    11
2  msft  12/6/2021    13
3  tsla  12/8/2021  1000
4  tsla  12/7/2021   898
5  tsla  12/6/2021   900
6   bac  12/8/2021    14
7   bac  12/7/2021    12
8   bac  12/6/2021    13
  • Related