I'm currently facing a problem that I can't solve. I spent 6 hours trying to find a solution, but in the end nothing worked for me, probably because I'm not using the right tools. (I'm using Python, pandas, numpy.)
Imagine that I have 2 dataframes that are the same except that, for each cluster, the second one has 5 fewer days than the first. Here "day" and "cluster" are column names, both sorted, and each cluster has a different number of days.
Graphically the situation is: https://i.stack.imgur.com/w8wDk.jpg
Now I want to merge / concatenate in such a way that my dataframes are not merged on index. I want the first rows of each cluster in the second dataframe to match the first rows of the same cluster in the first dataframe. Consequently, this will produce NA values in the merged frame for the last 5 rows of each cluster, where the second dataframe has no data.
Graphically the situation will be: https://i.stack.imgur.com/nFWHa.jpg
How can I proceed to fix this situation?
Thank you in advance for any kind of help; I've already tried lots of things and I'm really struggling with this problem.
CodePudding user response:
I admit that this is not the prettiest solution, but at least it works. Assuming that the taller and shorter frames are `f1` and `f2`, respectively, the steps are:
- Create a "fake" frame `f` with the same height as `f1` but with no `cluster` column.
- Gradually fill `f` at the relevant indices taken from `f1` with data from `f2`.
- Concatenate the (partially filled) `f` with `f1`.

To demonstrate this idea, let's say that the two frames are
>>> f1
cluster day A B
0 2 0 1 2
1 2 1 3 4
2 1 2 5 6
3 1 3 7 8
>>> f2
cluster day A B
0 1 5 10 20
1 1 9 30 40
2 2 6 50 60
The code is as follows (where `np` is `numpy`):
import numpy as np
import pandas as pd

f = f1.drop('cluster', axis=1).copy() # the fake frame
f[:] = np.nan
f1g = f1.groupby('cluster') # Allow for a second indexing way using cluster id
f2g = f2.groupby('cluster')
clusters1 = f1g.groups.keys()
clusters2 = f2g.groups.keys()
for cluster in (clusters1 & clusters2):
idx1 = f1g.get_group(cluster).index # indices of entries of the current cluster in f1
idx2 = f2g.get_group(cluster).index # indices of entries of the current cluster in f2
m = len(idx2)
f.loc[idx1[0:m]] = f2.loc[idx2[0:m], ['day', 'A', 'B']].to_numpy() # fill the first m entries of current cluster in f with data from f2
And the result after concatenating the fake `f` and the taller `f1`:
>>> pd.concat([f1, f], axis=1)
cluster day A B day A B
0 2 0 1 2 6.0 50.0 60.0
1 2 1 3 4 NaN NaN NaN
2 1 2 5 6 5.0 10.0 20.0
3 1 3 7 8 9.0 30.0 40.0
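For anyone who wants to paste and run the above, here is the same code assembled into a self-contained script, with the imports and the toy frames from this answer included:

```python
import numpy as np
import pandas as pd

# The two toy frames from the example above
f1 = pd.DataFrame({'cluster': [2, 2, 1, 1], 'day': [0, 1, 2, 3],
                   'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]})
f2 = pd.DataFrame({'cluster': [1, 1, 2], 'day': [5, 9, 6],
                   'A': [10, 30, 50], 'B': [20, 40, 60]})

# Fake frame: same height as f1, no cluster column, all NaN
f = f1.drop('cluster', axis=1).copy()
f[:] = np.nan

f1g = f1.groupby('cluster')
f2g = f2.groupby('cluster')
for cluster in (f1g.groups.keys() & f2g.groups.keys()):
    idx1 = f1g.get_group(cluster).index  # rows of this cluster in f1
    idx2 = f2g.get_group(cluster).index  # rows of this cluster in f2
    m = len(idx2)
    # Fill the first m rows of this cluster in f with data from f2
    f.loc[idx1[:m]] = f2.loc[idx2[:m], ['day', 'A', 'B']].to_numpy()

result = pd.concat([f1, f], axis=1)
print(result)
```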
Final note: You can obtain `idx1` and `idx2` in the for loop using ways other than `groupby`, but I think the latter is one of the fastest ways to do this.
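As a loop-free alternative: ranking rows within each cluster with `groupby(...).cumcount()` and then doing a left merge on `(cluster, rank)` produces the same alignment, with NaN filled in automatically for unmatched rows. A sketch using the toy frames above (the `rank` helper column and the `_2` suffix are my choices, not part of the original answer):

```python
import numpy as np
import pandas as pd

f1 = pd.DataFrame({'cluster': [2, 2, 1, 1], 'day': [0, 1, 2, 3],
                   'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]})
f2 = pd.DataFrame({'cluster': [1, 1, 2], 'day': [5, 9, 6],
                   'A': [10, 30, 50], 'B': [20, 40, 60]})

# Position of each row within its cluster: 0, 1, 2, ...
f1['rank'] = f1.groupby('cluster').cumcount()
f2['rank'] = f2.groupby('cluster').cumcount()

# Left merge: rows of f1 with no counterpart in f2 get NaN
merged = f1.merge(f2, on=['cluster', 'rank'], how='left',
                  suffixes=('', '_2')).drop(columns='rank')
print(merged)
```

This keeps the row order of `f1` and avoids building the intermediate fake frame.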