I have a dataframe that looks like this:
id reg version
1 54 1
2 54 1
3 54 1
4 54 2
5 54 3
6 54 3
7 55 1
The goal is to add two new columns, previous_version and next_version, holding the first id of the previous version and the first id of the next version within the same reg. In the example above, id = 1 has version = 1 and the next version starts at id = 4, so next_version is 4, while previous_version is null because there is no earlier version.
If there is no previous or next version, the column should be null.
What I have been able to get so far:
- the unique versions in the df.
- the first id of each version.
- a dictionary of the two above.
I am struggling to come up with logic to apply that dictionary to the dataframe so that I can populate the previous_version and next_version columns.
versions = df['version'].unique().tolist()
version_ids = df.groupby(['reg', 'version'])['id'].first().tolist()
Here is how the dataframe should look:
id reg version previous_version next_version
1 54 1 NULL 4
2 54 1 NULL 4
3 54 1 NULL 4
4 54 2 1 5
5 54 3 4 NULL
6 54 3 4 NULL
7 55 1 NULL NULL
What would be the best way to achieve this result for an arbitrary number of versions?
CodePudding user response:
You could do a nested groupby, use Series.shift(), then merge.
def _prev_and_next_version(sf):
    # Use `Int64` to avoid conversion to float. Not crucial.
    first_id_by_version = sf.groupby('version')['id'].first().astype('Int64')
    prev = first_id_by_version.shift(1).rename('previous_version')
    next_ = first_id_by_version.shift(-1).rename('next_version')
    sf_out = sf.merge(prev, on='version').merge(next_, on='version')
    return sf_out

df.groupby('reg').apply(_prev_and_next_version).reset_index(drop=True)
Result:
id reg version previous_version next_version
0 1 54 1 <NA> 4
1 2 54 1 <NA> 4
2 3 54 1 <NA> 4
3 4 54 2 1 5
4 5 54 3 4 <NA>
5 6 54 3 4 <NA>
6 7 55 1 <NA> <NA>
For context, first_id_by_version for each iteration:
version
1 1
2 4
3 5
Name: id, dtype: Int64
version
1 7
Name: id, dtype: Int64
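To make the shift step concrete, here is a minimal sketch that rebuilds the reg = 54 series shown above by hand (the values are copied from that output, not computed from df) and shifts it in both directions:

```python
import pandas as pd

# First id per version for reg == 54, copied from the output above.
first_id_by_version = pd.Series(
    [1, 4, 5],
    index=pd.Index([1, 2, 3], name="version"),
    name="id",
    dtype="Int64",
)

# shift(1) moves values down one row: each version gets the first id
# of the version before it, and the first version gets <NA>.
prev = first_id_by_version.shift(1)

# shift(-1) moves values up one row: each version gets the first id
# of the version after it, and the last version gets <NA>.
next_ = first_id_by_version.shift(-1)

print(pd.DataFrame({"previous_version": prev, "next_version": next_}))
```

Because shift works on row position and groupby sorts the versions by default, this should also behave sensibly when version numbers have gaps (for example 1, 2, 5).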
CodePudding user response:
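You can do this by creating a map with the combination (reg, version) as the key and the first id as the value. The keys to look up in that map are then (reg, version + 1) for the next version and (reg, version - 1) for the previous one. Note that this assumes the versions within a reg are consecutive integers.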
k1 = list(zip(df.reg, df.version.add(1)))
k2 = list(zip(df.reg, df.version.sub(1)))
d = df.drop_duplicates(['reg', 'version'], keep='first').set_index(['reg', 'version'])['id']
df['previous_version'] = pd.Series(k2).map(d).astype('Int64')
df['next_version'] = pd.Series(k1).map(d).astype('Int64')
print(df)
id reg version previous_version next_version
0 1 54 1 <NA> 4
1 2 54 1 <NA> 4
2 3 54 1 <NA> 4
3 4 54 2 1 5
4 5 54 3 4 <NA>
5 6 54 3 4 <NA>
6 7 55 1 <NA> <NA>
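For anyone who wants to run this end to end, here is a self-contained sketch of the same lookup idea. It is a variation on the code above, not a copy of it: it uses reindex on a MultiIndex instead of Series.map on tuples, and the helper name lookup is made up for illustration. Like the original, it assumes consecutive integer versions within each reg.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "reg": [54, 54, 54, 54, 54, 54, 55],
    "version": [1, 1, 1, 2, 3, 3, 1],
})

# First id for every (reg, version) pair, indexed by that pair.
first_ids = (
    df.drop_duplicates(["reg", "version"], keep="first")
      .set_index(["reg", "version"])["id"]
)

def lookup(offset):
    # Hypothetical helper: first id of (reg, version + offset) for each row.
    # Pairs that do not exist in first_ids come back as NaN, then <NA>.
    keys = pd.MultiIndex.from_arrays([df["reg"], df["version"] + offset])
    values = first_ids.reindex(keys).to_numpy()
    # Rebuild on df.index so the assignment does not depend on a default index.
    return pd.Series(values, index=df.index).astype("Int64")

df["previous_version"] = lookup(-1)
df["next_version"] = lookup(1)
print(df)
```

Rebuilding the result on df.index is a deliberate choice here: it keeps the assignment aligned even if df has been filtered or re-indexed beforehand.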