How do I create new columns based on the first occurrence in the previous group?

Time:12-17

I have a dataframe that looks like below

id   reg   version
 1   54       1
 2   54       1
 3   54       1
 4   54       2
 5   54       3
 6   54       3
 7   55       1

The goal is to add two new columns, previous_version and next_version, that hold ids: the first id of the previous version and the first id of the next version within the same reg. In the example above, id = 1 has version = 1 and the next version starts at id = 4, so next_version is 4; previous_version is null because there is no earlier version.

If there is no previous or next version, the column should be null.

What I have been able to do so far:

  • get the unique versions in the df.
  • get the first id of each version.
  • build a dictionary of the above two.

I am struggling to come up with logic that applies that dictionary to the dataframe so I can populate the previous_version and next_version columns.

versions = df['version'].unique().tolist()
version_ids = df.groupby(['reg', 'version'])['id'].first().tolist()
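
For reference, the dictionary mentioned above could be built roughly like this. This is only a sketch that assumes the sample data from the question; the name first_id is illustrative and does not come from the original post.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "reg": [54, 54, 54, 54, 54, 54, 55],
    "version": [1, 1, 1, 2, 3, 3, 1],
})

# First id of every (reg, version) pair, as a plain dict keyed by tuples.
first_id = df.groupby(["reg", "version"])["id"].first().to_dict()
print(first_id)
```

Keying on (reg, version) rather than on version alone keeps the lookup from mixing ids that belong to different reg values.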

Here is what the dataframe should look like:

id   reg   version   previous_version   next_version
 1   54       1            NULL             4
 2   54       1            NULL             4
 3   54       1            NULL             4
 4   54       2            1                5
 5   54       3            4                NULL
 6   54       3            4                NULL
 7   55       1            NULL             NULL

What would be the best way to achieve this result for an arbitrary number of versions?

CodePudding user response:

You could do a nested groupby, use Series.shift(), then merge.

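def _prev_and_next_version(sf):
    # sf holds the rows of a single reg.
    # First id of each version, ordered by version (groupby sorts its keys).
    # Use `Int64` to avoid conversion to float. Not crucial.
    first_id_by_version = sf.groupby('version')['id'].first().astype('Int64')
    # Shift by one version in each direction; the edges become <NA>.
    prev = first_id_by_version.shift(1).rename('previous_version')
    next_ = first_id_by_version.shift(-1).rename('next_version')
    # Attach both lookups to every row of the matching version.
    sf_out = sf.merge(prev, on='version').merge(next_, on='version')
    return sf_out

df.groupby('reg').apply(_prev_and_next_version).reset_index(drop=True)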

Result:

   id  reg  version  previous_version  next_version
0   1   54        1              <NA>             4
1   2   54        1              <NA>             4
2   3   54        1              <NA>             4
3   4   54        2                 1             5
4   5   54        3                 4          <NA>
5   6   54        3                 4          <NA>
6   7   55        1              <NA>          <NA>

For context, this is first_id_by_version for each group (reg 54, then reg 55):

version
1    1
2    4
3    5
Name: id, dtype: Int64

version
1    7
Name: id, dtype: Int64
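
The same idea can also be written without groupby.apply. The following is a minimal sketch, assuming the sample frame from the question; the names firsts and result are illustrative. It takes the first id per (reg, version), shifts those ids within each reg, and merges the two new columns back onto the original rows.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "reg": [54, 54, 54, 54, 54, 54, 55],
    "version": [1, 1, 1, 2, 3, 3, 1],
})

# One row per (reg, version), carrying the first id seen for that pair,
# ordered by version inside each reg.
firsts = (
    df[["reg", "version", "id"]]
    .drop_duplicates(["reg", "version"])
    .sort_values(["reg", "version"])
)

# Shift within each reg so ids never leak across reg boundaries.
ids_by_reg = firsts.groupby("reg")["id"]
firsts = firsts.assign(
    previous_version=ids_by_reg.shift(1).astype("Int64"),
    next_version=ids_by_reg.shift(-1).astype("Int64"),
)

# Bring the two new columns back to every original row.
result = df.merge(firsts.drop(columns="id"), on=["reg", "version"], how="left")
print(result)
```

Because the neighbours are found by position after sorting, this variant does not require version numbers to be consecutive.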

CodePudding user response:

You can do this by creating a map with the combination (reg, version) as key and the first id as value. The keys to look up in that map are then (reg, version + 1) for the next version and (reg, version - 1) for the previous version.

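import pandas as pd

# Lookup keys for the neighbouring versions of every row.
k1 = list(zip(df.reg, df.version.add(1)))  # (reg, version + 1)
k2 = list(zip(df.reg, df.version.sub(1)))  # (reg, version - 1)
# First id of each (reg, version) pair, indexed by that pair.
d = df.drop_duplicates(['reg', 'version'], keep='first').set_index(['reg', 'version'])['id']
# Reuse df.index so the assignment aligns even if df has a non-default index.
df['previous_version'] = pd.Series(k2, index=df.index).map(d).astype('Int64')
df['next_version'] = pd.Series(k1, index=df.index).map(d).astype('Int64')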

print(df)

   id  reg  version  previous_version  next_version
0   1   54        1              <NA>             4
1   2   54        1              <NA>             4
2   3   54        1              <NA>             4
3   4   54        2                 1             5
4   5   54        3                 4          <NA>
5   6   54        3                 4          <NA>
6   7   55        1              <NA>          <NA>
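
One caveat with the offset-based keys: they assume version numbers are consecutive integers within each reg. The sketch below illustrates this with hypothetical data (not from the question) in which version 3 is missing; a plain dict stands in for the lookup series, and next_by_offset is an illustrative name.

```python
import pandas as pd

# Hypothetical data with a gap: reg 54 has versions 1, 2 and 4, but no 3.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "reg": [54, 54, 54, 54],
    "version": [1, 2, 2, 4],
})

# First id per (reg, version), as a dict keyed by (reg, version) tuples.
first_id = (
    df.drop_duplicates(["reg", "version"])
    .set_index(["reg", "version"])["id"]
    .to_dict()
)

# Offset-based lookup: rows with version 2 ask for version 3, which is absent.
next_by_offset = [
    first_id.get((r, v + 1)) for r, v in zip(df["reg"], df["version"])
]
print(next_by_offset)
```

Here the rows with version 2 get no next_version even though version 4 exists, so data with gaps needs a position-based approach such as shifting within each reg.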
   