How to create and assign indexes for each group in a dataframe

Time:04-30

[This is a DataFrame loaded with data from an Excel file]

   STUDY              Teacher       UPDATE_DATE
0   math                    A        2022-02-25
1   math                 A, C        2022-02-25
2   science                 D        2022-01-03
3   science                 A        2022-01-03
4   science              A, B        2022-02-20
5   entertainment           C        2022-01-01
6   entertainment        B, C        2022-03-02
7   technology              E        2021-09-01

Q1. I want to create an INDEX_NUMBER field that numbers the rows within each STUDY group, starting from 0. If a group has only one row, the field should be empty or null.

   STUDY              Teacher       UPDATE_DATE   INDEX_NUMBER
0   math                    A        2022-02-25           0
1   math                 A, C        2022-02-25           1
2   science                 D        2022-01-03           0
3   science                 A        2022-01-03           1
4   science              A, B        2022-02-20           2
5   entertainment           C        2022-01-01           0
6   entertainment        B, C        2022-03-02           1
7   technology              E        2021-09-01        null  

The code below doesn't work. I suspect the whole approach is going in the wrong direction.

for index in df.index:
    data_row = df.iloc[index, :]

    for sent_id in range(0, len(df.groupby(data_row['STUDY']).count())):
        df['INDEX_NUMBER'] = sent_id

Once the above is solved, I want to create a COMPARE_DATE field: set '2022-01-01' if INDEX_NUMBER is 0, '2022-04-10' for the last index in the group, and for any index in between pull the UPDATE_DATE of the next index. The final desired dataframe looks like this. Your help will improve the code quality. Thank you.

   STUDY              Teacher       UPDATE_DATE   INDEX_NUMBER COMPARE_DATE
0   math                    A        2022-02-25           0      2022-01-01
1   math                 A, C        2022-02-25           1      2022-04-10
2   science                 D        2022-01-03           0      2022-01-01
3   science                 A        2022-01-03           1      2022-02-20
4   science              A, B        2022-02-20           2      2022-04-10
5   entertainment           C        2022-01-01           0      2022-01-01
6   entertainment        B, C        2022-03-02           1      2022-04-10
7   technology              E        2021-09-01        null         null

CodePudding user response:

First of all, in order to test it, I will create the dataframe that you mention.

import pandas as pd

df = pd.DataFrame({
    'STUDY': ['math', 'math', 'science', 'science', 'science',
              'entertainment', 'entertainment', 'technology'],
    'Teacher': ['A', 'A, C', 'D', 'A', 'A, B', 'C', 'B, C', 'E'],
    'UPDATE_DATE': ['2022-02-25', '2022-02-25', '2022-01-03', '2022-01-03',
                    '2022-02-20', '2022-01-01', '2022-03-02', '2021-09-01'],
})

Now, for the first part of the challenge, the creation of the column named INDEX_NUMBER, the following will do the work

df['INDEX_NUMBER'] = df.groupby('STUDY')['STUDY'].transform(lambda x: x.rank(method='first'))

Printing the dataframe shows its current state:

print(df)

[Out]: 
           STUDY Teacher UPDATE_DATE  INDEX_NUMBER
0           math       A  2022-02-25           1.0
1           math    A, C  2022-02-25           2.0
2        science       D  2022-01-03           1.0
3        science       A  2022-01-03           2.0
4        science    A, B  2022-02-20           3.0
5  entertainment       C  2022-01-01           1.0
6  entertainment    B, C  2022-03-02           2.0
7     technology       E  2021-09-01           1.0

Note that INDEX_NUMBER starts at 1 here instead of 0, and that the single-row group (technology) gets 1 rather than null. The next part takes the 1-based numbering into account.

Now, for the last part of the question, my suggestion is to think more carefully about the requirements and, if needed, post a new question. There are a few things that you might want to consider (see this comment).

As things stand, one way to solve your particular case is the following function (not the most elegant, but it does the job):

def get_compare_date(x):
    if x['STUDY'] == 'math':
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        else:
            return '2022-04-10'
    elif x['STUDY'] == 'science':
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        elif x['INDEX_NUMBER'] == 2:
            return x['UPDATE_DATE']
        else:
            return '2022-04-10'
    elif x['STUDY'] == 'entertainment':
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        else:
            return '2022-04-10'
    else:
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        else:
            return '2022-04-10'

With that function, one can create the column "COMPARE_DATE":

df['COMPARE_DATE'] = df.apply(get_compare_date, axis=1)

This outputs the following:

print(df)

[Out]:
           STUDY Teacher UPDATE_DATE  INDEX_NUMBER COMPARE_DATE
0           math       A  2022-02-25             1   2022-01-01
1           math    A, C  2022-02-25             2   2022-04-10
2        science       D  2022-01-03             1   2022-01-01
3        science       A  2022-01-03             2   2022-01-03
4        science    A, B  2022-02-20             3   2022-04-10
5  entertainment       C  2022-01-01             1   2022-01-01
6  entertainment    B, C  2022-03-02             2   2022-04-10
7     technology       E  2021-09-01             1   2022-01-01

This may be enough for what you are looking for, but it depends on the exact requirements.
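The hard-coded branches in get_compare_date repeat the same rule for every STUDY value, so one possible generalization derives the rule from each row's position inside its group. The sketch below is an assumption about the intent, not part of the original answer: `position` and `size` are illustrative names, and it keeps this answer's behavior (1-based INDEX_NUMBER, middle rows keep their own UPDATE_DATE, single-row groups get '2022-01-01').

```python
import pandas as pd

df = pd.DataFrame({
    'STUDY': ['math', 'math', 'science', 'science', 'science',
              'entertainment', 'entertainment', 'technology'],
    'Teacher': ['A', 'A, C', 'D', 'A', 'A, B', 'C', 'B, C', 'E'],
    'UPDATE_DATE': ['2022-02-25', '2022-02-25', '2022-01-03', '2022-01-03',
                    '2022-02-20', '2022-01-01', '2022-03-02', '2021-09-01'],
})

grouped = df.groupby('STUDY')
position = grouped.cumcount()                # 0-based position within each group
size = grouped['STUDY'].transform('size')    # number of rows in the row's group

df['INDEX_NUMBER'] = position + 1            # 1-based, as in this answer

# Start from each row's own UPDATE_DATE, then overwrite the last and first rows.
# The "first" rule is applied last so that it wins for single-row groups.
compare = df['UPDATE_DATE'].mask(position == size - 1, '2022-04-10')
compare = compare.mask(position == 0, '2022-01-01')
df['COMPARE_DATE'] = compare

print(df)
```

Because nothing depends on the group names, this should keep working if new STUDY values appear in the Excel file.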

CodePudding user response:

You can add INDEX_NUMBER and COMPARE_DATE by applying a customized function:

def compare_date(group):
    index = range(len(group))
    if len(group) > 2:
        col = ['2022-01-01'] + group.iloc[1:len(group)-1]['UPDATE_DATE'].tolist() + ['2022-04-10']
    elif len(group) == 2:
        col = ['2022-01-01'] + ['2022-04-10']
    else:
        col = [pd.NA]
        index = [pd.NA]
    group = group.assign(INDEX_NUMBER=index, COMPARE_DATE=col)
    return group


# group_keys=False keeps the original row index on pandas 2.0 and later
out = df.groupby('STUDY', group_keys=False).apply(compare_date)
print(out)

           STUDY Teacher UPDATE_DATE INDEX_NUMBER COMPARE_DATE
0           math       A  2022-02-25            0   2022-01-01
1           math    A, C  2022-02-25            1   2022-04-10
2        science       D  2022-01-03            0   2022-01-01
3        science       A  2022-01-03            1   2022-01-03
4        science    A, B  2022-02-20            2   2022-04-10
5  entertainment       C  2022-01-01            0   2022-01-01
6  entertainment    B, C  2022-03-02            1   2022-04-10
7     technology       E  2021-09-01         <NA>         <NA>
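Note that the desired table in the question shows 2022-02-20 for row 3, which is the UPDATE_DATE of the next row in the group, while the output above uses the row's own date (2022-01-03). If the next row's date is what is wanted, a vectorized sketch along these lines might do it. The reading of the requirement is an assumption, and `position` and `size` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'STUDY': ['math', 'math', 'science', 'science', 'science',
              'entertainment', 'entertainment', 'technology'],
    'Teacher': ['A', 'A, C', 'D', 'A', 'A, B', 'C', 'B, C', 'E'],
    'UPDATE_DATE': ['2022-02-25', '2022-02-25', '2022-01-03', '2022-01-03',
                    '2022-02-20', '2022-01-01', '2022-03-02', '2021-09-01'],
})

grouped = df.groupby('STUDY')
position = grouped.cumcount()                # 0-based position within each group
size = grouped['STUDY'].transform('size')    # number of rows in the row's group

# UPDATE_DATE of the following row in the same group (missing on each last row).
compare = grouped['UPDATE_DATE'].shift(-1)
compare = compare.mask(position == size - 1, '2022-04-10')   # last row of the group
compare = compare.mask(position == 0, '2022-01-01')          # first row of the group
compare = compare.mask(size == 1)                            # single-row groups stay empty
df['COMPARE_DATE'] = compare

print(df)
```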

If you don't need the single-row group to be NA, there is an easier method with cumcount:

df['INDEX_NUMBER'] = df.groupby('STUDY').cumcount()
print(df)

           STUDY Teacher UPDATE_DATE  INDEX_NUMBER
0           math       A  2022-02-25             0
1           math    A, C  2022-02-25             1
2        science       D  2022-01-03             0
3        science       A  2022-01-03             1
4        science    A, B  2022-02-20             2
5  entertainment       C  2022-01-01             0
6  entertainment    B, C  2022-03-02             1
7     technology       E  2021-09-01             0
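If the single-row group should still be null, as in the question, cumcount can be combined with the group size. This is a minimal sketch, assuming a nullable integer column is acceptable; `size` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'STUDY': ['math', 'math', 'science', 'science', 'science',
              'entertainment', 'entertainment', 'technology'],
    'Teacher': ['A', 'A, C', 'D', 'A', 'A, B', 'C', 'B, C', 'E'],
    'UPDATE_DATE': ['2022-02-25', '2022-02-25', '2022-01-03', '2022-01-03',
                    '2022-02-20', '2022-01-01', '2022-03-02', '2021-09-01'],
})

grouped = df.groupby('STUDY')
size = grouped['STUDY'].transform('size')    # number of rows in the row's group

# Int64 (capital I) is pandas' nullable integer dtype, so the column can hold
# <NA> without being converted to float.
df['INDEX_NUMBER'] = grouped.cumcount().astype('Int64').mask(size == 1)

print(df)
```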