Replicate row in Pandas dataframe based on condition and change values for a specific column

Time:07-19

Start_Year   End_Year   Opp1              Opp2          Duration
1500         1501       ['A','B']        ['C','D']      1
1500         1510       ['P','Q','R']    ['X','Y']      10
1520         1520       ['A','X']        ['C']          0
...          ....        ........        .....          ..
1809         1820       ['M']            ['F','H','Z']  11

My dataset (in CSV format) records armed wars fought between different entities (countries, states, and factions, represented by capital letters A, B, P, Q, etc.), stored as lists in the Opp1 (opposition) and Opp2 columns. Start_Year and End_Year are the years in which each war started and ended. The Duration column is created by subtracting Start_Year from End_Year.

I am a beginner in data science and analysis and have no clue how to proceed with something like this, so pardon me for not showing any trial code here.

I want to replicate the rows with Duration greater than 1 by a factor of the war's duration, i.e. if the Duration is 6 years, replicate that row 6 times. In each successive replicated row, Duration should decrease by 1 and Start_Year should increase by 1, while the values in the other columns stay the same. My desired output looks like this:

Start_Year   End_Year   Opp1              Opp2          Duration
1500         1501       ['A','B']        ['C','D']      1
1501         1501       ['A','B']        ['C','D']      0
1500         1510       ['P','Q','R']    ['X','Y']      10
1501         1510       ['P','Q','R']    ['X','Y']      9
1502         1510       ['P','Q','R']    ['X','Y']      8
1503         1510       ['P','Q','R']    ['X','Y']      7
1504         1510       ['P','Q','R']    ['X','Y']      6
1505         1510       ['P','Q','R']    ['X','Y']      5
....         ....       .............    ........       ..
1510         1510       ['P','Q','R']    ['X','Y']      0
1520         1520       ['A','X']        ['C']          0
...          ....        ........        .....          ..
1809         1820       ['M']            ['F','H','Z']  11
1810         1820       ['M']            ['F','H','Z']  10
....         ....       .....            .............. ..

Edit 1: Some example dataset: The Dataset

CodePudding user response:

You can use pandas.Index.repeat to repeat each row Duration + 1 times, and then use pandas.core.groupby.GroupBy.cumcount to add an increasing counter (0, 1, 2, ...) to the Start_Year column within each group of repeated rows.
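To see the two building blocks in isolation, here is a minimal sketch on a toy frame. The frame and its column names (name, n, step) are made up for illustration and are not part of the asker's dataset.

```python
import pandas as pd

# Toy frame (hypothetical data, only for illustration)
toy = pd.DataFrame({"name": ["a", "b"], "n": [2, 3]})

# Index.repeat repeats each index label n times; .loc then selects those rows,
# so row 0 appears twice and row 1 appears three times.
repeated = toy.loc[toy.index.repeat(toy["n"])]

# cumcount numbers the rows inside each group 0, 1, 2, ...
# Grouping on the (now repeated) index label keeps each original row separate.
repeated = repeated.assign(step=repeated.groupby(level=0).cumcount())

print(repeated)
```

The step column restarts at 0 for every original row, which is exactly the offset that gets added to Start_Year in the steps below.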

Reading data

import pandas as pd

data = [[1500, 1501, ['A','B'], ['C','D'], 1],
        [1500, 1510, ['P','Q','R'], ['X','Y'], 10],
        [1520, 1520, ['A','X'], ['C'], 0],
        [1809, 1820, ['M'], ['F','H','Z'], 11]]
df = pd.DataFrame(data, columns = ['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])
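If the data comes from a CSV file, as in the question, the list columns will probably arrive as strings rather than Python lists. The sketch below is one possible way to parse them. It assumes the lists are written as quoted Python-style literals, which is a guess about the file layout; the csv_text sample and the df_csv name are made up for illustration.

```python
import ast
import io

import pandas as pd

# Hypothetical CSV content; the real file layout may differ.
csv_text = '''Start_Year,End_Year,Opp1,Opp2,Duration
1500,1501,"['A','B']","['C','D']",1
1520,1520,"['A','X']","['C']",0
'''

# ast.literal_eval safely turns the text "['A','B']" into the list ['A', 'B'].
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    converters={"Opp1": ast.literal_eval, "Opp2": ast.literal_eval},
)

print(df_csv.loc[0, "Opp1"])
```

With a real file, io.StringIO(csv_text) would be replaced by the file path.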

Repeating the values

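# Only rows with Duration > 0 need to be replicated
mask = df['Duration'].gt(0)
df1 = df[mask].copy()
# Repeat each row Duration + 1 times (one row per year, Start_Year through End_Year)
df1 = df1.loc[df1.index.repeat(df1['Duration'] + 1)]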

Assigning increasing values to each group

df1['Start_Year'] += df1[['Opp1', 'Opp2']].astype(str).groupby(['Opp1', 'Opp2']).cumcount()

Generating output

df1['Duration'] = df1['End_Year'] - df1['Start_Year']
# A stable sort on the repeated index restores the original row order
df = pd.concat([df1, df[~mask]]).sort_index(kind='mergesort').reset_index(drop=True)

This gives us the expected output:

df
    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0

Alternatively

After repeating the values, you can also go the other way around: first assign Duration as a decreasing cumulative count, and then recalculate Start_Year from it.

# Counts down Duration, ..., 1, 0 within each group of Duration + 1 repeated rows
df1['Duration'] = df1[['Opp1', 'Opp2']].astype(str).groupby(['Opp1', 'Opp2']).cumcount(ascending=False)
df1['Start_Year'] = df1['End_Year'] - df1['Duration']
df = pd.concat([df1, df[~mask]]).sort_index(kind='mergesort').reset_index(drop=True)

Output:

This gives the same output as shown above.

You can reset the index using pandas.DataFrame.reset_index.

Summary:

Basically, we duplicated rows based on the value in the Duration column, subject to a condition.

We first set aside the rows with Duration 0, which would vanish if pandas.Index.repeat were applied to them with a repeat count of 0. We then replicated the rows with Duration > 0 and replaced their column values with increasing or decreasing cumulative counts using pandas.core.groupby.GroupBy.cumcount. Finally, we concatenated both dataframes and sorted them on the index using pandas.DataFrame.sort_index. Since pandas.Index.repeat repeats the index labels along with the rows, sorting on the index puts the rows back in the same order as in the original dataframe.
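The steps summarized above can also be folded into one small function. This is only a sketch of a variant, not the exact code from the answer: it repeats every row Duration + 1 times, so rows with Duration 0 are kept once and no mask or concat is needed, and it groups on the repeated index label instead of the stringified Opp1/Opp2 lists, so two wars between the same opponents stay separate. It assumes the dataframe has a unique index and non-negative integer Duration values. The names expand_by_year, wars and result are hypothetical.

```python
import pandas as pd

def expand_by_year(df):
    """Return one row per year of each war, from Start_Year through End_Year."""
    # Repeat each row Duration + 1 times; Duration 0 gives a repeat count of 1.
    out = df.loc[df.index.repeat(df["Duration"] + 1)].copy()
    # Offset inside each original row: 0, 1, ..., Duration.
    offset = out.groupby(level=0).cumcount()
    out["Start_Year"] = out["Start_Year"] + offset
    out["Duration"] = out["End_Year"] - out["Start_Year"]
    return out.reset_index(drop=True)

# Small sample in the same shape as the question's data
wars = pd.DataFrame(
    [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
     [1520, 1520, ['A', 'X'], ['C'], 0],
     [1809, 1812, ['M'], ['F', 'H', 'Z'], 3]],
    columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'],
)

result = expand_by_year(wars)
print(result)
```

Because the rows are never split into two dataframes, the original order is preserved without sorting on the index.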

CodePudding user response:

Almost the same method as the other answer posted, but I think it's a bit simplified:

df2 = df.apply(lambda x: x.repeat(df['Duration'].iloc[x.index] + 1))
counts = df2.loc[df.Duration>1].groupby(['Start_Year', 'End_Year']).cumcount()
df2.loc[df.Duration>1,'Duration'] -= counts
df2.loc[df.Duration>1,'Start_Year'] += counts
df2.drop_duplicates(subset=['Start_Year', 'Duration'], ignore_index=True, inplace=True)
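One thing to watch with the drop_duplicates step: it only looks at Start_Year and Duration, so two different wars that happen to share both values would be collapsed into one row. The sketch below shows this on made-up data; the war_id column is a hypothetical identifier that is not in the original dataset.

```python
import pandas as pd

# Two different wars (hypothetical data) that share Start_Year and Duration
wars = pd.DataFrame({
    "war_id": [0, 1],            # hypothetical identifier column
    "Start_Year": [1520, 1520],
    "End_Year": [1520, 1520],
    "Duration": [0, 0],
})

# Same subset as in the answer: the second war is treated as a duplicate.
collapsed = wars.drop_duplicates(subset=["Start_Year", "Duration"], ignore_index=True)

# Including an identifier in the subset keeps both wars.
kept = wars.drop_duplicates(subset=["war_id", "Start_Year", "Duration"], ignore_index=True)

print(len(collapsed), len(kept))
```

If the real data can contain such overlaps, adding an identifying column to the subset is one way around it.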