Start_Year End_Year Opp1 Opp2 Duration
1500 1501 ['A','B'] ['C','D'] 1
1500 1510 ['P','Q','R'] ['X','Y'] 10
1520 1520 ['A','X'] ['C'] 0
... .... ........ ..... ..
1809 1820 ['M'] ['F','H','Z'] 11
My dataset (CSV format) records armed wars fought between different entities (countries, states, and factions), represented by capital letters such as A, B, P, Q as lists in the Opp1 (opposition 1) and Opp2 columns. Start_Year and End_Year are the years in which the war started and ended. The Duration column is created by subtracting Start_Year from End_Year.
I want to replicate each row whose Duration is greater than 0 by the factor of the war's duration, i.e. if the duration is 6 years, add 6 extra copies of that row, increasing Start_Year by 1 and decreasing Duration by 1 in each successive copy, while keeping the values in the other columns the same. My desired output is like this:
I have no clue how to proceed with something like this, as I am a beginner in data science and analysis, so pardon me for not showing any attempted code here.
Start_Year End_Year Opp1 Opp2 Duration
1500 1501 ['A','B'] ['C','D'] 1
1501 1501 ['A','B'] ['C','D'] 0
1500 1510 ['P','Q','R'] ['X','Y'] 10
1501 1510 ['P','Q','R'] ['X','Y'] 9
1502 1510 ['P','Q','R'] ['X','Y'] 8
1503 1510 ['P','Q','R'] ['X','Y'] 7
1504 1510 ['P','Q','R'] ['X','Y'] 6
1505 1510 ['P','Q','R'] ['X','Y'] 5
.... .... ............. ........ ..
1510 1510 ['P','Q','R'] ['X','Y'] 0
1520 1520 ['A','X'] ['C'] 0
... .... ........ ..... ..
1809 1820 ['M'] ['F','H','Z'] 11
1810 1820 ['M'] ['F','H','Z'] 10
.... .... ..... .............. ..
Edit 1: Some example dataset: The Dataset
CodePudding user response:
You can use pandas.Index.repeat to repeat each row Duration + 1 times based on the Duration column (so the expanded durations run down to 0), and then use pandas.core.groupby.GroupBy.cumcount to add increasing cumulative offsets to the Start_Year column.
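A minimal sketch of those two building blocks on a toy frame (grouping on the index level here for brevity; the actual answer below groups on the opponent columns):

```python
import pandas as pd

# Toy frame: one row per event plus a repeat count 'n'.
toy = pd.DataFrame({'year': [2000, 2005], 'n': [3, 1]})

# Index.repeat duplicates the index labels n times each,
# and .loc materializes one row per repeated label.
rep = toy.loc[toy.index.repeat(toy['n'])].copy()

# cumcount() numbers the copies of each original row 0, 1, 2, ...
rep['offset'] = rep.groupby(level=0).cumcount()
rep['year'] = rep['year'] + rep['offset']
print(rep['year'].tolist())  # [2000, 2001, 2002, 2005]
```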
Reading data
import pandas as pd

data = [[1500, 1501, ['A','B'], ['C','D'], 1],
[1500, 1510, ['P','Q','R'], ['X','Y'], 10],
[1520, 1520, ['A','X'], ['C'], 0],
[1809, 1820, ['M'], ['F','H','Z'], 11]]
df = pd.DataFrame(data, columns = ['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])
Repeating the values
mask = df['Duration'].gt(0)
df1 = df[mask].copy()
df1 = df1.loc[df1.index.repeat(df1['Duration'] + 1)]
Assigning increasing values to each group
df1['Start_Year'] += df1[['Opp1', 'Opp2']].astype(str).groupby(['Opp1', 'Opp2']).cumcount()
The astype(str) is needed because the list values in Opp1/Opp2 are unhashable and cannot be grouped on directly. (This assumes each opponent pairing occurs only once in the data; grouping on the repeated index labels instead, df1.groupby(level=0).cumcount(), avoids that assumption.)
Generating output
df1['Duration'] = df1['End_Year'] - df1['Start_Year']
df = pd.concat([df1, df[~mask]]).sort_index().reset_index(drop=True)
This gives us the expected output:
df
Start_Year End_Year Opp1 Opp2 Duration
0 1500 1501 [A, B] [C, D] 1
1 1501 1501 [A, B] [C, D] 0
2 1500 1510 [P, Q, R] [X, Y] 10
3 1501 1510 [P, Q, R] [X, Y] 9
4 1502 1510 [P, Q, R] [X, Y] 8
5 1503 1510 [P, Q, R] [X, Y] 7
6 1504 1510 [P, Q, R] [X, Y] 6
7 1505 1510 [P, Q, R] [X, Y] 5
8 1506 1510 [P, Q, R] [X, Y] 4
9 1507 1510 [P, Q, R] [X, Y] 3
10 1508 1510 [P, Q, R] [X, Y] 2
11 1509 1510 [P, Q, R] [X, Y] 1
12 1510 1510 [P, Q, R] [X, Y] 0
13 1520 1520 [A, X] [C] 0
14 1809 1820 [M] [F, H, Z] 11
15 1810 1820 [M] [F, H, Z] 10
16 1811 1820 [M] [F, H, Z] 9
17 1812 1820 [M] [F, H, Z] 8
18 1813 1820 [M] [F, H, Z] 7
19 1814 1820 [M] [F, H, Z] 6
20 1815 1820 [M] [F, H, Z] 5
21 1816 1820 [M] [F, H, Z] 4
22 1817 1820 [M] [F, H, Z] 3
23 1818 1820 [M] [F, H, Z] 2
24 1819 1820 [M] [F, H, Z] 1
25 1820 1820 [M] [F, H, Z] 0
Alternatively
You can also work the other way around after repeating the values: first assign Duration as a decreasing cumulative count, and then recompute Start_Year from it.
df1['Duration'] = df1[['Opp1', 'Opp2']].astype(str).groupby(['Opp1', 'Opp2']).cumcount(ascending=False)
Since each war was repeated Duration + 1 times, the descending count runs from Duration down to 0, which is exactly the new Duration.
df1['Start_Year'] = df1['End_Year'] - df1['Duration']
df = pd.concat([df1, df[~mask]]).sort_index().reset_index(drop=True)
This gives you the same expected output as above.
You can reset the index using pandas.DataFrame.reset_index.
Summary:
Basically, what we have done here is duplicate rows conditionally, based on the value in the Duration column. We first set aside the Duration == 0 rows, which would otherwise be lost, then used pandas.Index.repeat to repeat the remaining rows Duration + 1 times, and replaced the Start_Year/Duration values with increasing/decreasing cumulative counts from pandas.core.groupby.GroupBy.cumcount. Finally, we concatenated both dataframes and sorted them on the index using pandas.DataFrame.sort_index: since pandas.Index.repeat also repeats the index labels, sorting on the index gives us the dataframe in the same order as the original.
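The steps above can be folded into a small helper; expand_wars is a hypothetical name, and grouping on the repeated index labels is my substitution for the opponent-column grouping used above:

```python
import pandas as pd

def expand_wars(df):
    """Expand each war into one row per year (hypothetical helper name).

    Rows with Duration > 0 are repeated Duration + 1 times; Start_Year
    advances by one per copy and Duration is recomputed from the years.
    """
    mask = df['Duration'].gt(0)
    rep = df[mask].loc[lambda d: d.index.repeat(d['Duration'] + 1)].copy()
    # Number the copies of each original row 0..Duration; grouping on the
    # repeated index labels avoids merging distinct wars with equal opponents.
    rep['Start_Year'] += rep.groupby(level=0).cumcount()
    rep['Duration'] = rep['End_Year'] - rep['Start_Year']
    # Re-attach the untouched Duration == 0 rows and restore original order.
    return pd.concat([rep, df[~mask]]).sort_index().reset_index(drop=True)

data = [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
        [1520, 1520, ['A', 'X'], ['C'], 0]]
df = pd.DataFrame(data, columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])
out = expand_wars(df)
print(out[['Start_Year', 'Duration']].values.tolist())
# [[1500, 1], [1501, 0], [1520, 0]]
```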
CodePudding user response:
Almost the same method as the other answer posted, but I think it's a bit simpler:
df2 = df.apply(lambda x: x.repeat(df['Duration'].iloc[x.index] + 1))
counts = df2.loc[df['Duration'] > 0].groupby(['Start_Year', 'End_Year']).cumcount()
df2.loc[df['Duration'] > 0, 'Duration'] -= counts
df2.loc[df['Duration'] > 0, 'Start_Year'] += counts
df2.reset_index(drop=True, inplace=True)
Note the mask is Duration > 0 so that one-year wars are expanded too; after the adjustment no duplicate rows remain, so a plain reset_index replaces the drop_duplicates call. (Grouping on Start_Year/End_Year assumes no two wars share both years.)
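As a sanity check, the apply-based repeat above should produce the same frame as repeating the index directly (the first answer's approach); a small equivalence sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Start_Year': [1500, 1520],
                   'End_Year': [1510, 1520],
                   'Duration': [10, 0]})

# apply-based repeat: each column is repeated Duration + 1 times per row.
df2 = df.apply(lambda x: x.repeat(df['Duration'].iloc[x.index] + 1))

# Repeating the index directly yields the same rows in the same order.
df3 = df.loc[df.index.repeat(df['Duration'] + 1)]

assert df2.equals(df3)
print(len(df3))  # 12 rows: 11 copies of the first war, 1 of the second
```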