I have a dataframe with the ruling party of the US, but each row covers a range of years in the format yyyy-yyyy: 'democrat', and I want my final dataframe to have one row per year, like yyyy: 'democrat'. Instead of the range for each ruling party, I want one column with every year between 1945 and 2022 and another column containing the string 'democrat' or 'republican'.
This is what I've been trying:
us_gov = pd.read_csv('/Users/elgasko/Documents/NUMERO ARMAS NUCLEARES/presidents.csv')
us_gov = us_gov.iloc[31:,1:4]
us_gov=us_gov[['Years In Office','Party']]
us_gov.sort_values(by=['Years In Office'])
years=range(1945,2023)
us_gov_def=pd.DataFrame(years, columns=['Year'])
us_gov_def.set_index('Year', drop=True, append=False, inplace=True, verify_integrity=False)
us_gov_def.insert(0, column='Party', value=np.nan)
for i in range(len(us_gov)):
    string=us_gov.iloc[i]['Years In Office']
    inicio=string[0:4]
    inicio=int(float(inicio))
    final=string[5:9]
    final=int(float(final))
    for j in us_gov_def.index :
        if j in range(inicio,final):
            us_gov_def.loc['Party',us_gov.Party[i]]
#https://github.com/awhstin/Dataset-List/blob/master/presidents.csv
CodePudding user response:
One solution could be as follows:
import pandas as pd
data = {'Years In Office': ['1933-1945','1945-1953','1953-1961'],
        'Party': ['Democratic', 'Democratic', 'Republican']}
df = pd.DataFrame(data)
df['Years In Office'] = df['Years In Office'].str.split('-').explode()\
    .groupby(level=0).apply(lambda x: range(x.astype(int).min(),
                                            x.astype(int).max() + 1))
# ignore_index=True gives the 0..n-1 index shown in the output below
df = df.explode('Years In Office', ignore_index=True)
print(df)
    Years In Office       Party
0              1933  Democratic
1              1934  Democratic
2              1935  Democratic
3              1936  Democratic
4              1937  Democratic
5              1938  Democratic
6              1939  Democratic
7              1940  Democratic
8              1941  Democratic
9              1942  Democratic
10             1943  Democratic
11             1944  Democratic
12             1945  Democratic
13             1945  Democratic
14             1946  Democratic
15             1947  Democratic
16             1948  Democratic
17             1949  Democratic
18             1950  Democratic
19             1951  Democratic
20             1952  Democratic
21             1953  Democratic
22             1953  Republican
23             1954  Republican
24             1955  Republican
25             1956  Republican
26             1957  Republican
27             1958  Republican
28             1959  Republican
29             1960  Republican
30             1961  Republican
Notice that you will end up with duplicates:
print(df[df['Years In Office'].duplicated(keep=False)])
    Years In Office       Party
12             1945  Democratic
13             1945  Democratic
21             1953  Democratic
22             1953  Republican
This is because the periods overlap on end year and start year (e.g. '1933-1945', '1945-1953'). If you don't want this, you could join the parties for each year into a single row:
df = df.groupby('Years In Office', as_index=False).agg({'Party':', '.join})
print(df.loc[df['Years In Office'].isin([1945, 1953])])
    Years In Office                   Party
12             1945  Democratic, Democratic
20             1953  Democratic, Republican
Or, starting again from the exploded df, you could drop the duplicate rows only for years where the ruling party does not change. E.g.:
df = df[~df.duplicated()].reset_index(drop=True)
print(df.loc[df['Years In Office'].isin([1945, 1953])])
    Years In Office       Party
12             1945  Democratic
20             1953  Democratic
21             1953  Republican
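To get the one-row-per-year layout the question asks for, one option is to credit each transition year to the incoming party. This is only a minimal sketch under that assumption; `terms`, `bounds` and `by_year` are illustrative names, and the three rows are the same placeholder data as above, not the full CSV:

```python
import pandas as pd

# Placeholder data in the same shape as the example above (not the full CSV).
terms = pd.DataFrame({
    'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
    'Party': ['Democratic', 'Democratic', 'Republican'],
})

# Split 'yyyy-yyyy' into integer start (column 0) and end (column 1).
bounds = terms['Years In Office'].str.split('-', expand=True).astype(int)

# One list of years per term, end year included.
terms['Year'] = pd.Series(
    [list(range(start, end + 1)) for start, end in zip(bounds[0], bounds[1])],
    index=terms.index,
)
by_year = terms.explode('Year')[['Year', 'Party']]

# Assumption: a transition year belongs to the incoming party, so keep the last row.
by_year = by_year.drop_duplicates('Year', keep='last')
by_year['Year'] = by_year['Year'].astype(int)
by_year = by_year.set_index('Year')

print(by_year.loc[[1945, 1953]])
```

With the full dataset, slicing the result with `by_year.loc[1945:2022]` should give the range of years from the question, provided the terms in the CSV cover it.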
CodePudding user response:
This should work if you just swap out my placeholder df for your original one:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Index': [31, 32, 33],
                   'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
                   'Party': ['d', 'd', 'r']})
print(df.head())
frames = []  # one small DataFrame per term
for i, row in df.iterrows():
    start_date = int(row['Years In Office'][0:4])
    end_date = int(row['Years In Office'][-4:])
    # np.arange excludes end_date, so a transition year goes to the next term only
    years = np.arange(start_date, end_date, 1)
    to_add = pd.DataFrame({'Years In Office': years,
                           'Party': row['Party']})
    frames.append(to_add)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
final = pd.concat(frames, ignore_index=True)
print(final.head())
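Note that the loop above builds half-open ranges (the end year is excluded), which avoids duplicate transition years but also leaves out the final year of the last term. A vectorized sketch of the same idea, assuming you want that final year kept; `start`, `end` and `lengths` are illustrative names:

```python
import numpy as np
import pandas as pd

# Same placeholder data as above.
df = pd.DataFrame({'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
                   'Party': ['d', 'd', 'r']})

start = df['Years In Office'].str[:4].astype(int)
end = df['Years In Office'].str[-4:].astype(int)

# Number of years per term when the end year is excluded.
lengths = (end - start).tolist()
# Assumption: the last term should keep its final year, so extend it by one.
lengths[-1] += 1

final = pd.DataFrame({
    'Years In Office': np.concatenate(
        [np.arange(s, s + n) for s, n in zip(start, lengths)]),
    # Repeat each party once per year of its term.
    'Party': np.repeat(df['Party'].to_numpy(), lengths),
})

print(final.tail())
```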
CodePudding user response:
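Try this (with a year-end frequency each range stops the year before its end year, so adjacent terms don't overlap):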
df[['y1', 'y2']] = df['Years In Office'].str.split('-', expand=True)
df['y'] = df.apply(lambda x: [i.strftime('%Y') for i in pd.date_range(start=x['y1'], end=x['y2'], freq='y').tolist()], axis=1)
df = df.explode('y').drop(columns=['y1', 'y2', 'Years In Office'])
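The idea from the question (for each year, find the term that contains it) can also be written without nested loops. The following is a rough sketch using `pd.IntervalIndex`, assuming the terms do not overlap once each one is treated as [start, end); `bounds`, `terms`, `pos` and `result` are illustrative names and the data is the same placeholder as above:

```python
import numpy as np
import pandas as pd

# Placeholder data, same shape as in the answers above.
df = pd.DataFrame({'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
                   'Party': ['d', 'd', 'r']})

bounds = df['Years In Office'].str.split('-', expand=True).astype(int)

# closed='left' makes each term [start, end), so adjacent terms do not overlap.
terms = pd.IntervalIndex.from_arrays(bounds[0].to_numpy(),
                                     bounds[1].to_numpy(),
                                     closed='left')

# Years to look up; 1961 is left out because no half-open term contains it.
years = np.arange(1945, 1961)

# Position of the term containing each year; -1 means no match and must not
# be used for indexing, so check for it before the lookup.
pos = terms.get_indexer(years)
assert (pos >= 0).all()

result = pd.DataFrame({'Party': df['Party'].to_numpy()[pos]},
                      index=pd.Index(years, name='Year'))

print(result.loc[[1945, 1952, 1953]])
```

The result is already indexed by `Year` with a single `Party` column, which is the layout the question describes.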