How to set a column by slicing values of other columns

Time:09-27

I have a dataframe with the ruling party of the US, but the column is in the format yyyy-yyyy: 'democrat', and I want my final dataframe to look like yyyy: 'democrat'. Instead of a range of years for each ruling party, I want one column with every year between 1945 and 2022 and another column containing the string 'democrat' or 'republican'.


This is what I've been trying:

us_gov = pd.read_csv('/Users/elgasko/Documents/NUMERO ARMAS NUCLEARES/presidents.csv')
us_gov = us_gov.iloc[31:,1:4]
us_gov=us_gov[['Years In Office','Party']]
us_gov.sort_values(by=['Years In Office'])
years=range(1945,2023)
us_gov_def=pd.DataFrame(years, columns=['Year'])
us_gov_def.set_index('Year', drop=True, append=False, inplace=True, verify_integrity=False)
us_gov_def.insert(0, column='Party', value=np.nan)

for i in range(len(us_gov)):
    string=us_gov.iloc[i]['Years In Office']
    inicio=string[0:4]
    inicio=int(float(inicio))
    final=string[5:9]
    final=int(float(final))
    for j in us_gov_def.index :
        if j in range(inicio,final):
            us_gov_def.loc['Party',us_gov.Party[i]]
            
#https://github.com/awhstin/Dataset-List/blob/master/presidents.csv

CodePudding user response:

One solution could be as follows:

import pandas as pd

data = {'Years In Office': ['1933-1945','1945-1953','1953-1961'],
      'Party': ['Democratic', 'Democratic', 'Republican']}

df = pd.DataFrame(data)

# split 'yyyy-yyyy' into start and end year, then build an inclusive range per row
df['Years In Office'] = df['Years In Office'].str.split('-').explode()\
    .groupby(level=0).apply(lambda x: range(x.astype(int).min(), 
                                            x.astype(int).max() + 1))
# ignore_index=True gives the fresh 0..n-1 index shown in the output below
df = df.explode('Years In Office', ignore_index=True)

print(df)

   Years In Office       Party
0             1933  Democratic
1             1934  Democratic
2             1935  Democratic
3             1936  Democratic
4             1937  Democratic
5             1938  Democratic
6             1939  Democratic
7             1940  Democratic
8             1941  Democratic
9             1942  Democratic
10            1943  Democratic
11            1944  Democratic
12            1945  Democratic
13            1945  Democratic
14            1946  Democratic
15            1947  Democratic
16            1948  Democratic
17            1949  Democratic
18            1950  Democratic
19            1951  Democratic
20            1952  Democratic
21            1953  Democratic
22            1953  Republican
23            1954  Republican
24            1955  Republican
25            1956  Republican
26            1957  Republican
27            1958  Republican
28            1959  Republican
29            1960  Republican
30            1961  Republican

Notice that you will end up with duplicates:

print(df[df['Years In Office'].duplicated(keep=False)])

   Years In Office       Party
12            1945  Democratic
13            1945  Democratic
21            1953  Democratic
22            1953  Republican

This is because the periods overlap on end year & start year (e.g. '1933-1945','1945-1953'). If you don't want this, you could add:

df = df.groupby('Years In Office', as_index=False).agg({'Party':', '.join})
print(df.loc[df['Years In Office'].isin([1945, 1953])])

   Years In Office                   Party
12            1945  Democratic, Democratic
20            1953  Democratic, Republican

Or, instead, you could drop the duplicate rows only for years where the ruling party does not change. E.g.:

df = df[~df.duplicated()].reset_index(drop=True)
print(df.loc[df['Years In Office'].isin([1945, 1953])])

   Years In Office       Party
12            1945  Democratic
20            1953  Democratic
21            1953  Republican
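
If the goal is exactly one row per year between 1945 and 2022, a further option is to keep only the last entry for each duplicated year, so that a handover year is attributed to the incoming party. The sketch below assumes the 'yyyy-yyyy' format shown in the question; the sample rows and the names `bounds`, `expanded` and `one_per_year` are illustrative and not taken from the original dataset.

```python
import pandas as pd

# Illustrative sample in the question's 'yyyy-yyyy' format (not the real CSV).
df = pd.DataFrame({
    'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
    'Party': ['Democratic', 'Democratic', 'Republican'],
})

# Split each range into integer start and end years (columns 0 and 1).
bounds = df['Years In Office'].str.split('-', expand=True).astype(int)

# One list of years per row (end year inclusive), then one row per year.
expanded = df.assign(
    Year=[list(range(start, end + 1)) for start, end in zip(bounds[0], bounds[1])]
).explode('Year', ignore_index=True)
expanded['Year'] = expanded['Year'].astype(int)

# Keep the last entry per year, so a handover year goes to the incoming party,
# then restrict to the years of interest.
one_per_year = (
    expanded.drop_duplicates('Year', keep='last')
    .loc[lambda d: d['Year'].between(1945, 2022), ['Year', 'Party']]
    .reset_index(drop=True)
)

print(one_per_year.head())
```

Using `keep='first'` instead would attribute the handover year to the outgoing party; which one is right depends on how you want to count those years.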

CodePudding user response:

This should work if you just swap out my placeholder df for your original one:

import pandas as pd
import numpy as np


df = pd.DataFrame({'Index': [31, 32, 33],
      'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
      'Party': ['d', 'd', 'r']})

print(df.head())

frames = []
for _, row in df.iterrows():
    # 'yyyy-yyyy': the first four characters are the start year, the last four the end year
    start_date = int(row['Years In Office'][:4])
    end_date = int(row['Years In Office'][-4:])

    # the end year is excluded, so a handover year is listed once, under the incoming party
    years = np.arange(start_date, end_date)
    frames.append(pd.DataFrame({'Years In Office': years, 'Party': row['Party']}))

# DataFrame.append was removed in pandas 2.0; concatenate the pieces once instead
final = pd.concat(frames, ignore_index=True)
print(final.head())
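
The question's own approach, filling a frame indexed by year, can also be made to work by assigning through `.loc` with a label slice rather than testing every year in a nested loop. This is only a minimal sketch on the same placeholder data; `by_year` is a hypothetical name. Later rows overwrite earlier ones, so a handover year ends up with the incoming party.

```python
import pandas as pd

# Placeholder data in the question's 'yyyy-yyyy' format.
df = pd.DataFrame({
    'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
    'Party': ['d', 'd', 'r'],
})

# Empty result frame indexed by year (1945..2022), as in the question.
by_year = pd.DataFrame(index=pd.RangeIndex(1945, 2023, name='Year'),
                       columns=['Party'])

for years, party in zip(df['Years In Office'], df['Party']):
    start, end = (int(y) for y in years.split('-'))
    # .loc label slices include both endpoints; because the index is sorted,
    # a bound that falls outside it simply selects nothing beyond the edge.
    by_year.loc[start:end, 'Party'] = party

print(by_year.head(10))
```

Years not covered by any row (here 1962 onwards) stay as NaN, which makes gaps in the source data easy to spot.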

CodePudding user response:

Try:

df[['y1', 'y2']] = df['Years In Office'].str.split('-', expand=True)

df['y'] = df.apply(lambda x: [i.strftime('%Y') for i in pd.date_range(start=x['y1'], end=x['y2'], freq='y').tolist()], axis=1)

df = df.explode('y').drop(columns=['y1', 'y2', 'Years In Office'])
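
A self-contained version of the snippet above might look like the following sketch. It reuses the sample rows from the earlier answers, which is an assumption since this answer does not show its input. It also swaps in the `'YS'` (year start) frequency so that the end year is included; as far as I can tell, the `'y'` alias used above is a year-end frequency and therefore stops one year short of `y2`.

```python
import pandas as pd

# Sample rows borrowed from the earlier answers (assumed input).
df = pd.DataFrame({
    'Years In Office': ['1933-1945', '1945-1953', '1953-1961'],
    'Party': ['Democratic', 'Democratic', 'Republican'],
})

# Split 'yyyy-yyyy' into start and end year strings.
df[['y1', 'y2']] = df['Years In Office'].str.split('-', expand=True)

# One timestamp per 1 January from y1 to y2 inclusive; keep only the year.
df['y'] = df.apply(
    lambda x: [d.year for d in pd.date_range(start=x['y1'], end=x['y2'], freq='YS')],
    axis=1,
)

# One row per year, dropping the helper columns.
df = df.explode('y', ignore_index=True).drop(columns=['y1', 'y2', 'Years In Office'])
df['y'] = df['y'].astype(int)

print(df.head())
```

As with the first answer, handover years such as 1945 and 1953 appear twice here and need to be de-duplicated separately.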