Home > database >  Conditionally update values in a pandas dataframe while moving across successive columns
Conditionally update values in a pandas dataframe while moving across successive columns

Time:06-28

I have a pandas DataFrame of membership records that contains some entry errors (see below for a few examples). Some members were incorrectly identified as "Joined" when they were in fact "Renewal" and/or listed as "Joined" multiple times. I want to correct these errors by turning "Joined" into "Renewal" and vice-verse as appropirate based on the column year.

df = pd.DataFrame({2012: {0: 'Renewal', 1: 'Joined', 2: np.nan, 3: np.nan},
                   2013: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: np.nan},
                   2014: {0: 'Joined', 1: 'Joined', 2: 'Renewal', 3: 'Renewal'},
                   2015: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: 'Renewal'},})
df

    2012    2013    2014    2015
0   Renewal Renewal Joined  Renewal
1   Joined  Renewal Joined  Renewal
2   NaN     Joined  Renewal Joined
3   NaN     NaN     Renewal Renewal

This works but is inelegant and time consuming: Using np.where and a loop that updates the row I can make corrections column by column. For example to fix the duplicate "Joined" rows:

years = list(df.columns)
for col in df[years[1:]]:
    df[col] = np.where(((df[years[0]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])

for col in df[years[2:]]:
    df[col] = np.where(((df[years[1]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
    
for col in df[years[3:]]:
    df[col] = np.where(((df[years[2]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])

df
    2012    2013    2014    2015
0   Renewal Renewal Joined  Renewal
1   Joined  Renewal Renewal Renewal
2   NaN     Joined  Renewal Renewal 
3   NaN     NaN     Renewal Renewal

will get rid of any duplicating "Joined", but I'm not skilled enough to do it for all columns at once. Right now I'm manually updating for each successive column and there a million records and some 20 years of successive data.

I appreciate that there are other corrections neeeded, but if I could understand how to loop and correct this issue as a first step, I suspect I could use the same approach to fix the other difficulties.

Trying to loop the problem: I've tried several variations of a more complex loop but I get no response, an error, or accidently overwrite all the data.

For example,

for x in range(len(years)):
    for col in df[years[x 1]]:
        df[col] = np.where(((df[years[x]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])

Is there a way to update the records en masse where values associated with the first column are checked initially and then move on to the next column? It doesn't have to be a loop, I've just assumed that would be the solution.

Thank you for any suggestions / examples.

CodePudding user response:

If you want to replace all but the first "Joined" with "Renewal" - you can create a copy of the DataFrame with all "Joined" replaced with "Renewal", find the index of the first occurrence of "Joined" and only put that back in the copy

# Create a copy frame
df2 = df.applymap(lambda x: 'Renewal' if x == 'Joined' else x)
# Pull first joined from original
first_joined = df.apply(lambda x: x=='Joined', axis=1).apply(pd.Series.idxmax, axis=1) 
# Replace first "Joined" in the copy
for item in first_joined.items():
    if df.loc[item] == 'Joined':
        df2.loc[item] = 'Joined'

Output

# print(df2)
      2012     2013     2014     2015
0  Renewal  Renewal   Joined  Renewal
1   Joined  Renewal  Renewal  Renewal
2      NaN   Joined  Renewal  Renewal
3      NaN      NaN  Renewal  Renewal
  • Related