I have a pandas DataFrame of membership records that contains some entry errors (see below for a few examples). Some members were incorrectly identified as "Joined" when they were in fact "Renewal" and/or listed as "Joined" multiple times. I want to correct these errors by turning "Joined" into "Renewal" and vice-verse as appropirate based on the column year.
df = pd.DataFrame({2012: {0: 'Renewal', 1: 'Joined', 2: np.nan, 3: np.nan},
2013: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: np.nan},
2014: {0: 'Joined', 1: 'Joined', 2: 'Renewal', 3: 'Renewal'},
2015: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: 'Renewal'},})
df
2012 2013 2014 2015
0 Renewal Renewal Joined Renewal
1 Joined Renewal Joined Renewal
2 NaN Joined Renewal Joined
3 NaN NaN Renewal Renewal
This works but is inelegant and time consuming: Using np.where and a loop that updates the row I can make corrections column by column. For example to fix the duplicate "Joined" rows:
years = list(df.columns)
for col in df[years[1:]]:
df[col] = np.where(((df[years[0]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
for col in df[years[2:]]:
df[col] = np.where(((df[years[1]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
for col in df[years[3:]]:
df[col] = np.where(((df[years[2]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
df
2012 2013 2014 2015
0 Renewal Renewal Joined Renewal
1 Joined Renewal Renewal Renewal
2 NaN Joined Renewal Renewal
3 NaN NaN Renewal Renewal
will get rid of any duplicating "Joined", but I'm not skilled enough to do it for all columns at once. Right now I'm manually updating for each successive column and there a million records and some 20 years of successive data.
I appreciate that there are other corrections neeeded, but if I could understand how to loop and correct this issue as a first step, I suspect I could use the same approach to fix the other difficulties.
Trying to loop the problem: I've tried several variations of a more complex loop but I get no response, an error, or accidently overwrite all the data.
For example,
for x in range(len(years)):
for col in df[years[x 1]]:
df[col] = np.where(((df[years[x]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
Is there a way to update the records en masse where values associated with the first column are checked initially and then move on to the next column? It doesn't have to be a loop, I've just assumed that would be the solution.
Thank you for any suggestions / examples.
CodePudding user response:
If you want to replace all but the first "Joined" with "Renewal" - you can create a copy of the DataFrame
with all "Joined" replaced with "Renewal", find the index of the first occurrence of "Joined" and only put that back in the copy
# Create a copy frame
df2 = df.applymap(lambda x: 'Renewal' if x == 'Joined' else x)
# Pull first joined from original
first_joined = df.apply(lambda x: x=='Joined', axis=1).apply(pd.Series.idxmax, axis=1)
# Replace first "Joined" in the copy
for item in first_joined.items():
if df.loc[item] == 'Joined':
df2.loc[item] = 'Joined'
Output
# print(df2)
2012 2013 2014 2015
0 Renewal Renewal Joined Renewal
1 Joined Renewal Renewal Renewal
2 NaN Joined Renewal Renewal
3 NaN NaN Renewal Renewal