My company tracks rejection issues in a 3rd party system. Any given ticket can have multiple reasons for rejection. My coworker exports the list of rejected tickets to an Excel file to ultimately use in data visualization.
I created a Jupyter Notebook file that will split out the reasons into individual columns which are true or false. There are currently 10 possible reasons, so I have 10 separate functions that check if each value is true, and run 10 separate lambdas. It works perfectly, but it is not very clean or maintainable.
I am struggling to find the right way (or even just a way that works) to combine all those functions and lambdas into cleaner code.
I have a series of 10 functions, one for each reason, that are almost identical:
def reason_one(x):
    value = 0
    if 'reason_one' in x:
        value = 1
    else:
        pass
    return value

def reason_two(x):
    value = 0
    if 'reason_two' in x:
        value = 1
    else:
        pass
    return value
and so on, for all 10 reasons we currently use.
Then, I run 10 nearly identical lambdas, one after the other:
df['Reason One'] = df['Labels'].map(lambda x: reason_one(x))
df['Reason Two'] = df['Labels'].map(lambda x: reason_two(x))
Is there a way to clean this up? Ideally, I would like to create a dictionary that has all the reason codes and the columns they should be named, then loop through the Labels column on the dataframe for each possible value, adding a column each time.
I have my dictionary set up:
error_list = {
    'reason_one': 'Reason One',
    'reason_two': 'Reason Two',
    'reason_three': 'Reason Three',
    'reason_four': 'Reason Four'
}
I like this because my coworker would be able to just change that list and run the notebook and have everything work. For example, he might add a new reason; or edit the column name for a given reason code to be more clear.
My idea was to create a function that takes in a dictionary and a column, iterates over the dictionary keys, appends either 0 or 1 to an empty list, then uses that list to create a new column.
I got this far:
def breakout_columns(errors, column):
    column_values = []
    for key in errors:
        if key in column:
            value = 1
        else:
            value = 0
        column_values.append(value)
    print(column_values)
That does indeed produce a list with 10 values when I run it; however, they are all 0s even when some of them should be 1. I was looking for resources on iterating over dataframe rows, and I am not seeing anything remotely like what I am trying to do.
Beyond this piece not working, I am concerned my approach is inherently flawed and either (a) I should be doing something completely different to try to clean things up; or (b) what I am trying to do is not possible or does not make sense, so I need to just stick with 10 functions and 10 lambdas.
Any guidance would be greatly appreciated!
CodePudding user response:
You can loop over your error_list and create each new series by comparing the given column to your reasons (and cast to an int if you want 0 or 1 instead of False and True):
import pandas as pd

error_list = {
    "reason_one": "Reason One",
    "reason_two": "Reason Two",
    "reason_three": "Reason Three",
    "reason_four": "Reason Four",
}

df = pd.DataFrame(
    {
        "Labels": [
            "reason_two",
            "reason_two",
            "reason_one",
            "cat",
            "reason_four",
            "many",
            "sandwich",
        ]
    }
)

# Add one 0/1 column per reason code, named after the dictionary value.
for reason, column_name in error_list.items():
    df[column_name] = (df["Labels"] == reason).astype(int)

print(df)
prints out
        Labels  Reason One  Reason Two  Reason Three  Reason Four
0   reason_two           0           1             0            0
1   reason_two           0           1             0            0
2   reason_one           1           0             0            0
3          cat           0           0             0            0
4  reason_four           0           0             0            1
5         many           0           0             0            0
6     sandwich           0           0             0            0
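Note that == only matches rows whose label is exactly one reason. Since the question says a ticket can have multiple reasons, a substring test may be closer to the original 'reason_one' in x check. The sketch below assumes each Labels cell is a single string such as "reason_one, reason_two"; the real export format is not shown, so the sample data is made up, and breakout_columns here is a hypothetical rewrite of the function from the question rather than a definitive implementation. It also illustrates a likely cause of the all-zero result: key in column on a pandas Series tests the index labels, not the values.

```python
import pandas as pd

# Hypothetical mapping and sample data. The comma-separated "Labels"
# format is an assumption; the question does not show real values.
error_list = {
    "reason_one": "Reason One",
    "reason_two": "Reason Two",
    "reason_three": "Reason Three",
}

df = pd.DataFrame(
    {
        "Labels": [
            "reason_one, reason_two",
            "reason_three",
            "cat",
            None,
        ]
    }
)

# `in` on a Series checks the index (0, 1, 2, 3 here), not the values,
# so this is False even though "cat" is one of the values.
print("cat" in df["Labels"])


def breakout_columns(frame, errors, column="Labels"):
    """Return a copy of frame with one 0/1 column per reason code."""
    result = frame.copy()
    # Treat missing labels as empty strings so they match nothing.
    labels = result[column].fillna("").astype(str)
    for reason, column_name in errors.items():
        # regex=False makes this a plain substring test, like `reason in x`.
        matches = labels.str.contains(reason, regex=False)
        result[column_name] = matches.astype(int)
    return result


out = breakout_columns(df, error_list)
print(out)
```

Because this is a substring test, a code such as reason_one would also match a longer label like reason_one_b. If that matters for your data, splitting each string and comparing whole tokens would be safer.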