I have a table that has a bunch of columns and I need to create a new column based on the row type but the logic will be different for each type of row.
My data looks like this:
type | field1 | field2 | field3 | field4 |
---|---|---|---|---|
1 | a | b | c | 17 |
2 | e | f | g | 20 |
3 | i | j | k | 100 |
the logic for rows of type 1 is concatenating field1, field2, field3
the logic for rows of type 2 is concatenating field2, field3, field4
the logic for rows of type 3 is squaring field4
The super important part I would like to avoid coding each type manually as there are hundreds of different types each with their own distinct logic that will change constantly. We enforce strict SDLC so deploying updates would be a nightmare. Ideally I would put this logic into a SQL table somewhere and then just somehow use the data in my pandas logic but I don't know how to do that sort of thing.
How would you go about doing something like this?
Example:
data = pd.read_sql(query) #above data
rules = pd.read_sql(query)
rules.head()
Type | Rule |
---|---|
1 | field1 field2 field3 |
2 | field2 field3 field4 |
3 | field4**2 |
for i in rules:
data['output'] = data[filtered to i.type].apply(i.fancyLogic)
data.head()
output |
---|
abc |
fg20 |
10000 |
CodePudding user response:
If df
is your pandas dataframe, I'd define a function:
def rowwise(x):
if x['type '] == 1:
return x['field1 '] x['field2 '] x['field3 ']
if x['type '] == 2:
return x['field2 '] x['field3 '] str(x['field4'])
if x['type '] == 3:
return str(float(x['field4'])**2)
then apply it with the option axis=1
df.apply(rowwise, axis=1)
A couple of gotchas are that concatenation only works with strings, and the resulting column should have a consistent datatype. That's why there's so much type casting in the function.
CodePudding user response:
You can try using:
cond1 = df['type'] == 1
cond2 = df['type'] == 2
cond3 = df['type'] == 3
result1 = df[['field1', 'field2', 'field2']].sum(1)
result2 = df[['field2', 'field3', 'field4']].astype('str').sum(1)
result3 = df['field4'] ** 2
df['result'] = np.select([cond1, cond2, cond3], [result1, result2, result3])
Output:
type field1 field2 field3 field4 result
0 1 a b c 17 abb
1 2 e f g 20 fg20
2 3 i j k 100 10000