How to do line level logic in Pandas-CodePudding

I have a table that has a bunch of columns and I need to create a new column based on the row type but the logic will be different for each type of row.

My data looks like this:

type	field1	field2	field3	field4
1	a	b	c	17
2	e	f	g	20
3	i	j	k	100

the logic for rows of type 1 is concatenating field1, field2, field3

the logic for rows of type 2 is concatenating field2, field3, field4

the logic for rows of type 3 is squaring field4

The super important part I would like to avoid coding each type manually as there are hundreds of different types each with their own distinct logic that will change constantly. We enforce strict SDLC so deploying updates would be a nightmare. Ideally I would put this logic into a SQL table somewhere and then just somehow use the data in my pandas logic but I don't know how to do that sort of thing.

How would you go about doing something like this?

Example:

data = pd.read_sql(query) #above data
rules = pd.read_sql(query)
rules.head()

Type	Rule
1	field1 field2 field3
2	field2 field3 field4
3	field4**2

for i in rules:
    data['output'] = data[filtered to i.type].apply(i.fancyLogic)

data.head()

output
abc
fg20
10000

CodePudding user response：

If df is your pandas dataframe, I'd define a function:

def rowwise(x):
    if x['type '] == 1:
        return x['field1 ']   x['field2 ']   x['field3 ']
    if x['type '] == 2:
        return x['field2 ']   x['field3 ']   str(x['field4'])
    if x['type '] == 3:
        return str(float(x['field4'])**2)

then apply it with the option axis=1

df.apply(rowwise, axis=1)

A couple of gotchas are that concatenation only works with strings, and the resulting column should have a consistent datatype. That's why there's so much type casting in the function.

CodePudding user response：

You can try using:

cond1 = df['type'] == 1
cond2 = df['type'] == 2
cond3 = df['type'] == 3

result1 = df[['field1', 'field2', 'field2']].sum(1)
result2 = df[['field2', 'field3', 'field4']].astype('str').sum(1)
result3 = df['field4'] ** 2

df['result'] = np.select([cond1, cond2, cond3], [result1, result2, result3])

Output:

   type field1 field2 field3  field4 result
0     1      a      b      c      17    abb
1     2      e      f      g      20   fg20
2     3      i      j      k     100  10000