numpy vectorization(accumulate variable)

Time:02-14

I have a dataframe which contains a few columns. It looks like this:

A B C D
1 10 a Nan
2 11 b Nan
3 12 c Nan

If column C contains 'b', I should compute A + B; in all other cases, A*B. On top of that, I have a variable that accumulates a value (the code below should make this clearer). So I wrote this code:

z = 0
for i, row in df.iterrows():
    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
    df.at[i, 'D'] = d

But df.iterrows() is an antipattern and I should avoid it in my code, because it will become a problem as my data set grows. I have tried to use vectorization, but I can't figure out how to accumulate. The code looks like this:

z = 0
con = (df['C'] == 'b',
      df['C'] != 'b')
choise = (
    (df['A'] + dfs['B'], z + 2),
    (df['A'] * dfs['B'], )
)

dfs['D'], z = np.select(con, choise)

Can someone help me with that? How to accumulate variable z?

CodePudding user response:

Why not:

z = 0
new_D = []

for row in df.itertuples():
    if row.C == 'b':
        new_D.append(row.A + row.B + z)
        z += 2
    else:
        new_D.append(row.A * row.B)

df['D'] = new_D

CodePudding user response:

I'm puzzled as to what that first code block is supposed to be doing:

z = 0
for i, row in df.iterrows():
    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
    df.at[i, 'D'] = d

i and row are the iteration variables, but you don't use row, and only use i at the end to set something in the original df.

Do you understand what iterrows does (other than calling it an "antipattern")?

Look at a small df:

In [168]: df = pd.DataFrame(np.arange(6).reshape(2,3), columns=['A','B','C'])
In [169]: df
Out[169]: 
   A  B  C
0  0  1  2
1  3  4  5

and run iterrows with lots of prints:

In [170]: for i, row in df.iterrows():
     ...:     print('==========')
     ...:     print(i, type(row));print(row)
     ...:     a = df['A']
     ...:     print('a', type(a));print(a)
     ...: 
==========
0 <class 'pandas.core.series.Series'>
A    0
B    1
C    2
Name: 0, dtype: int64
a <class 'pandas.core.series.Series'>
0    0
1    3
Name: A, dtype: int64
==========
1 <class 'pandas.core.series.Series'>
A    3
B    4
C    5
Name: 1, dtype: int64
a <class 'pandas.core.series.Series'>
0    0
1    3
Name: A, dtype: int64

row is a pandas Series (the same type as one column of a dataframe), holding the data from one row. It's as if the row were turned into a column. df['A'] is also a Series, but it is one of the df columns.

That whole:

a = df['A']
b = df['B']
c = df['C']
if c == 'b':
    d = a + b + z
    z = z + 2
else:
    d = a*b

block of code is working with the columns of the frame - whole columns, not values from one row. There's no point in repeating those calculations again and again in the loop.

c is a Series, so if c == 'b' will raise an error. Using the a from my example:

The '==' test produces a Series:

In [172]: a==3
Out[172]: 
0    False
1     True
Name: A, dtype: bool

Using that Series in an if raises an ambiguity error.

In [173]: if a==3: print('yes')
Traceback (most recent call last):
  File "<ipython-input-173-1ccc6f02d1f6>", line 1, in <module>
    if a==3: print('yes')
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
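
The error message itself points at the ways out: either reduce the boolean Series to a single value explicitly, or use it as a mask. A small sketch, reusing the values of column A from the example above (the variable names are only illustrative):

```python
import pandas as pd

a = pd.Series([0, 3], name="A")

mask = a == 3                  # element-wise comparison, a boolean Series
any_match = bool(mask.any())   # is at least one element equal to 3?
all_match = bool(mask.all())   # are all elements equal to 3?
selected = a[mask]             # keep only the elements where the mask is True

print(any_match, all_match)
print(selected)
```

Which of these is appropriate depends on what the if was meant to test; for a per-row decision, none of them replaces reading the scalar from the row.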

So your use of iterrows is more than an "anti-pattern". The code that uses it is just plain wrong. I have gone into a lot of detail because I think you need more than a "quick" answer. You need to understand what is happening in your code.
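
For comparison, here is a sketch of what the loop was presumably meant to do, reading scalar values from row instead of whole columns from df. The sample frame is the one from the question, and the intent is my assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 11, 12], "C": ["a", "b", "c"]})
df["D"] = np.nan  # placeholder column, as in the question's table

z = 0
for i, row in df.iterrows():
    # row is a Series holding one row, so these are scalars
    a, b, c = row["A"], row["B"], row["C"]
    if c == "b":
        d = a + b + z
        z += 2  # accumulate only on 'b' rows
    else:
        d = a * b
    df.at[i, "D"] = d

print(df)
```

This still loops row by row, so it does not address the speed concern; it only makes the logic do what the description says.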
