I have a dataframe which contains a few columns. It looks like this:
A | B | C | D |
---|---|---|---|
1 | 10 | a | NaN |
2 | 11 | b | NaN |
3 | 12 | c | NaN |
If I have 'b' in column C, I should compute A + B; in all other cases, A*B. On top of that, I have a variable that accumulates a value (you will see it in the code and it will be clearer). So I wrote this code:
z = 0
for i, row in df.iterrows():
    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
    df.at[i, 'D'] = d
But `df.iterrows()` is an antipattern and I should avoid it in my code, because it will become a problem as my data set grows. I have tried to use vectorization, but I can't figure out how to accumulate. The code looks like this:
z = 0
con = (df['C'] == 'b',
       df['C'] != 'b')
choise = (
    (df['A'] + df['B'], z + 2),
    (df['A'] * df['B'], )
)
df['D'], z = np.select(con, choise)
Can someone help me with that? How can I accumulate the variable z?
CodePudding user response:
Why not:
z = 0
new_D = []
for row in df.itertuples():
    if row.C == 'b':
        new_D.append(row.A + row.B + z)
        z += 2
    else:
        new_D.append(row.A * row.B)
df['D'] = new_D
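If a loop-free version is wanted, one possible sketch follows. It assumes the intended operation is A + B + z and that z grows by 2 after every 'b' row, so the z seen by a given row is 2 times the number of 'b' rows strictly before it. The helper name `accumulate_vectorized` and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd


def accumulate_vectorized(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: loop-free equivalent, assuming A + B + z for 'b' rows."""
    is_b = (df["C"] == "b").astype(int)
    # cumsum() counts 'b' rows up to and including the current one;
    # subtracting is_b leaves the count of 'b' rows strictly before it.
    z_before = 2 * (is_b.cumsum() - is_b)
    values = np.where(is_b == 1, df["A"] + df["B"] + z_before, df["A"] * df["B"])
    return pd.Series(values, index=df.index)


# Made-up sample with several 'b' rows so the accumulation is visible.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 11, 12, 13], "C": ["b", "a", "b", "b"]}
)
df["D"] = accumulate_vectorized(df)
print(df)
```

This only works because the update to z is a fixed increment; if z depended on the computed d, a cumulative sum would not be enough.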
CodePudding user response:
I'm puzzled as to what that first code block is supposed to be doing:
z = 0
for i, row in df.iterrows():
    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
    df.at[i, 'D'] = d
`i` and `row` are the iteration variables, but you don't use `row`, and only use `i` at the end to set something in the original `df`.
Do you understand what `iterrows` does (other than calling it an "antipattern")?
Look at a small df:
In [168]: df = pd.DataFrame(np.arange(6).reshape(2,3), columns=['A','B','C'])
In [169]: df
Out[169]:
   A  B  C
0  0  1  2
1  3  4  5
and do `iterrows` with lots of prints:
In [170]: for i, row in df.iterrows():
     ...:     print('==========')
     ...:     print(i, type(row));print(row)
     ...:     a = df['A']
     ...:     print('a', type(a));print(a)
     ...:
==========
0 <class 'pandas.core.series.Series'>
A    0
B    1
C    2
Name: 0, dtype: int64
a <class 'pandas.core.series.Series'>
0    0
1    3
Name: A, dtype: int64
==========
1 <class 'pandas.core.series.Series'>
A    3
B    4
C    5
Name: 1, dtype: int64
a <class 'pandas.core.series.Series'>
0    0
1    3
Name: A, dtype: int64
`row` is a pandas Series (like one column of a dataframe), holding the data from one row. It's as if it turned the row into a column. `df['A']` is also a Series, but it is one of the `df` columns.
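To make the difference concrete, here is a small sketch using the same 2x3 frame. The variable names `value` and `column` are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["A", "B", "C"])

for i, row in df.iterrows():
    value = row["A"]   # a single value: column A of the current row
    column = df["A"]   # the whole column, the same object contents on every pass
    print(i, value, len(column))
```

Indexing `row` gives one value per iteration, while indexing `df` gives the full column no matter which row the loop is on.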
That whole:
    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
block of code is working with the columns of the frame - whole columns, not values from one row. There's no point in repeating those calculations again and again in the loop.
`c` is a Series, so `if c == 'b'` will raise an error. Using the `a` from my example:
The `==` test produces a Series:
In [172]: a==3
Out[172]:
0    False
1     True
Name: A, dtype: bool
Using that Series in an `if` raises an ambiguity error.
In [173]: if a==3: print('yes')
Traceback (most recent call last):
  File "<ipython-input-173-1ccc6f02d1f6>", line 1, in <module>
    if a==3: print('yes')
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
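As the message suggests, an `if` needs a single True/False, so a boolean Series has to be reduced first. A minimal sketch with the same values as `a` above (the names `mask`, `any_match` and `all_match` are illustrative):

```python
import pandas as pd

a = pd.Series([0, 3], name="A")
mask = a == 3                      # boolean Series: [False, True]

any_match = bool(mask.any())       # True if at least one element matches
all_match = bool(mask.all())       # True only if every element matches

if any_match:
    print("at least one value equals 3")
if not all_match:
    print("not every value equals 3")
```

Which reduction is right depends on what the test is meant to say. In a row-by-row loop, neither is usually what you want; you want the single value from the current row.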
So your use of `iterrows` is more than an "anti-pattern". The code that uses it is just plain wrong. I have gone into a lot of detail because I think you need more than a "quick" answer. You need to understand what is happening in your code.
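For comparison, this is roughly what the loop would look like if it read values from `row` instead of whole columns. It is only a sketch: it assumes the intended operation is A + B + z, and it fills `D` with 0 up front (rather than NaN) so the column keeps an integer dtype.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 11, 12], "C": ["a", "b", "c"]})
df["D"] = 0  # placeholder column, assumed here instead of NaN

z = 0
for i, row in df.iterrows():
    # Take single values from the current row, not columns from df.
    a, b, c = row["A"], row["B"], row["C"]
    if c == "b":        # now a plain string comparison, no ambiguity
        d = a + b + z
        z = z + 2
    else:
        d = a * b
    df.at[i, "D"] = d

print(df)
```

This runs, but it is still a Python-level loop over rows, so it will be slow on a large frame.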