Suppose I have a data frame as follows:
df = pd.DataFrame({
'Column A': [12,12,12, 13, 15, 16, 141, 141, 141, 141],
'Column B':['Apple' ,'Apple' ,'Orange' ,'Apple' , np.nan, 'Orange', 'Apple', np.nan, 'Apple', 'Apple']})
Based on these conditions:
If values in column A are repeated then count the word 'Orange' in Column B and paste it in new Column C(For example, there are 3 rows for 12, the count of 'Orange' is 1, and this 1 should be in new column C). For the non-repeat rows, just paste the corresponding values.
If values in column A are repeated then count the word 'Apple' in Column B and paste it in new Column D(For example, there are 3 rows for 12, the count of 'Apple' is 2, and this 2 should be in new column D). For the non-repeat rows, just paste the corresponding values.
For repeated and non-repeated rows due to Column A, If the word 'Orange' is present in Column B, write 'yes' else 'No' in Column E.
I would like to have an output following. I was trying in python jupyter notebook, can anyone please help me to get an output like this:
| Column A | Column B |Column C |Column D |Column E
----- | -------- | ---------|---------|---------|---------
0 | 12 | Apple |1 |2 |Yes
1 | 13 | Apple |0 |1 |No
2 | 15 | NaN |NaN |NaN |NaN
3 | 16 | Orange |1 |0 |Yes
4 | 141 | Apple |0 |3 |No
Thanks in advance:)
CodePudding user response:
I think there is no powerful and simple solution for your question, but use the following code.
First, define a function count(x, a)
which returns nan
if x includes nan
, the number of occurence of a in x, otherwise.
The function will be used for the apply function.
Then, use groupby and apply list function.
temp = df.copy().groupby('Column A')['Column B'].apply(list)
After that, temp becomes
Column A
12 [Apple, Apple, Orange]
13 [Apple]
15 [nan]
16 [Orange]
141 [Apple, nan, Apple, Apple]
Name: Column B, dtype: object
So, based on the temp, we can count the number of apples and oranges.
Since df has duplicates, I removed them and add new columns (Column C, D, and E).
df.drop_duplicates(subset = ['Column A'], keep = "first", inplace = True)
df['Column C'] = temp.apply(count, a = "Orange").values
df['Column D'] = temp.apply(count, a = "Apple").values
df['Column E'] = df['Column D'].apply(lambda x:1 if x>=1 else 0)
Edit
I am sorry. I missed the function count..
def count(x, a):
if type(x[0]) == float:
return np.nan
else:
return x.count(a)