I have huge data frame that looks like this:
Col1 Col2
0. 'w1'. 'N/A'
1. 'w2' 4.3
2. 'w3'. 1.2
4. 'w4'. 3.5
5. 'w5' 'N/A'
6. 'w6'. 3.1
7. 'w7'. 2.4
8. 'w8'. 1.7
9. 'w9'. 4.6
10. 'w10'. 'N/A'
11. 'w11'. 3.0
I have strings in the first column. In the sercond column I have floats and the 'N/A' string in some rows. The strings in col1 are answer options to a question, the floats in col2 are the answer avg. counts and the "N/A" string is the question itself. My intention is to create another column in this data frame named "Question_number" whose levels (Q1,Q2,Q3...) change every time a new 'N/A' row in col2 appears. Thus, my desired output is:
Col1 Col2 Col3
0. 'Q1' 'w1'. 'N/A'
1. 'Q1' 'w2' 4.3
2. 'Q1' 'w3'. 1.2
4. 'Q1' 'w4'. 3.5
5. 'Q2' 'w5' 'N/A'
6. 'Q2' 'w6'. 3.1
7. 'Q2'. 'w7'. 2.4
8. 'Q2' 'w8'. 1.7
9. 'Q3' 'w9'. 'N/A'
10. 'Q3' 'w10'. 2.0
11. 'Q3' 'w11'. 3.0
Can anyone help me with this? I tried this:
df['question_number']=np.where(df['counts']=='N/A', "Q1", "Q2", 'Q3')
But that does not work and I have no idea on how to do this. Can anyone help, please?
CodePudding user response:
Update: @enke's comment is MUCH cleaner than my code. If you want to place the new column in the first position, simply add the column first.
# insert new column in first position
df.insert(0,'new','')
# Line from @enke's comment
df['new'] = 'Q' df['Col2'].eq('N/A').cumsum().astype(str)
My original post is below and you can ignore if ya wanna!
I inserted a new column in the first column position then loop through each row. I rename the column names just because I can. :D I did not include the quotes and periods in your original df. But the code below may still be useful.
import pandas as pd
cols = ['Answer', 'Avg_Score']
data=[['w1', 'N/A'],
['w2', 4.3],
['w3', 1.2],
['w4', 3.5],
['w5', 'N/A'],
['w6', 3.1],
['w7', 2.4],
['w8', 1.7],
['w9', 4.6],
['w10', 'N/A'],
['w11', 3.0]]
df = pd.DataFrame(data, columns = cols)
# insert new column before the Answer and the Avg_Score
df.insert(0,'Question','')
# start the question counter at 0
qnum = 0
# loop through each row
for index,row in df.iterrows():
# if 'N/A' found increase the question counter
# this assume first row will always have an 'N/A'
if df.loc[index,'Avg_Score'] == 'N/A':
qnum = 1
df.loc[index,'Question'] = 'Q{}'.format(qnum)
print(df)
Ouput:
Question Answer Avg_Score
0 Q1 w1 N/A
1 Q1 w2 4.3
2 Q1 w3 1.2
3 Q1 w4 3.5
4 Q2 w5 N/A
5 Q2 w6 3.1
6 Q2 w7 2.4
7 Q2 w8 1.7
8 Q2 w9 4.6
9 Q3 w10 N/A
10 Q3 w11 3.0