I have a Python program that displays texts to be labeled by the user. After the user adds labels to the displayed text, the program should create a new data frame with the text presented for labelling in the first column and the labels entered by the user in the second column. This new data frame will then be appended to an existing data frame. My program runs, but the new data frame puts each label entered by the user in a separate row, with the text repeated in every row. My desired output is something like:
  corpus  labels
0 text1   x, y, z
1 text2   a,b
3 text3   c,d,e
But with my code I am getting:
  corpus  labels
0 text1   x,y,z
1 text1   x,y,z
2 text1   x,y,z
How can I get my desired output? My code is below:
count=1
for i in sorted_dict:
    count +=1
    a=pool_df['corpus'][i]
    print(f'\n\nText {count}: index {i} \n\n{a}')
    question=input('Enter new label(s) for this text? type Y for yes or N for no: ')
    question.lower()
    if question == 'n':
        print('see you later')
        break
    elif question == 'y':
        print('\n\nIf you think that the label printed is associated with the corpus, type the label otherwise hit "space"\n\n')
        new_label1=input('x: ')
        new_label2=input('y: ')
        new_label3=input('z: ')
        new_label4=input('a: ')
        new_label5=input('b: ')
        new_label6=input('c: ')
        new_label7=input('d: ')
        list_new_labels=[new_label1,new_label2,new_label3,new_label4,new_label5,new_label6,new_label7]
        list_new_labels1=[]
        for i in list_new_labels:
            if i != '':
                list_new_labels1.append(i)
        print(f'The new labels are: {list_new_labels1}')
        df_new_labels={'corpus': a, 'zero_level_name': list_new_labels1}
        df_new_labels=pd.DataFrame(df_new_labels)
        df_new_labels
Answer:
There are two parts to why this is not working as intended.
Error 1: df_new_labels is created anew for every text, so each iteration overwrites the result of the previous one. Instead, each new text and its labels should be appended to lists that persist across iterations.
Error 2: when creating the DataFrame with df_new_labels=pd.DataFrame(df_new_labels), pandas repeats the single corpus string to match the length of your list of labels, which produces one row per label. To circumvent this, the labels should be a list of lists, so that each inner list occupies a single cell.
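To see Error 2 in isolation, here is a minimal sketch that builds the frame both ways. The values 'text1' and 'x', 'y', 'z' are placeholders taken from the question, and the variable names repeated and single_row are made up for illustration.

```python
import pandas as pd

# A scalar next to a list: pandas repeats the scalar once per list element.
repeated = pd.DataFrame({'corpus': 'text1', 'zero_level_name': ['x', 'y', 'z']})
print(repeated)  # three rows, each with 'text1' in the corpus column

# Wrapping both values in an outer list gives one row,
# and the label cell holds the whole inner list.
single_row = pd.DataFrame({'corpus': ['text1'], 'zero_level_name': [['x', 'y', 'z']]})
print(single_row)  # a single row
```

The second form is the one the corrected loop relies on: one outer-list entry per text.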
Given the following example inputs:
import pandas as pd

pool_df = pd.DataFrame(columns=['corpus'], data=['text1', 'text2', 'text3'])
sorted_dict = [1, 0, 2]
One way to write this code is as follows:
new_labels = {'corpus': [], 'zero_level_name': []}  # lists to store new entries, see Error 1
count = 0  # start at 0 so the first text is numbered 1
for i in sorted_dict:
    count += 1
    a = pool_df['corpus'][i]
    print(f'\n\nText {count}: index {i} \n\n{a}')
    question = input('Enter new label(s) for this text? type Y for yes or N for no: ')
    question = question.strip().lower()  # str.lower() returns a new string, so reassign it
    if question == 'n':
        print('see you later')
        break
    elif question == 'y':
        print('\n\nIf you think that the label printed is associated with the corpus, type the label otherwise hit "space"\n\n')
        new_label1 = input('x: ')
        new_label2 = input('y: ')
        new_label3 = input('z: ')
        new_label4 = input('a: ')
        new_label5 = input('b: ')
        new_label6 = input('c: ')
        new_label7 = input('d: ')
        list_new_labels = [new_label1, new_label2, new_label3, new_label4, new_label5, new_label6, new_label7]
        # keep only non-blank answers; strip() also discards a lone space
        list_new_labels1 = [label.strip() for label in list_new_labels if label.strip()]
        print(f'The new labels are: {list_new_labels1}')
        new_labels['corpus'].append(a)
        new_labels['zero_level_name'].append(list_new_labels1)  # append list of labels to create list of lists of labels, see Error 2

# build the DataFrame once, after all texts have been labelled
df_new_labels = pd.DataFrame(new_labels)
df_new_labels
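If you would rather have the labels shown as a comma-separated string (as in the desired output) and then append the result to your existing data frame, something along these lines may work. This is only a sketch: existing_df and its contents are hypothetical stand-ins for your real frame, and df_new_labels is hard-coded here so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-ins for the existing frame and the newly labelled texts.
existing_df = pd.DataFrame({'corpus': ['text0'], 'zero_level_name': ['q']})
df_new_labels = pd.DataFrame({
    'corpus': ['text1', 'text2'],
    'zero_level_name': [['x', 'y', 'z'], ['a', 'b']],
})

# Turn each list of labels into a single comma-separated string.
df_new_labels['zero_level_name'] = df_new_labels['zero_level_name'].apply(', '.join)

# Append the new rows; ignore_index=True renumbers the rows from 0.
combined = pd.concat([existing_df, df_new_labels], ignore_index=True)
print(combined)
```

Keeping the labels as lists is usually more convenient for further processing, so joining them is best left as a final display step.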