Unfortunately, I can't get a value calculated from one row written back to that same row, so that I end up with a dataframe that has two new columns of calculated values.
My dataframe looks like this:
VP | text1 | text2 |
---|---|---|
1 | Text1 | Text2 |
2 | Text3 | Text4 |
3 | Text5 | Text 6 |
My goal should look like this:
VP | text1 | text2 | error_count1 | error_count2 |
---|---|---|---|---|
1 | Text1 | Text2 | 2 | 5 |
2 | Text3 | Text4 | 4 | 7 |
3 | Text5 | Text 6 | 8 | 9 |
I tried this:
def compare_texts(text1: str, text2: str, data: pd.DataFrame, switch: bool):
    """
    Compare each text from data with text1 and text2. Return the errors found.
    :param text1: Correct Text 1
    :param text2: Correct Text 2
    :param data: dataframe of participant data
    :param switch: if True, compare the texts in swapped order
    :return data: new dataframe
    """
    # Insert new empty columns to be filled in.
    if switch == False:
        data["error_count1"] = ""
        data["error_count2"] = ""
    else:
        data["error_count1_rev"] = ""
        data["error_count2_rev"] = ""
    for index, row in data.iterrows():
        # get participant data into variables to pass as parameter
        participant = row['VP']
        pp_text1 = row['text1']
        pp_text2 = row['text2']
        if switch == False:
            error_count_1 = Levenshtein.distance(words(pp_text1), words(text1))
            error_count_2 = Levenshtein.distance(words(pp_text2), words(text2))
            data[index,'error_count1'] = error_count_1 # Here is the problematic code that needs to be adjusted
            data[index,'error_count2'] = error_count_2
        else: # Switch compared text, because we changed texts in week 3.
            error_count_1 = Levenshtein.distance(words(pp_text2), words(text1))
            error_count_2 = Levenshtein.distance(words(pp_text1), words(text2))
            data['error_count1_rev'] = error_count_1
            data['error_count2_rev'] = error_count_2
    return data
But the end result, unfortunately, looks like this:
VP | text1 | text2 | error_count1 | error_count2 | error_count1 | error_count2 | error_count1 | error_count2 |
---|---|---|---|---|---|---|---|---|
1 | Text1 | Text2 | 2 | 5 | 4 | 7 | 8 | 9 |
2 | Text3 | Text4 | 2 | 5 | 4 | 7 | 8 | 9 |
3 | Text5 | Text 6 | 2 | 5 | 4 | 7 | 8 | 9 |
If I omit "index", the last calculated value is stored in every row of the column.
So I somehow need to store each value only in its own row of the corresponding column.
CodePudding user response:
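Solution: use loc to address a single cell:
data.loc[index, 'error_count1'] = error_count_1
Writing data[index, 'error_count1'] instead creates a new column whose name is the tuple (index, 'error_count1'), which is why you get one extra column per row.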
By the way, I tested your code and got a result like this:
for idx, row in data.iterrows():
    data[idx, 'add col'] = idx

  text1 text2  (0, add col)  (1, add col)  (2, add col)
0   ABC   ABC             0             1             2
1   ABC   abc             0             1             2
2   XYZ   ABC             0             1             2
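For completeness, here is a minimal sketch of the question's loop with the loc fix applied. It is not the asker's exact code: the words helper is not shown in the question, so a plain whitespace split is assumed, and word_distance is a small hypothetical stand-in for Levenshtein.distance so the example runs without the extra package. The sample texts are made up for illustration.

```python
import pandas as pd


def words(text):
    # Assumption: the question's words() helper is not shown;
    # a whitespace split stands in for it here.
    return text.split()


def word_distance(a, b):
    # Hypothetical stand-in for Levenshtein.distance:
    # dynamic-programming edit distance over two lists of words.
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def compare_texts(text1, text2, data):
    data = data.copy()  # leave the caller's dataframe untouched
    # Create the target columns once, before the loop.
    data["error_count1"] = 0
    data["error_count2"] = 0
    for index, row in data.iterrows():
        # .loc[row_label, column_label] writes exactly one cell.
        data.loc[index, "error_count1"] = word_distance(words(row["text1"]), words(text1))
        data.loc[index, "error_count2"] = word_distance(words(row["text2"]), words(text2))
    return data


# Made-up sample data in the shape of the question's dataframe.
df = pd.DataFrame({
    "VP": [1, 2, 3],
    "text1": ["a b c", "a x c", "a"],
    "text2": ["d e", "d e", "e d f"],
})
result = compare_texts("a b c", "d e", df)
print(result)
```

Each row now holds only its own two counts, and no tuple-named columns are created.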
CodePudding user response:
I suggest using pandas.DataFrame.apply for this task. Consider the following simple example: let's say you have the columns text1 and text2, and your task is to find out whether they are the same, both case-sensitively and case-insensitively. Then you might do:
import pandas as pd
df = pd.DataFrame({'text1': ['ABC', 'ABC', 'XYZ'], 'text2': ['ABC', 'abc', 'ABC']})
def same(row):
    return {"sensitive": row["text1"] == row["text2"], "insensitive": row["text1"].lower() == row["text2"].lower()}
dfsame = df.apply(same, axis=1, result_type="expand")
dffinal = pd.concat([df, dfsame], axis=1)
print(dffinal)
Output:
  text1 text2  sensitive  insensitive
0   ABC   ABC       True         True
1   ABC   abc      False         True
2   XYZ   ABC      False        False
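Carried over to the question's error counts, the same apply pattern could be sketched roughly as below. This is only an illustration under a few assumptions: word_distance is a hypothetical stand-in for Levenshtein.distance, the texts are split on whitespace because the question's words helper is not shown, and the sample data is made up. The switch flag mirrors the question's swapped comparison and its _rev column names.

```python
import pandas as pd


def word_distance(a, b):
    # Hypothetical stand-in for Levenshtein.distance:
    # dynamic-programming edit distance over two lists of words.
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def count_errors(row, text1, text2, switch=False):
    # With switch=True the participant texts are compared in swapped order,
    # as in the question's else branch.
    first, second = (row["text2"], row["text1"]) if switch else (row["text1"], row["text2"])
    suffix = "_rev" if switch else ""
    # Assumption: whitespace split replaces the question's words() helper.
    return {
        f"error_count1{suffix}": word_distance(first.split(), text1.split()),
        f"error_count2{suffix}": word_distance(second.split(), text2.split()),
    }


# Made-up sample data in the shape of the question's dataframe.
df = pd.DataFrame({
    "VP": [1, 2, 3],
    "text1": ["a b c", "a x c", "a"],
    "text2": ["d e", "d e", "e d f"],
})

# args supplies the extra positional arguments after the row;
# result_type="expand" turns each returned dict into columns.
counts = df.apply(count_errors, axis=1, result_type="expand", args=("a b c", "d e"))
counts_rev = df.apply(count_errors, axis=1, result_type="expand", args=("a b c", "d e"), switch=True)
result = pd.concat([df, counts, counts_rev], axis=1)
print(result)
```

Because the function returns one dict per row, no explicit loop or cell-by-cell assignment is needed, and the original dataframe is left unchanged.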