I want to create a dataframe and give a label to each file, based on the first letter of the filename.
This is where I created the dataframe, which works fine:
[IN]
import pandas as pd
df = pd.read_csv('data.txt', sep="\t", names=['file', 'text', 'label'], header=None, engine='python')
texts = df['text'].values.astype("U")
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... NaN
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... NaN
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... NaN
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... NaN
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... NaN
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... NaN
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... NaN
2222 t_399.txt Be careful how you codeA new European directiv... NaN
2223 t_400.txt US cyber security chief resignsThe man making ... NaN
2224 t_401.txt Losing yourself in online gamingOnline role pl... NaN
Now I want to insert labels based on the filename:
for index, row in df.iterrows():
    if row['file'].startswith('b'):
        row['label'] = 0
    elif row['file'].startswith('e'):
        row['label'] = 1
    elif row['file'].startswith('p'):
        row['label'] = 2
    elif row['file'].startswith('s'):
        row['label'] = 3
    else:
        row['label'] = 4
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 4
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 4
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 4
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 4
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 4
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
As you can see, every row got the label 4. What did I do wrong?
CodePudding user response:
Here is one way to do it.
Instead of a for loop, you can use map to assign the values to the label column:
# create a dictionary mapping the first letter of the filename to its label
d = {'b': 0, 'e': 1, 'p': 2, 's': 3}
else_val = 4
# take the first character of the filename and map it using the dictionary;
# letters missing from the dictionary become NaN, which fillna replaces with 4
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)
file text label
0 0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 0
1 1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 0
2 2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 0
3 3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 0
4 4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 0
5 2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
6 2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
7 2222 t_399.txt Be careful how you codeA new European directiv... 4
8 2223 t_400.txt US cyber security chief resignsThe man making ... 4
9 2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
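To try the map approach without data.txt, here is a minimal self-contained sketch. The three-row frame and its filenames are made up for illustration; only the dictionary and the map/fillna chain come from the answer above.

```python
import pandas as pd

# Hypothetical sample standing in for the 'file' column of data.txt
sample = pd.DataFrame({'file': ['b_001.txt', 's_010.txt', 't_397.txt']})

# First letter -> label; any other letter falls back to 4
d = {'b': 0, 'e': 1, 'p': 2, 's': 3}
sample['label'] = sample['file'].str[:1].map(d).fillna(4).astype(int)

print(sample)
```

The 't' filename is not in the dictionary, so map yields NaN for it and fillna turns that into the fallback label.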
CodePudding user response:
According to the documentation, using iterrows() to modify a dataframe is not guaranteed to work in all cases, partly because it does not preserve dtypes across rows. The documentation says:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
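Therefore, do the following instead:
def label(filename):
    # return the numeric label for a filename, based on its first letter
    if filename.startswith('b'):
        return 0
    elif filename.startswith('e'):
        return 1
    elif filename.startswith('p'):
        return 2
    elif filename.startswith('s'):
        return 3
    else:
        return 4
# pass each row's filename to label() and store the result in the label column
df['label'] = df.apply(lambda row: label(row['file']), axis=1)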
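
The copy behaviour described in the documentation quote can be checked on a small frame. This is only a sketch on made-up data (the two filenames and the names demo and codes are hypothetical): with mixed column dtypes, writing to the row yielded by iterrows() leaves the dataframe untouched, whereas building the labels separately and assigning the whole column does update it.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with an object column and a float column
demo = pd.DataFrame({'file': ['b_001.txt', 't_397.txt'],
                     'label': [np.nan, np.nan]})

# Writing to the yielded row only changes a temporary copy
for index, row in demo.iterrows():
    row['label'] = 0
untouched = bool(demo['label'].isna().all())
print(untouched)

# Build the labels outside the frame, then assign the column in one step
codes = {'b': 0, 'e': 1, 'p': 2, 's': 3}
demo['label'] = [codes.get(name[0], 4) for name in demo['file']]
print(demo)
```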