Dataframe Insert Labels if filename starts with a 'b'

I want to create a dataframe and give a label to each file, based on the first letter of the filename.

This is where I created the dataframe, which works out fine:

[IN]
import pandas as pd

df = pd.read_csv('data.txt', sep="\t", names=['file', 'text', 'label'], header=None, engine='python')
texts = df['text'].values.astype("U")

print(df)

[OUT]
           file                                               text  label
0     b_001.txt  Ad sales boost Time Warner profitQuarterly pro...    NaN
1     b_002.txt  Dollar gains on Greenspan speechThe dollar has...    NaN
2     b_003.txt  Yukos unit buyer faces loan claimThe owners of...    NaN
3     b_004.txt  High fuel prices hit BA's profitsBritish Airwa...    NaN
4     b_005.txt  Pernod takeover talk lifts DomecqShares in UK ...    NaN
...         ...                                                ...    ...
2220  t_397.txt  BT program to beat dialler scamsBT is introduc...    NaN
2221  t_398.txt  Spam e-mails tempt net shoppersComputer users ...    NaN
2222  t_399.txt  Be careful how you codeA new European directiv...    NaN
2223  t_400.txt  US cyber security chief resignsThe man making ...    NaN
2224  t_401.txt  Losing yourself in online gamingOnline role pl...    NaN

Now I want to insert labels based on the filename:

for index, row in df.iterrows():
    if row['file'].startswith('b'):
        row['label'] = 0
    elif row['file'].startswith('e'):
        row['label'] = 1
    elif row['file'].startswith('p'):
        row['label'] = 2
    elif row['file'].startswith('s'):
        row['label'] = 3
    else:
        row['label'] = 4

print(df)

[OUT]
           file                                               text  label
0     b_001.txt  Ad sales boost Time Warner profitQuarterly pro...      4
1     b_002.txt  Dollar gains on Greenspan speechThe dollar has...      4
2     b_003.txt  Yukos unit buyer faces loan claimThe owners of...      4
3     b_004.txt  High fuel prices hit BA's profitsBritish Airwa...      4
4     b_005.txt  Pernod takeover talk lifts DomecqShares in UK ...      4
...         ...                                                ...    ...
2220  t_397.txt  BT program to beat dialler scamsBT is introduc...      4
2221  t_398.txt  Spam e-mails tempt net shoppersComputer users ...      4
2222  t_399.txt  Be careful how you codeA new European directiv...      4
2223  t_400.txt  US cyber security chief resignsThe man making ...      4
2224  t_401.txt  Losing yourself in online gamingOnline role pl...      4

As you can see, every row got the label 4. What did I do wrong?

CodePudding user response：

Here is one way to do it.

Instead of a for loop, you can use map to assign the values to the label column:

# dictionary mapping the first letter of the filename to a label
d = {'b': 0, 'e': 1, 'p': 2, 's': 3}
else_val = 4

# take the first character of the filename and map it using the dictionary;
# letters missing from the dictionary become NaN and are filled with else_val (4)
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)

           file                                               text  label
0     b_001.txt  Ad sales boost Time Warner profitQuarterly pro...      0
1     b_002.txt  Dollar gains on Greenspan speechThe dollar has...      0
2     b_003.txt  Yukos unit buyer faces loan claimThe owners of...      0
3     b_004.txt  High fuel prices hit BA's profitsBritish Airwa...      0
4     b_005.txt  Pernod takeover talk lifts DomecqShares in UK ...      0
...         ...                                                ...    ...
2220  t_397.txt  BT program to beat dialler scamsBT is introduc...      4
2221  t_398.txt  Spam e-mails tempt net shoppersComputer users ...      4
2222  t_399.txt  Be careful how you codeA new European directiv...      4
2223  t_400.txt  US cyber security chief resignsThe man making ...      4
2224  t_401.txt  Losing yourself in online gamingOnline role pl...      4
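
As a quick self-contained check, the same mapping can be run on a small hand-made dataframe. The filenames below are made up for illustration (only the b_ and t_ prefixes appear in the output above), so treat this as a sketch and not as the real data:

```python
import pandas as pd

# Hypothetical sample: these filenames are assumptions for illustration only.
df = pd.DataFrame({
    'file': ['b_001.txt', 'e_001.txt', 'p_001.txt', 's_001.txt', 't_397.txt'],
})

d = {'b': 0, 'e': 1, 'p': 2, 's': 3}
else_val = 4

# First character -> label; letters missing from d become NaN, then else_val.
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)

print(df['label'].tolist())  # prints [0, 1, 2, 3, 4]
```

Because map returns NaN for any letter that is not a key in d, the fillna call plays the role of the else branch in the original loop.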

CodePudding user response：

According to the documentation, using iterrows() to modify a dataframe is not guaranteed to work in all cases, because iterrows() returns each row as a Series and does not preserve dtypes across rows. The documentation states:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
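
A minimal sketch of what that note means in practice, using a made-up three-row dataframe (the filenames and the names still_empty and prefix_to_label are assumptions for illustration). Writing to the row yielded by iterrows() only changes a temporary Series here, so if you want to keep a loop, one option is to collect the labels and assign the column once after the loop:

```python
import pandas as pd

# Hypothetical sample data; 'label' starts out empty, as in the question.
df = pd.DataFrame({
    'file': ['b_001.txt', 'e_001.txt', 't_397.txt'],
    'label': [float('nan')] * 3,
})

# Writing to the yielded row: with mixed column dtypes like here,
# 'row' is a temporary Series, so df itself is not updated.
for index, row in df.iterrows():
    row['label'] = 0
still_empty = bool(df['label'].isna().all())

# Collect the labels while looping, then assign the column once afterwards.
prefix_to_label = {'b': 0, 'e': 1, 'p': 2, 's': 3}  # any other letter -> 4
labels = []
for index, row in df.iterrows():
    labels.append(prefix_to_label.get(row['file'][0], 4))
df['label'] = labels

print(still_empty, df['label'].tolist())  # prints: True [0, 1, 4]
```

The loop only reads from each row, and the dataframe is changed in a single assignment after iteration has finished.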

A safer approach is to compute each label with a function and assign the whole column at once:

def label(filename):
    # return the label for a filename based on its first letter
    if filename.startswith('b'):
        return 0
    elif filename.startswith('e'):
        return 1
    elif filename.startswith('p'):
        return 2
    elif filename.startswith('s'):
        return 3
    else:
        return 4

# apply the function to every value in the 'file' column
df['label'] = df['file'].apply(label)