How to modify Excel rows with a condition by using DataFrame?

I am working on data annotation. The data is taken from tweets. Here are the sample rows I am working with. I have annotated about 5,000 rows in this format. The NonEnglish column is set when I find a text (tweet) that is not written in English.

ID	Text	Sentiment	Sadness	Joy	NonEnglish
1	I am sad	2	1	0	0
2	I am happy	1	0	1	0
3	Selamat pagi	0	0	0	1
4	Selamat siang	0	0	0	1

The requirement has suddenly changed. Previously, when I found a non-English tweet, I had to fill every label column (Sentiment, Sadness, Love, Anger, Joy, Fear) with 0 and set NonEnglish to 1. Now I have to leave all of those cells blank and fill in only NonEnglish. With the new requirement, the data should look like this:

ID	Text	Sentiment	Sadness	Love	Anger	Joy	Fear	NonEnglish
1	I am sad	2	1	0	0	0	0	0
2	I am happy	1	0	0	0	1	0	0
3	Selamat pagi							1
4	Selamat siang							1

Editing this manually is not feasible given the amount of data and the time pressure, so I plan to use a pandas DataFrame in Python to fix it. Is there any way I can accomplish this? An Excel sample is attached here.

CodePudding user response：

Assuming your DataFrame is called df, you could do something like this:

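import numpy as np

# Label columns to clear for non-English tweets
columns = ['Sentiment', 'Sadness', 'Love', 'Anger', 'Joy', 'Fear']
# Walk the rows and blank the labels wherever NonEnglish is flagged
for index, row in df.iterrows():
    if row['NonEnglish'] == 1:
        for col in columns:
            df.loc[index, col] = np.nan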

CodePudding user response：

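We can select every column other than id, text and NonEnglish, then filter for the rows where NonEnglish equals 1 and set the selected columns to NaN on those rows. The names passed to difference must match the DataFrame's headers exactly, including case (the sample in the question uses ID and Text); otherwise those columns are selected and blanked as well.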

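import numpy as np

# Select every column except the identifiers and the NonEnglish flag
cols = df.columns.difference(['id', 'text', 'NonEnglish'])
# Set those columns to NaN on the rows flagged as non-English
df.loc[df['NonEnglish'] == 1, cols] = np.nan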

Result:

print(df)

   id           text  sentiment  sadness  love  anger  joy  fear  NonEnglish
0   1       I am sad        2.0      1.0   0.0    0.0  0.0   0.0           0
1   2     I am happy        1.0      0.0   0.0    0.0  1.0   0.0           0
2   3   Selamat pagi        NaN      NaN   NaN    NaN  NaN   NaN           1
3   4  Selamat siang        NaN      NaN   NaN    NaN  NaN   NaN           1
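For the Excel round trip the question asks about, a minimal sketch might look like the following. The file names annotations.xlsx and annotations_fixed.xlsx and the helpers blank_non_english and fix_workbook are hypothetical, the column names are taken from the question's table, and reading or writing .xlsx files assumes an Excel engine such as openpyxl is installed. The label columns are cast to float first so they can hold NaN without relying on implicit upcasting. to_excel writes NaN as an empty cell by default (na_rep=''), so the saved sheet shows blanks.

```python
import numpy as np
import pandas as pd

# Column names as they appear in the question's table (assumed headers).
LABEL_COLS = ["Sentiment", "Sadness", "Love", "Anger", "Joy", "Fear"]


def blank_non_english(df, label_cols=LABEL_COLS, flag_col="NonEnglish"):
    """Return a copy of df with the label columns set to NaN where flag_col == 1."""
    out = df.copy()
    # Cast up front so the integer label columns can store NaN.
    out[label_cols] = out[label_cols].astype(float)
    out.loc[out[flag_col] == 1, label_cols] = np.nan
    return out


def fix_workbook(src="annotations.xlsx", dst="annotations_fixed.xlsx"):
    """Hypothetical round trip: read the sheet, blank the labels, save a new file."""
    df = pd.read_excel(src)
    # NaN is written as an empty cell because na_rep defaults to ''.
    blank_non_english(df).to_excel(dst, index=False)


# In-memory version of the question's sample rows.
sample = pd.DataFrame(
    [
        (1, "I am sad", 2, 1, 0, 0, 0, 0, 0),
        (2, "I am happy", 1, 0, 0, 0, 1, 0, 0),
        (3, "Selamat pagi", 0, 0, 0, 0, 0, 0, 1),
        (4, "Selamat siang", 0, 0, 0, 0, 0, 0, 1),
    ],
    columns=["ID", "Text", *LABEL_COLS, "NonEnglish"],
)

fixed = blank_non_english(sample)
print(fixed)
```

Calling fix_workbook() on the real file would then apply the same change to all rows and save the result under a new name, leaving the original workbook untouched.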

If you want to set them to a blank string instead, you can use:

cols = df.columns.difference(['id', 'text', 'NonEnglish'])

df.loc[df['NonEnglish'] == 1, cols] = ''

Bear in mind, though, that after filling those entries with blank strings, the affected columns become object dtype (a mix of numbers and strings) instead of a numeric dtype. If you only need the figures for display purposes, that is fine. If you want to use the data for further calculations, you had better take the first approach and fill those entries with NaN.

Result:

print(df)

   id           text sentiment sadness love anger joy fear  NonEnglish
0   1       I am sad         2       1    0     0   0    0           0
1   2     I am happy         1       0    0     0   1    0           0
2   3   Selamat pagi                                                 1
3   4  Selamat siang                                                 1
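To illustrate the dtype caveat, here is a small sketch on a made-up two-column frame. The variable names as_nan, as_blank and as_int are only for illustration. It also shows one option not mentioned above: pandas' nullable Int64 dtype, which keeps the remaining values as integers while still marking the blanked cells as missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sentiment": [2, 1, 0, 0], "NonEnglish": [0, 0, 1, 1]})
mask = df["NonEnglish"] == 1

# Variant 1: NaN in a float column (stays numeric).
as_nan = df.copy()
as_nan["sentiment"] = as_nan["sentiment"].astype(float)
as_nan.loc[mask, "sentiment"] = np.nan

# Variant 2: blank strings (column becomes object dtype).
as_blank = df.copy()
as_blank["sentiment"] = as_blank["sentiment"].astype(object)
as_blank.loc[mask, "sentiment"] = ""

# Variant 3: nullable integers with pd.NA (stays numeric, keeps integers).
as_int = df.copy()
as_int["sentiment"] = as_int["sentiment"].astype("Int64")
as_int.loc[mask, "sentiment"] = pd.NA

print(as_nan["sentiment"].dtype)    # float64
print(as_blank["sentiment"].dtype)  # object
print(as_int["sentiment"].dtype)    # Int64

# Missing values are skipped in numeric columns, so the mean covers rows 0 and 1.
print(as_nan["sentiment"].mean())   # 1.5

# The blank-string column has to be converted back before doing arithmetic.
recovered = pd.to_numeric(as_blank["sentiment"], errors="coerce")
print(recovered.mean())             # 1.5
```

With the blank-string variant, every later calculation needs a conversion step like the pd.to_numeric call, which is the practical cost of choosing '' over NaN.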