I am working on data annotation. The data comes from tweets. Here are the sample rows I am working with. I have annotated about 5000 rows in this format. The NonEnglish column is set to 1 when I find a text (tweet) that is not written in English.
ID | Text | Sentiment | Sadness | Love | Anger | Joy | Fear | NonEnglish |
---|---|---|---|---|---|---|---|---|
1 | I am sad | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | I am happy | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | Selamat pagi | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | Selamat siang | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
The requirement suddenly changed. Previously, when I found a non-English tweet, I had to fill all the label columns (Sentiment, Sadness, Love, Anger, Joy, Fear) with 0 and set NonEnglish to 1. Now I have to leave all those cells blank (except NonEnglish) for non-English tweets. With the new requirement, the data should look like this:
ID | Text | Sentiment | Sadness | Love | Anger | Joy | Fear | NonEnglish |
---|---|---|---|---|---|---|---|---|
1 | I am sad | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | I am happy | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | Selamat pagi | | | | | | | 1 |
4 | Selamat siang | | | | | | | 1 |
It is not possible to edit this manually because of the amount of data and the time pressure, so I plan to use a pandas DataFrame in Python to fix this. Is there any way I can accomplish this? An Excel sample is attached here.
CodePudding user response:
Assuming your DataFrame is called df, you could do something like this:

    import numpy as np

    columns = ['Sentiment', 'Sadness', 'Love', 'Anger', 'Joy', 'Fear']
    # Walk every row and blank the label columns on rows flagged as non-English
    for index, row in df.iterrows():
        if row['NonEnglish'] == 1:
            for col in columns:
                df.loc[index, col] = np.nan
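Since the question mentions an Excel file, here is a minimal sketch of how the whole round trip might look. The file names `annotations.xlsx` and `annotations_fixed.xlsx` and the helper names `blank_non_english` and `fix_workbook` are hypothetical, and the sketch uses a boolean mask rather than the row loop above (it should give the same result on the sample rows). The label columns are cast to float first so that writing NaN does not depend on implicit upcasting, which newer pandas versions may warn about.

```python
import numpy as np
import pandas as pd

EMOTION_COLS = ["Sentiment", "Sadness", "Love", "Anger", "Joy", "Fear"]


def blank_non_english(df, cols=EMOTION_COLS, flag="NonEnglish"):
    """Return a copy where `cols` are NaN on rows flagged as non-English."""
    # Cast to float up front so NaN fits without an implicit dtype change.
    out = df.astype({c: "float64" for c in cols})
    out.loc[out[flag] == 1, cols] = np.nan
    return out


def fix_workbook(src="annotations.xlsx", dst="annotations_fixed.xlsx"):
    # Hypothetical file names; reading/writing .xlsx needs an engine such as openpyxl.
    df = pd.read_excel(src)
    # to_excel writes NaN as an empty cell by default (na_rep="").
    blank_non_english(df).to_excel(dst, index=False)


# Small in-memory check using the sample rows from the question.
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ["I am sad", "I am happy", "Selamat pagi", "Selamat siang"],
    "Sentiment": [2, 1, 0, 0],
    "Sadness": [1, 0, 0, 0],
    "Love": [0, 0, 0, 0],
    "Anger": [0, 0, 0, 0],
    "Joy": [0, 1, 0, 0],
    "Fear": [0, 0, 0, 0],
    "NonEnglish": [0, 0, 1, 1],
})
fixed = blank_non_english(sample)
print(fixed)
```

Calling `fix_workbook()` with your real paths would then read the sheet, blank the flagged rows, and write a new file, leaving the original untouched.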
CodePudding user response:
We can select the columns other than id, text and NonEnglish, then filter for rows where NonEnglish equals 1 and set the corresponding selected columns to NaN, as follows:

    import numpy as np

    cols = df.columns.difference(['id', 'text', 'NonEnglish'])
    df.loc[df['NonEnglish'] == 1, cols] = np.nan

Result:
    print(df)

       id           text  sentiment  sadness  love  anger  joy  fear  NonEnglish
    0   1       I am sad        2.0      1.0   0.0    0.0  0.0   0.0           0
    1   2     I am happy        1.0      0.0   0.0    0.0  1.0   0.0           0
    2   3   Selamat pagi        NaN      NaN   NaN    NaN  NaN   NaN           1
    3   4  Selamat siang        NaN      NaN   NaN    NaN  NaN   NaN           1
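The labels show up as 2.0, 1.0 and so on because NaN forces the columns to float. If you would rather keep whole numbers, one option is pandas' nullable integer dtype. The following is a minimal sketch under the assumption that the column names are lowercase as in the output above; the sample frame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["I am sad", "I am happy", "Selamat pagi", "Selamat siang"],
    "sentiment": [2, 1, 0, 0],
    "sadness": [1, 0, 0, 0],
    "love": [0, 0, 0, 0],
    "anger": [0, 0, 0, 0],
    "joy": [0, 1, 0, 0],
    "fear": [0, 0, 0, 0],
    "NonEnglish": [0, 0, 1, 1],
})

cols = df.columns.difference(["id", "text", "NonEnglish"])

# "Int64" (capital I) is the nullable integer dtype: it can hold pd.NA
# without turning the remaining values into floats.
df = df.astype({c: "Int64" for c in cols})
df.loc[df["NonEnglish"] == 1, cols] = pd.NA

print(df.dtypes)
print(df)
```

The missing entries print as `<NA>` and are written as empty cells by `to_csv` and `to_excel`, while the annotated rows stay integers.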
If you want to set those entries to a blank string instead, you can use:

    cols = df.columns.difference(['id', 'text', 'NonEnglish'])
    df.loc[df['NonEnglish'] == 1, cols] = ''

But bear in mind that after filling those entries with blank strings, the affected columns become object type (string type) instead of a numeric type. If you only need the figures for display, that is fine. But if you want to use the data in further calculations, you had better take the first approach and fill those entries with NaN.
Result:
    print(df)

       id           text  sentiment  sadness  love  anger  joy  fear  NonEnglish
    0   1       I am sad          2        1     0      0    0     0           0
    1   2     I am happy          1        0     0      0    1     0           0
    2   3   Selamat pagi                                                       1
    3   4  Selamat siang                                                       1
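To make the dtype caveat above concrete, here is a small sketch comparing the two fills on a cut-down version of the sample (only two label columns, to keep it short). The variable names `blank` and `nan` are just for illustration, and the explicit `astype` calls are an addition so the assignments do not rely on implicit upcasting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["I am sad", "I am happy", "Selamat pagi", "Selamat siang"],
    "sentiment": [2, 1, 0, 0],
    "joy": [0, 1, 0, 0],
    "NonEnglish": [0, 0, 1, 1],
})
cols = ["sentiment", "joy"]
mask = df["NonEnglish"] == 1

# Blank-string fill: the columns have to be object dtype to hold '' next to ints.
blank = df.astype({c: object for c in cols})
blank.loc[mask, cols] = ""

# NaN fill: the columns stay numeric (float64).
nan = df.astype({c: "float64" for c in cols})
nan.loc[mask, cols] = np.nan

print(blank["sentiment"].dtype)   # object
print(nan["sentiment"].dtype)     # float64
print(nan["sentiment"].mean())    # NaN rows are skipped: (2 + 1) / 2 = 1.5

# If blanks were already written, to_numeric can turn them back into NaN.
recovered = pd.to_numeric(blank["sentiment"], errors="coerce")
print(recovered.dtype)            # float64
```

So the blank version is fine for a sheet that people only read, while the NaN version is the safer choice if the labels feed into statistics or model training later.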