I have this dataset for sentiment analysis, loading the data with this code:
url = 'https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/amazon_cells_labelled.tsv'
df = pd.read_csv(url, sep='\t', names=["Sentence", "Feeling"])
The issue is the DataFrame is getting lines with NaN, but It's just part of the whole sentence.
The Output, right now is like this:
sentence feeling
I do not like it. NaN
I give it a bad score. 0
The Output should look like:
sentence feeling
I do not like it. I give it a bad score 0
Can you help me to concatenate or load the dataset based on the scores?
CodePudding user response:
Create virtual groups before groupby
and agg
rows:
grp = df['Feeling'].notna().cumsum().shift(fill_value=0)
out = df.groupby(grp).agg({'Sentence': ' '.join, 'Feeling': 'last'})
print(out)
# Output:
Sentence Feeling
Feeling
0 I try not to adjust the volume setting to avoi... 0.0
1 Good case, Excellent value. 1.0
2 I thought Motorola made reliable products!. Ba... 1.0
3 When I got this item it was larger than I thou... 0.0
4 The mic is great. 1.0
... ... ...
996 But, it was cheap so not worth the expense or ... 0.0
997 Unfortunately, I needed them soon so i had to ... 0.0
998 The only thing that disappoint me is the infra... 0.0
999 No money back on this one. You can not answer ... 0.0
1000 It's rugged. Well this one is perfect, at the ... NaN
[1001 rows x 2 columns]