Loading Pandas DataFrame with skipped sentiment

I have this dataset for sentiment analysis, and I am loading it with this code:

import pandas as pd

url = 'https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/amazon_cells_labelled.tsv'
df = pd.read_csv(url, sep='\t', names=["Sentence", "Feeling"])

The issue is that the DataFrame contains rows with NaN in the Feeling column, but each of those rows is just one part of a longer sentence.

Right now the output looks like this:

sentence                      feeling
I do not like it.             NaN
I give it a bad score.        0

The output should look like this:

sentence                                    feeling
I do not like it. I give it a bad score.    0

Can you help me concatenate the rows, or load the dataset, based on the scores?

CodePudding user response：

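Build a group key before calling groupby, then aggregate the rows. Each non-NaN Feeling value closes a group, so notna().cumsum() numbers the groups and shift(fill_value=0) moves every labelled row into the same group as the unlabelled rows that come before it: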

grp = df['Feeling'].notna().cumsum().shift(fill_value=0)
out = df.groupby(grp).agg({'Sentence': ' '.join, 'Feeling': 'last'})
print(out)

# Output:
                                                  Sentence  Feeling
Feeling                                                            
0        I try not to adjust the volume setting to avoi...      0.0
1                              Good case, Excellent value.      1.0
2        I thought Motorola made reliable products!. Ba...      1.0
3        When I got this item it was larger than I thou...      0.0
4                                        The mic is great.      1.0
...                                                    ...      ...
996      But, it was cheap so not worth the expense or ...      0.0
997      Unfortunately, I needed them soon so i had to ...      0.0
998      The only thing that disappoint me is the infra...      0.0
999      No money back on this one. You can not answer ...      0.0
1000     It's rugged. Well this one is perfect, at the ...      NaN

[1001 rows x 2 columns]
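
To see how the group key is built, here is a minimal sketch on a small hand-made DataFrame (the sentences below are made up for illustration and are not taken from the dataset). It also shows one possible way to handle a trailing group like row 1000 above, which has NaN because no labelled row follows those sentences. Dropping that group and casting Feeling to int is an assumption about what you want, not something the data requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the last fragment of each sentence has a label,
# and the final row has no label at all.
df = pd.DataFrame({
    "Sentence": [
        "I do not like it.", "I give it a bad score.",
        "Good case.",
        "Works fine.", "Great value.",
        "It's rugged.",
    ],
    "Feeling": [np.nan, 0, 1, np.nan, 1, np.nan],
})

# notna() marks labelled rows, cumsum() counts them, and shift() delays the
# counter by one row so a labelled row stays with the fragments before it.
grp = df["Feeling"].notna().cumsum().shift(fill_value=0)

# Join the fragments of each group and keep the last non-null label.
out = df.groupby(grp).agg({"Sentence": " ".join, "Feeling": "last"})

# Assumption: unlabelled trailing fragments are not useful, so drop them
# and store the remaining labels as integers.
out = (
    out.dropna(subset=["Feeling"])
       .astype({"Feeling": int})
       .reset_index(drop=True)
)
print(out)
```

The group key for the toy data is 0, 0, 1, 2, 2, 3, so the first two rows are merged, the third stays on its own, the fourth and fifth are merged, and the last row forms an unlabelled group that gets dropped.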