Using Regex to extract words from sentences in Pandas for network analysis

I have a Pandas dataframe with a column of sentences. I want to extract every word from those sentences and create a new dataframe in which each word has its own row. Each new row should also carry the rating of the sentence the word came from.

The dataframe looks like this:

base_network
Body    Rating
0   Very satisfied  4
1   My daughter lost 2 spoons, so I adjusted them ...   5
2   It was a fiftieth birthday present for my elde...   5
3   Love the shape, shine & elegance of the candle...   5
4   Poor description of what I was buying   3
... ... ...
476 Nice quality but it is too small, description ...   3
477 Edited 6 January 2020As you will have seen, th...   3
478 I love this piece of jewelleryIt is elegant an...   5
479 The leather cord is a little stiff…but I guess...   4
480 Unfortunately the lens is too small and not ve...   1
481 rows × 2 columns

I tried to use regex to split the sentences into words and store them in a new dataframe, and then to add the matching rating, using this code:

spaces = r"\s+"

words = pd.DataFrame()
df = pd.DataFrame()

for rows in base_network:
    words = re.split(spaces, base_network['Body'])
    words['Rating'] = base_network['Rating']
    df = df.append(words)
    
df.head()

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-4ff5191a493d> in <module>()
      5 
      6 for rows in base_network:
----> 7     words = re.split(spaces, base_network['Body'])
      8     words['Rating'] = base_network['Rating']
      9     df = df.append(words)

/usr/lib/python3.7/re.py in split(pattern, string, maxsplit, flags)
    213     and the remainder of the string is returned as the final element
    214     of the list."""
--> 215     return _compile(pattern, flags).split(string, maxsplit)
    216 
    217 def findall(pattern, string, flags=0):

TypeError: expected string or bytes-like object

I have tried to convert the body column to a string type but that did not solve the problem.

User response:

Does this satisfy your needs?

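# df is the original dataframe (base_network in the question).
# Split each sentence on runs of whitespace; the pattern is interpreted as a regex.
df["Body"] = df["Body"].str.split(pat=r"\s+")

# "Explode" the list column into long format, one word per row.
# The Rating value is repeated for every word of the same sentence.
df = df.explode("Body")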
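
As for the error in the question: re.split expects a single string, but base_network['Body'] is a whole Series, hence "expected string or bytes-like object". Iterating over a dataframe with a for loop also yields its column labels, not its rows. The vectorised approach above avoids both problems. Here is a minimal self-contained sketch of it; the two sample rows and the names sample, words and long_format are made up for illustration and only the column names come from the question:

```python
import pandas as pd

# Hypothetical stand-in for base_network; only the column names match the question.
sample = pd.DataFrame(
    {
        "Body": ["Very satisfied", "Poor description of what I was buying"],
        "Rating": [4, 3],
    }
)

# Split each review on runs of whitespace; every cell becomes a list of words.
words = sample.assign(Body=sample["Body"].str.split(pat=r"\s+"))

# One row per word; the Rating is repeated for each word of the same review.
long_format = (
    words.explode("Body")
    .rename(columns={"Body": "Word"})
    .reset_index(drop=True)
)

print(long_format)
```

Renaming the column to Word and resetting the index are optional; without reset_index, each word keeps the index of the review it came from, which can be handy for tracing words back to their source row.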

Some additional thoughts

You may need to adjust the regex so that it also splits on punctuation.
Check your input data carefully. In row 477, "Edited 6 January 2020As you..." appears to be missing a space between "2020" and "As".
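
Both points could be handled roughly as follows. This is only a sketch under assumptions, not a definitive cleaning routine: the sample sentences are made up, the helper name split_into_words is hypothetical, and the word pattern only covers ASCII letters, digits and apostrophes. The missing-space heuristic (insert a space wherever a lowercase letter or digit is immediately followed by an uppercase letter) would also break up legitimate mixed-case tokens such as brand names, so check it against your data first.

```python
import re

import pandas as pd

# Zero-width position between a lowercase letter or digit and an uppercase
# letter, e.g. between "2020" and "As" in "2020As".
MISSING_SPACE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")
# A word is a run of ASCII letters, digits or apostrophes; punctuation is dropped.
WORD = re.compile(r"[A-Za-z0-9']+")


def split_into_words(text):
    """Return the words of one review as a list (hypothetical helper)."""
    repaired = MISSING_SPACE.sub(" ", text)
    return WORD.findall(repaired)


# Made-up sample rows; only the column names match the question.
sample = pd.DataFrame(
    {
        "Body": [
            "Edited 6 January 2020As you will have seen",
            "Nice quality, but too small!",
        ],
        "Rating": [3, 3],
    }
)

# Replace each sentence with its list of words, then put one word on each row.
sample["Body"] = sample["Body"].apply(split_into_words)
long_format = sample.explode("Body").reset_index(drop=True)

print(long_format)
```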