Using Regex to extract words from sentences in Pandas for network analysis

Time:11-21

I have a Pandas dataframe where I want to extract every word from the sentences in a column and create a new dataframe where each word has its own row. In addition, the original dataframe has a rating that should be added to the new rows.

The dataframe looks like this:

base_network
     Body                                               Rating
0    Very satisfied                                     4
1    My daughter lost 2 spoons, so I adjusted them ...  5
2    It was a fiftieth birthday present for my elde...  5
3    Love the shape, shine & elegance of the candle...  5
4    Poor description of what I was buying              3
...  ...                                                ...
476  Nice quality but it is too small, description ...  3
477  Edited 6 January 2020As you will have seen, th...  3
478  I love this piece of jewelleryIt is elegant an...  5
479  The leather cord is a little stiff…but I guess...  4
480  Unfortunately the lens is too small and not ve...  1
481 rows × 2 columns

I have tried to use regex to split the sentences into words and store them in a new dataframe, and then to add the matching rating to each word, using this code:

spaces = r"\s "

words = pd.DataFrame()
df = pd.DataFrame()

for rows in base_network:
    words = re.split(spaces, base_network['Body'])
    words['Rating'] = base_network['Rating']
    df = df.append(words)
    
df.head() 

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-4ff5191a493d> in <module>()
      5 
      6 for rows in base_network:
----> 7     words = re.split(spaces, base_network['Body'])
      8     words['Rating'] = base_network['Rating']
      9     df = df.append(words)

/usr/lib/python3.7/re.py in split(pattern, string, maxsplit, flags)
    213     and the remainder of the string is returned as the final element
    214     of the list."""
--> 215     return _compile(pattern, flags).split(string, maxsplit)
    216 
    217 def findall(pattern, string, flags=0):

TypeError: expected string or bytes-like object

I have tried to convert the body column to a string type but that did not solve the problem.

User response:

Does this satisfy your needs?

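# df is your dataframe (base_network in the question).
# Split each sentence on runs of whitespace; the raw string keeps
# "\s" from being read as a Python escape sequence.
df.Body = df.Body.str.split(pat=r"\s+")

# "Explode" the list column into a long format, one word per row.
# The Rating value is repeated for every word of its sentence.
# explode returns a new dataframe, so assign the result.
df = df.explode("Body")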
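The TypeError in the question comes from passing the whole Body column (a Series) to re.split, which only accepts a single string; the .str accessor applies the split to each row instead. As a minimal, self-contained sketch of the same approach, the snippet below uses a two-row stand-in for base_network. The sample rows are copied from the question, while the names Word and long_df are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the asker's base_network (two rows taken from the question)
base_network = pd.DataFrame(
    {
        "Body": ["Very satisfied", "Poor description of what I was buying"],
        "Rating": [4, 3],
    }
)

# .str.split applies the regex to every string in the column, whereas
# re.split(pattern, Series) fails because a Series is not a string.
words = base_network.assign(Word=base_network["Body"].str.split(r"\s+"))

# One row per word; Rating is repeated for each word of its sentence.
long_df = words.explode("Word")[["Word", "Rating"]].reset_index(drop=True)

print(long_df)
```

Keeping the words in a separate Word column, rather than overwriting Body, leaves the original sentences available if they are needed later.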

Some additional thoughts:
  • You may need to adjust the regex so that it also splits at punctuation; otherwise tokens such as "small," keep their trailing comma.
  • Be careful about your input data. In row 477, "Edited 6 January 2020As you..." seems to be missing a space.
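One possible way to handle punctuation is to extract the words rather than split on separators. The sketch below assumes that a "word" is a run of letters and apostrophes and that lowercasing is acceptable for the network analysis. Both are assumptions, as are the sample sentences and the names reviews, tokens and word_ratings. Note that this does not repair fused words such as "jewelleryIt" in row 478.

```python
import pandas as pd

# Illustrative data only; not taken from the real dataset
reviews = pd.DataFrame(
    {
        "Body": ["Nice quality, but it is too small!", "Love the shape & shine"],
        "Rating": [3, 5],
    }
)

# Keep runs of letters and apostrophes; digits, punctuation and
# whitespace all act as separators.
tokens = reviews["Body"].str.lower().str.findall(r"[a-z']+")

word_ratings = (
    reviews.assign(Word=tokens)
    .explode("Word")
    .dropna(subset=["Word"])  # sentences without any token explode to NaN
    .loc[:, ["Word", "Rating"]]
    .reset_index(drop=True)
)

print(word_ratings)
```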