Home > database >  Extract words (letters only) and words containing numbers into separate dataframe columns
Extract words (letters only) and words containing numbers into separate dataframe columns

Time:11-19

I'm trying to extract words which contain letter only into a new column, and any word which contains a number into a different column.

Desired Output:

              query    words_only contains_number
0   Nike Air Max 97  Nike Air Max              97
1     Adidas NMD-R1        Adidas          NMD-R1
2  Nike Air Max 270  Nike Air Max             270

What I've Tried:

I've seen the answer here which gets me some of the way there, but it's not exactly what I need.

How to extract words containing only letters from a text in python?

Minimum Reproducible Example:

# Import pandas library
import pandas as pd

# initialize list elements
data = ["Nike Air Max 97", "Adidas NMD R1", "Nike Air Max 270"]

# Create the pandas DataFrame with column name is provided explicitly
df = pd.DataFrame(data, columns=['query'])

# print dataframe.
print(df)

CodePudding user response:

You can use a regex with str.extractall to extract the words with and without digits separately, then groupby.agg to join them separately:

df[['words_only', 'contains_number']] = (df['query']
 .str.extractall(r'(\S*\d\S*)|([^\s\d] )') # order is important
 .groupby(level=0).agg(lambda s: ' '.join(s.dropna()))
 .loc[:, ::-1] # invert 2 columns
)

Output:

              query    words_only contains_number
0   Nike Air Max 97  Nike Air Max              97
1     Adidas NMD-R1        Adidas          NMD-R1
2  Nike Air Max 270  Nike Air Max             270

regex demo

  • Related