python pandas split string based on length condition

I have the following challenge and would appreciate your help. Suppose you have a DataFrame with multiple columns; here I focus on the one that matters, Name.


import pandas as pd

df = pd.DataFrame({"Name": ["This is a long string", "This is an even longer string", "This is the longest string"]})


                        Name
0       This is a long string
1  This is an even longer string
2  This is the longest string

The Name column may contain strings of at most 10 characters. If a string violates that rule, it should be split into substrings that are spread over additional columns, each of which obeys the same 10-character limit.

Question: How can I split the Name column so that the result looks like this?

Name            Name1            Name2       Name3
"This is a"     "long string"
"This is an"    "even"           "longer"    "String"

I tried several approaches, but without success.

I would already be happy if you could help me split the Name column in two once the 10-character limit is reached: the first column holds the leading part of at most 10 characters, and the second column holds the rest of the string, i.e.

Name               Name1
"This is a"        "long string"
"This is an"       "even longer string"
"This is"          "the longest string"

CodePudding user response：

You can use textwrap.wrap. Note, however, that it counts spaces as characters, so "long string" has length 11, not 10.

import textwrap
df.Name.apply(lambda x: textwrap.wrap(x, 10)).agg(pd.Series)

If you want two separate columns, try:

new_df = df.Name.apply(lambda x: textwrap.wrap(x, 10))
df['name1'] = new_df.apply(lambda x: x[0])
df['name2'] = new_df.apply(lambda x: ' '.join(x[1:]))
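
The two-column variant above can be sketched end to end as follows. This is a minimal, self-contained sketch rather than a definitive implementation: the names pieces, two_cols and longest_piece are made up for illustration, and the sample strings are the ones from the question's printed frame.

```python
import textwrap

import pandas as pd

df = pd.DataFrame({"Name": ["This is a long string",
                            "This is an even longer string",
                            "This is the longest string"]})

# textwrap.wrap breaks on whitespace; with the default settings no piece is
# longer than `width` characters, and spaces count toward that width.
pieces = df["Name"].apply(lambda text: textwrap.wrap(text, width=10))

# First piece in one column, all remaining pieces re-joined in a second one.
# The `if parts` guard covers empty strings, for which wrap returns [].
two_cols = pd.DataFrame({
    "Name": pieces.apply(lambda parts: parts[0] if parts else ""),
    "Name1": pieces.apply(lambda parts: " ".join(parts[1:])),
})

# Length of the longest piece in each row, as a quick check of the limit.
longest_piece = pieces.apply(lambda parts: max((len(p) for p in parts), default=0))
print(two_cols)
print(longest_piece)
```

Keeping the wrapped lists in their own Series means the text is wrapped only once, and the length check can be run before any columns are written back to df.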

CodePudding user response：

You can use a regex for this. The pattern r'[\w\s]{1,10}' matches runs of 1 to 10 characters that are letters, digits, underscores, or whitespace. Note that it cuts at a fixed width, so words can be split in the middle.

import pandas as pd

df=pd.DataFrame({"Name":["This is a long string", "This is an even longer string", "This should be the longest string of this collection"]})
max_len = df['Name'].str.len().max()
cols = [f'Name{i + 1}' for i in range(int(max_len / 10) + 1)]  # create column names
df[cols] = df['Name'].str.findall(r'[\w\s]{1,10}').agg(pd.Series).fillna('')
df

Output:

	Name	Name1	Name2	Name3	Name4	Name5	Name6
0	This is a long string	This is a	long strin	g
1	This is an even longer string	This is an	even long	er string
2	This should be the longest string of this coll...	This shoul	d be the l	ongest str	ing of thi	s collecti	on
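
Deriving the number of columns from the longest string can disagree with the number of pieces the regex actually returns, for example when the longest string is exactly a multiple of 10 characters long. A hedged alternative is to build the frame from the pieces first and name the columns afterwards. The names chunks, parts and result below are illustrative only, and the sketch assumes the strings contain nothing but word characters and whitespace (anything else would be dropped by the pattern).

```python
import pandas as pd

df = pd.DataFrame({"Name": ["This is a long string",
                            "This is an even longer string",
                            "This should be the longest string of this collection"]})

# One list of fixed-width pieces (up to 10 characters each) per row.
chunks = df["Name"].str.findall(r"[\w\s]{1,10}")

# Build the wide frame from the lists; shorter rows are padded, then blanked.
parts = pd.DataFrame(chunks.tolist(), index=df.index).fillna("")

# Name the columns after the fact, so the count always matches the data.
parts.columns = [f"Name{i + 1}" for i in range(parts.shape[1])]

result = df.join(parts)
print(result)
```

Because the pieces are plain slices of the original text, concatenating a row's pieces gives back the original string, which makes the split easy to verify.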

CodePudding user response：

# Note: this splits after the first two words, not at a character limit
words = df['Name'].str.split(' ')
df['Name1'] = words.str[:2].str.join(' ')
df['Name2'] = words.str[2:].str.join(' ')

CodePudding user response：

You can split each sentence using the built-in textwrap.wrap function, feed the resulting list to a DataFrame, fill the NaNs with the empty string, and rename the columns as you wish using DataFrame.rename:

import pandas as pd
from textwrap import wrap

df = pd.DataFrame({"Name":["This is a long string", "This an even longer string", "This is the longest string"]})

res = (
    pd.DataFrame([wrap(text, 10) for text in df['Name']])
      .fillna('')
      .rename(columns=lambda col_idx: f'Name{col_idx + 1}')
)

Output:

>>> res

       Name1 Name2    Name3   Name4
0  This is a  long   string        
1    This an  even   longer  string
2    This is   the  longest  string
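
One detail worth knowing for a hard 10-character limit: by default textwrap.wrap also cuts words that are longer than the width, while break_long_words=False keeps them whole and lets that piece exceed the limit. The sketch below illustrates both behaviours; the helper to_columns and the sample containing an overlong word are hypothetical and not part of the question's data.

```python
from textwrap import wrap

import pandas as pd

# Hypothetical sample: the first entry contains a 20-character word.
names = pd.Series(["Supercalifragilistic name", "This is a long string"])


def to_columns(series, width=10, **wrap_kwargs):
    """Wrap each string and spread the pieces over Name1, Name2, ... columns."""
    pieces = pd.DataFrame(
        [wrap(text, width, **wrap_kwargs) for text in series],
        index=series.index,  # keep row alignment with the input
    ).fillna("")
    return pieces.rename(columns=lambda i: f"Name{i + 1}")


strict = to_columns(names)                         # overlong words are cut at the limit
whole = to_columns(names, break_long_words=False)  # overlong words stay intact
print(strict)
print(whole)
```

Which variant fits depends on whether the limit is a strict storage constraint (use the default) or a readability guideline (keep long words whole).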

CodePudding user response：

Try this. It does not count spaces toward the length, which I believe was the intention.

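# one row per word, keeping the original row index
s = df['Name'].str.split().explode()

# running total of word lengths per row, binned into groups of 10
(s.groupby([pd.Grouper(level=0), s.str.len().groupby(level=0).cumsum().floordiv(10)])
 .agg(' '.join)
 .unstack()
 .rename(lambda x: 'Name{}'.format(x + 1), axis=1))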

Output:

Name        Name1        Name2   Name3
0       This is a  long string     NaN
1      This is an  even longer  string
2     This is the      longest  string
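
Binning on the running total works for the sample data, but for some inputs it can put more than 10 letters into one group: word lengths 5, 9, 5 give running totals 5, 14, 19, so the last two words share a group of 14 letters. If that matters, a plain greedy packer is one alternative. The function pack_words below is a hypothetical helper, sketched under the assumption that spaces should not count and that a single word longer than the limit is kept whole.

```python
import pandas as pd


def pack_words(text, limit=10):
    """Greedily group words so each group has at most `limit` non-space characters.

    A single word longer than `limit` is kept whole in its own group.
    """
    groups, current, used = [], [], 0
    for word in text.split():
        # Start a new group when the next word would exceed the limit.
        if current and used + len(word) > limit:
            groups.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += len(word)
    if current:
        groups.append(" ".join(current))
    return groups


df = pd.DataFrame({"Name": ["This is a long string",
                            "This is an even longer string",
                            "This is the longest string"]})

result = pd.DataFrame([pack_words(text) for text in df["Name"]], index=df.index)
result.columns = [f"Name{i + 1}" for i in range(result.shape[1])]
print(result)
```

On the three sample strings this produces the same groups as the output shown above; rows with fewer groups are left with missing values, which can be blanked with fillna('') if preferred.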