I have the following problem and would appreciate your help. Suppose you have a DataFrame with multiple columns. Here I focus on the relevant column (Name):
df = pd.DataFrame({"Name": ["This is a long string", "This is an even longer string", "This is the longest string"]})
Name
0 This is a long string
1 This is an even longer string
2 This is the longest string
The Name column may contain strings of at most 10 characters. If a string violates this rule, it should be split into substrings that are expanded into additional columns, each subject to the same 10-character limit.
Question: How can I split the Name column so that the result looks like this?
Name Name1 Name2 Name3
"This is a" "long string"
"This is an" "even" "longer" "String"
I tried multiple approaches, however without success.
I'd already be happy if you could help me split the Name column into two parts once a string length of 10 is reached: the first column containing a string of at most 10 characters and the second column containing the remainder, i.e.
Name Name1
"This is a" "longer string"
"This is an" "even longer string"
"This is" "the longest string"
CodePudding user response:
You can use textwrap.wrap. Note, however, that it counts spaces as characters, so "long string" has length 11, not 10.
import textwrap
df.Name.apply(lambda x: textwrap.wrap(x, 10)).agg(pd.Series)
If you want to have two separate columns, try:
new_df = df.Name.apply(lambda x: textwrap.wrap(x, 10))
df['name1'] = new_df.apply(lambda x: x[0])
df['name2'] = new_df.apply(lambda x: ' '.join(x[1:]))
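To make the width rule concrete, here is a minimal sketch of how textwrap.wrap treats one of the question's strings at width 10. The word "extraordinarily" is my own sample, added only to show what happens to a single word longer than the limit.

```python
import textwrap

# Spaces inside a piece count toward the width, so "long string" (11 characters) is split.
pieces = textwrap.wrap("This is a long string", 10)
print(pieces)  # ['This is a', 'long', 'string']

# By default, a single word longer than the width is broken to fit.
broken = textwrap.wrap("extraordinarily", 10)
print(broken)  # ['extraordin', 'arily']

# break_long_words=False keeps such a word whole, so that piece exceeds the width.
kept = textwrap.wrap("extraordinarily", 10, break_long_words=False)
print(kept)  # ['extraordinarily']
```

Which behaviour is preferable depends on whether the 10-character limit is a hard constraint or whether whole words matter more.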
CodePudding user response:
You can use regex for this. The regex string r'[\w\s]{1,10}' matches a run of 1 to 10 characters, each of which is a letter, digit, underscore, or whitespace character.
import pandas as pd
df=pd.DataFrame({"Name":["This is a long string", "This is an even longer string", "This should be the longest string of this collection"]})
max_len = df['Name'].str.len().max()
cols = [f'Name{i + 1}' for i in range(int(max_len / 10) + 1)]  # create column names
df[cols] = df['Name'].str.findall(r'[\w\s]{1,10}').agg(pd.Series).fillna('')
df
Output:
| | Name | Name1 | Name2 | Name3 | Name4 | Name5 | Name6 |
|---|---|---|---|---|---|---|---|
| 0 | This is a long string | This is a | long strin | g | | | |
| 1 | This is an even longer string | This is an | even long | er string | | | |
| 2 | This should be the longest string of this coll... | This shoul | d be the l | ongest str | ing of thi | s collecti | on |
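The regex cuts fixed-width chunks, so words can be split in the middle. As a possible variation (a sketch, not the answerer's code), the column names can be derived from the number of chunks actually produced instead of being estimated from the longest string, which avoids a mismatch between the number of names and the number of chunks. The names chunks, parts and result are my own.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["This is a long string", "This is an even longer string"]})

# Fixed-width chunks of up to 10 word/whitespace characters (may cut inside a word).
chunks = df["Name"].str.findall(r"[\w\s]{1,10}")

# One column per chunk position; shorter rows are padded with empty strings.
parts = pd.DataFrame(chunks.tolist(), index=df.index).fillna("")
parts.columns = [f"Name{i + 1}" for i in range(parts.shape[1])]

result = df.join(parts)
print(result)
```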
CodePudding user response:
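# Split on spaces: the first two words go to Name1, the rest to Name2 (joined back into strings)
words = df['Name'].str.split(' ')
df['Name1'] = words.str[:2].str.join(' ')
df['Name2'] = words.str[2:].str.join(' ')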
CodePudding user response:
You can split each sentence using the built-in textwrap.wrap function, feed the resulting lists to a DataFrame, fill the NaNs with empty strings, and rename the columns as you wish using DataFrame.rename:
from textwrap import wrap
import pandas as pd
df = pd.DataFrame({"Name":["This is a long string", "This an even longer string", "This is the longest string"]})
res = (
    pd.DataFrame([wrap(text, 10) for text in df['Name']])
    .fillna('')
    .rename(columns=lambda col_idx: f'Name{col_idx + 1}')  # 0-based position -> Name1, Name2, ...
)
Output:
>>> res
       Name1 Name2    Name3   Name4
0  This is a  long   string
1    This an  even   longer  string
2    This is   the  longest  string
CodePudding user response:
Try this. This does not count spaces as part of the length, which I believe was the intention
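# One row per word, keeping the original row index
s = df['Name'].str.split().explode()
# Bucket words by the running total of their lengths (spaces excluded), in steps of 10
bucket = s.str.len().groupby(level=0).cumsum().floordiv(10)
(s.groupby([pd.Grouper(level=0), bucket])
  .agg(' '.join)
  .unstack()
  .rename(lambda x: 'Name{}'.format(x + 1), axis=1))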
Output:
Name        Name1        Name2   Name3
0       This is a  long string     NaN
1      This is an  even longer  string
2     This is the      longest  string
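Note that the bucketing above assigns each word by its running total, so a bucket is not strictly capped at 10 letters (word lengths 5, 9, 3 give running totals 5, 14, 17, which puts the last two words, 12 letters, in the same bucket). If a strict cap is wanted, one alternative is greedy packing in plain Python. This is only a sketch: pack_words and its limit parameter are hypothetical names of my own, and a single word longer than the limit is kept whole in its own group.

```python
import pandas as pd

def pack_words(text, limit=10):
    """Greedily group words so the letters per group (spaces excluded) total at most `limit`."""
    groups, current, used = [], [], 0
    for word in text.split():
        # Start a new group when adding this word would exceed the limit.
        if current and used + len(word) > limit:
            groups.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += len(word)
    if current:
        groups.append(" ".join(current))
    return groups

df = pd.DataFrame({"Name": ["This is a long string",
                            "This is an even longer string",
                            "This is the longest string"]})

# One column per group; rows with fewer groups are padded with empty strings.
res = pd.DataFrame([pack_words(t) for t in df["Name"]], index=df.index).fillna("")
res.columns = [f"Name{i + 1}" for i in range(res.shape[1])]
print(res)
```

For the three sample strings this happens to produce the same grouping as the output shown above.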