Duplicating a row in a DataFrame and slicing a string value

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2]})

I want to check whether a string in the 'Name' column is longer than 4 characters. If it is, I want to duplicate the entire row and split/slice the Name string so that I get the following output:

df = pd.DataFrame({
"Name" : ["Foo", "Some", "String", "Bar"], 
"value1":[1, 2, 2, 3], 
"value2":[0, 1, 1, 2]})

CodePudding user response：

One option is to insert a space after the 4th character, then split on whitespace and explode:

out = (df.assign(Name=(df['Name'].str[:4] + ' ' + df['Name'].str[4:]).str.split())
       .explode('Name').reset_index(drop=True))

Output:

     Name  value1  value2
0     Foo       1       0
1    Some       2       1
2  String       2       1
3     Bar       3       2
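
The space-and-split approach above also splits any name that already contains whitespace. As a rough sketch of the same idea without going through a temporary space, the pieces can be built as a list directly. The helper `split_at` and its parameter `n` are hypothetical names introduced here for illustration, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2],
})

def split_at(name, n=4):
    """Hypothetical helper: [head, tail] if name is longer than n, else [name]."""
    return [name[:n], name[n:]] if len(name) > n else [name]

out = (
    df.assign(Name=df["Name"].map(split_at))  # each cell becomes a list of pieces
      .explode("Name")                        # one row per piece, other columns repeated
      .reset_index(drop=True)
)
print(out)
```

Here the split position is fixed at 4 characters, which matches the example data but is only an assumption about the intended rule.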

CodePudding user response：

First insert a space at each camel-case boundary (assuming the name contains only alphabetical characters), then split on whitespace and explode the DataFrame.

Altogether this would be:

import re

df['Name'] = df['Name'].apply(lambda x: re.sub('(?:([a-z])([A-Z]))', '\\1 \\2', x) if len(x) > 4 else x)
df['Name'] = df['Name'].str.split()
df = df.explode("Name").reset_index(drop=True)

Output:

     Name  value1  value2
0     Foo       1       0
1    Some       2       1
2  String       2       1
3     Bar       3       2

The separate steps are shown below:

df['Name'] = df['Name'].apply(lambda x: re.sub('(?:([a-z])([A-Z]))', '\\1 \\2', x) if len(x) > 4 else x)

Output:

>>> df
          Name  value1  value2
0          Foo       1       0
1  Some String       2       1
2          Bar       3       2

df['Name'] = df['Name'].str.split()

Output:

>>> df
             Name  value1  value2
0           [Foo]       1       0
1  [Some, String]       2       1
2           [Bar]       3       2

df.explode("Name").reset_index(drop=True)

Output:

     Name  value1  value2
0     Foo       1       0
1    Some       2       1
2  String       2       1
3     Bar       3       2
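
As a possible variation on the steps above, the camel-case boundary can be split on directly with zero-width look-arounds, which skips the intermediate space. This is only a sketch: it assumes Python 3.7 or later (where `re.split` accepts empty matches), and `split_camel`, `CAMEL_BOUNDARY` and `min_len` are hypothetical names chosen for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2],
})

# Zero-width position between a lowercase letter and the uppercase letter after it.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_camel(name, min_len=4):
    """Hypothetical helper: split on camel-case boundaries only if name is longer than min_len."""
    return CAMEL_BOUNDARY.split(name) if len(name) > min_len else [name]

out = (
    df.assign(Name=df["Name"].map(split_camel))
      .explode("Name")
      .reset_index(drop=True)
)
print(out)
```

Unlike a fixed-position slice, a name with several capitalised words (for example "FooBarBaz") yields one row per word.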

CodePudding user response：

You can use a regex to extract the chunks of your string, then explode:

(df
 .assign(Name=df['Name'].str.findall('(?:^.{,4})|(?:.+)'))
 .explode('Name')
)

This is easy to adapt to other rules. For example, to split the words on a capital letter, use '[A-Z][a-z]+'.

Output (the index is not reset, so both halves of the split row keep index 1):

     Name  value1  value2
0     Foo       1       0
1    Some       2       1
1  String       2       1
2     Bar       3       2
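
A minimal sketch of the capital-letter variant mentioned above, with `reset_index(drop=True)` added so the index runs 0 to 3 as in the question's expected output. Note that this pattern applies no length check, and a name with no match (for example an all-lowercase string) would give an empty list, which `explode` turns into NaN:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2],
})

out = (
    df.assign(Name=df["Name"].str.findall(r"[A-Z][a-z]+"))  # one chunk per capitalised word
      .explode("Name")
      .reset_index(drop=True)                               # renumber rows 0..n-1
)
print(out)
```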