I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2]})
I want to check whether a string in the 'Name' column is longer than 4 characters. If it is, I want to duplicate the entire row and split the Name string after the fourth character, so that I get the following output:
df = pd.DataFrame({
    "Name": ["Foo", "Some", "String", "Bar"],
    "value1": [1, 2, 2, 3],
    "value2": [0, 1, 1, 2]})
CodePudding user response:
One option is to insert a space after the fourth character, then split on it and explode:
out = (df.assign(Name=(df['Name'].str[:4] + ' ' + df['Name'].str[4:]).str.split())
         .explode('Name').reset_index(drop=True))
Output:
     Name  value1  value2
0     Foo       1       0
1    Some       2       1
2  String       2       1
3     Bar       3       2
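The space-joining approach above relies on `.str.split()`, so a name that already contains whitespace would be split in more places than intended. A minimal sketch that builds the list of pieces directly is shown here; `split_at` and its `width` parameter are hypothetical names introduced for illustration, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2]})

def split_at(name, width=4):
    # Hypothetical helper: return [head, tail] only when the name
    # is longer than `width`; otherwise keep it as a one-item list.
    return [name[:width], name[width:]] if len(name) > width else [name]

# Each row gets a list of pieces; explode turns every piece into its own row.
out = (df.assign(Name=df["Name"].map(split_at))
         .explode("Name")
         .reset_index(drop=True))
print(out)
```

Because the pieces are produced by slicing rather than by splitting on a separator, the result should not depend on which characters appear in the name.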
CodePudding user response:
First, split the string on camel-case boundaries (assuming the name contains only alphabetic characters), then split on whitespace and explode the DataFrame.
Altogether this would be:
import re
df['Name'] = df['Name'].apply(lambda x: re.sub('(?:([a-z])([A-Z]))', '\\1 \\2', x) if len(x) > 4 else x)
df['Name'] = df['Name'].str.split()
df = df.explode("Name").reset_index(drop=True)
Output:
     Name  value1  value2
0     Foo       1       0
1    Some       2       1
2  String       2       1
3     Bar       3       2
The separate steps are shown below:
df['Name'] = df['Name'].apply(lambda x: re.sub('(?:([a-z])([A-Z]))', '\\1 \\2', x) if len(x) > 4 else x)
Output:
>>> df
          Name  value1  value2
0          Foo       1       0
1  Some String       2       1
2          Bar       3       2
df['Name'] = df['Name'].str.split()
Output:
>>> df
             Name  value1  value2
0           [Foo]       1       0
1  [Some, String]       2       1
2           [Bar]       3       2
df.explode("Name").reset_index(drop=True)
Output:
     Name  value1  value2
0     Foo       1       0
1    Some       2       1
2  String       2       1
3     Bar       3       2
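As a variation on the steps above, the intermediate space-joined string can be skipped by splitting directly on the lowercase-to-uppercase boundary. This is only a sketch: it assumes Python 3.7 or later (where `re.split` accepts patterns that match an empty string), and the name `boundary` is introduced here for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2]})

# Zero-width position between a lowercase letter and an uppercase letter.
boundary = re.compile(r"(?<=[a-z])(?=[A-Z])")

# Only names longer than 4 characters are split; the others become one-item lists.
names = df["Name"].map(lambda x: boundary.split(x) if len(x) > 4 else [x])

out = df.assign(Name=names).explode("Name").reset_index(drop=True)
print(out)
```

Like the `re.sub` version, this splits at every camel-case boundary, so a name with three capitalized words would yield three rows.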
CodePudding user response:
You can use a regex to extract the chunks of your string, then explode:
(df
 .assign(Name=df['Name'].str.findall('(?:^.{,4})|(?:.+)'))
 .explode('Name')
)
This is easy to adapt to other rules. For example, to split the words on a capital letter, use the pattern '[A-Z][a-z]+'.
Output:
     Name  value1  value2
0     Foo       1       0
1    Some       2       1
1  String       2       1
2     Bar       3       2
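For the capital-letter rule mentioned above, a sketch of the same `findall` and `explode` chain might look like this. It assumes every name consists only of capitalized words; a name with no match would give an empty list, which `explode` turns into a missing value:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "SomeString", "Bar"],
    "value1": [1, 2, 3],
    "value2": [0, 1, 2]})

# Each match is one capital letter followed by one or more lowercase letters.
out = (df
       .assign(Name=df["Name"].str.findall(r"[A-Z][a-z]+"))
       .explode("Name")
       )
print(out)
```

As in the answer above, the original index is kept, so both pieces of "SomeString" share index 1; chaining `.reset_index(drop=True)` would renumber the rows.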