How can I split sub-strings to new rows of a dataframe?-CodePudding

I have a dataframe which contains data that is entered without whitespace between words, i.e VolvoSaabVauxhall = Volvo Saab Vauxhall. I want to separate each word and insert new rows for each word that contain same data as row they originated from -

type                year colour
Mazda               1990 Cyan
VolvoSaabVauxhall   2000 Red
Lada                1980 Black

becomes

type          year  colour
Mazda         1990  Cyan
Volvo         2000  Red
Saab          2000  Red
Vauxhall      2000  Red
Lada          1980  Black

what is the best way to achieve this without using iteration?

CodePudding user response：

you say

it's always a Capital letter preceded by a lower case letter

so a regex for it is

[A-Z][^A-Z]*

capital letter ([A-Z]), followed by zero or more (*) not-capital letters (with ^).

So we can findall such matches
Assign the type column to be the list of those findings
explode the type column

So in code:

df.type = df.type.str.findall(r"[A-Z][^A-Z]*")
df      = df.explode("type", ignore_index=True)

(if you want to be cool to write it in one line, lookup assign)

sample run:

In [436]: df
Out[436]:
                type  year colour
0              Mazda  1990   Cyan
1  VolvoSaabVauxhall  2000    Red
2               Lada  1980  Black

In [437]: df.type.str.findall(r"[A-Z][^A-Z]*")
Out[437]:
0                    [Mazda]
1    [Volvo, Saab, Vauxhall]
2                     [Lada]
Name: type, dtype: object

In [438]: df.type = df.type.str.findall(r"[A-Z][^A-Z]*")

In [439]: df
Out[439]:
                      type  year colour
0                  [Mazda]  1990   Cyan
1  [Volvo, Saab, Vauxhall]  2000    Red
2                   [Lada]  1980  Black

In [440]: df.explode("type", ignore_index=True)
Out[440]:
       type  year colour
0     Mazda  1990   Cyan
1     Volvo  2000    Red
2      Saab  2000    Red
3  Vauxhall  2000    Red
4      Lada  1980  Black

I passed ignore_index=True to explode; you can unpass and see what it does!

CodePudding user response：

You can use .str.extractall('([A-Z][^A-Z] )') to extract all the uppercase word and join it to reset dataframe

out = ((df.pop('type').str.extractall('([A-Z][^A-Z] )').droplevel(-1))
       .join(df).reset_index(drop=True).rename(columns={0: 'type'}))

print(out)

       type  year colour
0     Mazda  1990   Cyan
1     Volvo  2000    Red
2      Saab  2000    Red
3  Vauxhall  2000    Red
4      Lada  1980  Black

CodePudding user response：

Another possible solution, based on the idea of splitting multiple types into lists (pandas.Series.str.split) and exploding those lists afterwards:

(df.assign(type = df.type.str.split('(?<=[a-z])(?=[A-Z])', regex = True))
  .explode('type'))

Output:

       type  year colour
0     Mazda  1990   Cyan
1     Volvo  2000    Red
1      Saab  2000    Red
1  Vauxhall  2000    Red
2      Lada  1980  Black