I have a dataframe which contains data that is entered without whitespace between words, i.e VolvoSaabVauxhall = Volvo Saab Vauxhall. I want to separate each word and insert new rows for each word that contain same data as row they originated from -
type year colour
Mazda 1990 Cyan
VolvoSaabVauxhall 2000 Red
Lada 1980 Black
becomes
type year colour
Mazda 1990 Cyan
Volvo 2000 Red
Saab 2000 Red
Vauxhall 2000 Red
Lada 1980 Black
what is the best way to achieve this without using iteration?
CodePudding user response:
you say
it's always a Capital letter preceded by a lower case letter
so a regex for it is
[A-Z][^A-Z]*
capital letter ([A-Z]), followed by zero or more (*) not-capital letters (with ^).
- So we can
findall
such matches - Assign the
type
column to be the list of those findings - explode the type column
So in code:
df.type = df.type.str.findall(r"[A-Z][^A-Z]*")
df = df.explode("type", ignore_index=True)
(if you want to be cool to write it in one line, lookup assign
)
sample run:
In [436]: df
Out[436]:
type year colour
0 Mazda 1990 Cyan
1 VolvoSaabVauxhall 2000 Red
2 Lada 1980 Black
In [437]: df.type.str.findall(r"[A-Z][^A-Z]*")
Out[437]:
0 [Mazda]
1 [Volvo, Saab, Vauxhall]
2 [Lada]
Name: type, dtype: object
In [438]: df.type = df.type.str.findall(r"[A-Z][^A-Z]*")
In [439]: df
Out[439]:
type year colour
0 [Mazda] 1990 Cyan
1 [Volvo, Saab, Vauxhall] 2000 Red
2 [Lada] 1980 Black
In [440]: df.explode("type", ignore_index=True)
Out[440]:
type year colour
0 Mazda 1990 Cyan
1 Volvo 2000 Red
2 Saab 2000 Red
3 Vauxhall 2000 Red
4 Lada 1980 Black
I passed ignore_index=True
to explode
; you can unpass and see what it does!
CodePudding user response:
You can use .str.extractall('([A-Z][^A-Z] )')
to extract all the uppercase word and join
it to reset dataframe
out = ((df.pop('type').str.extractall('([A-Z][^A-Z] )').droplevel(-1))
.join(df).reset_index(drop=True).rename(columns={0: 'type'}))
print(out)
type year colour
0 Mazda 1990 Cyan
1 Volvo 2000 Red
2 Saab 2000 Red
3 Vauxhall 2000 Red
4 Lada 1980 Black
CodePudding user response:
Another possible solution, based on the idea of splitting multiple types into lists (pandas.Series.str.split
) and exploding those lists afterwards:
(df.assign(type = df.type.str.split('(?<=[a-z])(?=[A-Z])', regex = True))
.explode('type'))
Output:
type year colour
0 Mazda 1990 Cyan
1 Volvo 2000 Red
1 Saab 2000 Red
1 Vauxhall 2000 Red
2 Lada 1980 Black