I have a DataFrame like the following
|Animals | Type | Year |
|Penguin AVES | Omnivore | 2015 |
|Caiman REP | Carnivore | 2018 |
|Komodo.Rep | Carnivore | 2019 |
|Blue Jay.aves | Omnivore | 2015 |
|Iguana rep | Carnivore | 2020 |
I want to extract the last specific words (e.g. AVES and REP) from the values in column "Animals" and move it to the next row while keeping the values of the entire row. There are several specific words other than AVES and REP. It's not very clean (as shown by the whitespace, dot, and " " operator before the specific words). The expected new DataFrame would be like the following
| Animals | Type | Year |
| Penguin AVES | Omnivore | 2015 |
| AVES | Omnivore | 2015 |
| Caiman REP | Carnivore | 2018 |
| REP | Carnivore | 2018 |
| Komodo.Rep | Carnivore | 2019 |
| Rep | Carnivore | 2019 |
| Blue Jay.aves | Omnivore | 2015 |
| aves | Omnivore | 2015 |
| Iguana rep | Carnivore | 2020 |
| rep | Carnivore | 2020 |
I was thinking of using a negative indexing to split the string, but I got confused with the lambda function for this particular issue. Any idea how I should approach this problem? Thanks in advance.
CodePudding user response:
You can use str.extract
to get the last word ((\w )$
regex, but you can also use a specific list (?i)(aves|rep)$
if needed) and assign
it to replace the column, then concat
the updated DataFrame to the original one, and sort_index
with a stable method to interleave the rows:
out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(\w )$'))])
.sort_index(kind='stable', ignore_index=True)
)
Output:
Animals Type Year
0 Penguin AVES Omnivore 2015
1 AVES Omnivore 2015
2 Caiman REP Carnivore 2018
3 REP Carnivore 2018
4 Komodo.Rep Carnivore 2019
5 Rep Carnivore 2019
6 Blue Jay.aves Omnivore 2015
7 aves Omnivore 2015
8 Iguana rep Carnivore 2020
9 rep Carnivore 2020
alternative using stack
:
cols = df.columns.difference(['Animals']).tolist()
out = (df.assign(Word=df['Animals'].str.extract(r'(\w )$'))
.set_index(cols).stack().reset_index(cols, name='Animals')
.reset_index(drop=True)[df.columns]
)
alternative with indexing:
Duplicate all rows, modify the odd rows with the extracted word
out = df.loc[df.index.repeat(2)].reset_index(drop=True)
out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(r'(\w )$', expand=False)