I have a DataFrame like the following
Animals | Type | Year |
---|---|---|
Penguin AVES | Omnivore | 2015 |
Caiman REP | Carnivore | 2018 |
Komodo.Rep | Carnivore | 2019 |
Blue Jay.aves | Omnivore | 2015 |
Peregrine aves Falcon | Carnivore | 2016 |
Iguana rep | Carnivore | 2020 |
Rep Salamander | Carnivore | 2019 |
I want to extract the specific words (e.g. AVES and REP) from the values in column "Animals" and move it to the next row while keeping the values of the entire row. There are several specific words other than AVES and REP. It's not very clean (as shown by the whitespace, dot, and " " operator before the specific words). The expected new DataFrame would be like the following.
Animals | Type | Year |
---|---|---|
Penguin AVES | Omnivore | 2015 |
AVES | Omnivore | 2015 |
Caiman REP | Carnivore | 2018 |
REP | Carnivore | 2018 |
Komodo.Rep | Carnivore | 2019 |
Rep | Carnivore | 2019 |
Blue Jay.aves | Omnivore | 2015 |
aves | Omnivore | 2015 |
Peregrine aves Falcon | Carnivore | 2016 |
aves | Carnivore | 2016 |
Iguana rep | Carnivore | 2020 |
rep | Carnivore | 2020 |
Rep Salamander | Carnivore | 2019 |
Rep | Carnivore | 2019 |
I've successfully extracted the specific words located at the end using the following code provided by @mozway
out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(\w )$'))]) .sort_index(kind='stable', ignore_index=True) )
but I still have no clue on how to extract the specific words from the middle (refers to Peregrine aves Falcon) and the start (refers to Rep Salamander). I'm planning to use regex as I find it more flexible for my DataFrame, but I just started using Python and have no experience with regex. How should I approach this problem? Thanks in advance.
CodePudding user response:
A variation of my previous answer using a whitelist of words:
import re
words = ['aves', 'rep']
pattern = '|'.join(map(re.escape, words))
out = df.loc[df.index.repeat(2)].reset_index(drop=True)
out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(fr'\b({pattern})\b', flags=re.I, expand=False)
Output:
Animals Type Year
0 Penguin AVES Omnivore 2015
1 AVES Omnivore 2015
2 Caiman REP Carnivore 2018
3 REP Carnivore 2018
4 Komodo.Rep Carnivore 2019
5 Rep Carnivore 2019
6 Blue Jay.aves Omnivore 2015
7 aves Omnivore 2015
8 Peregrine aves Falcon Carnivore 2016
9 aves Carnivore 2016
10 Iguana rep Carnivore 2020
11 rep Carnivore 2020
12 Rep Salamander Carnivore 2019
13 Rep Carnivore 2019