Home > Net >  Extract specific words from one column and move it to the next row
Extract specific words from one column and move it to the next row

Time:01-27

I have a DataFrame like the following

Animals Type Year
Penguin AVES Omnivore 2015
Caiman REP Carnivore 2018
Komodo.Rep Carnivore 2019
Blue Jay.aves Omnivore 2015
Peregrine aves Falcon Carnivore 2016
Iguana rep Carnivore 2020
Rep Salamander Carnivore 2019

I want to extract the specific words (e.g. AVES and REP) from the values in column "Animals" and move it to the next row while keeping the values of the entire row. There are several specific words other than AVES and REP. It's not very clean (as shown by the whitespace, dot, and " " operator before the specific words). The expected new DataFrame would be like the following.

Animals Type Year
Penguin AVES Omnivore 2015
AVES Omnivore 2015
Caiman REP Carnivore 2018
REP Carnivore 2018
Komodo.Rep Carnivore 2019
Rep Carnivore 2019
Blue Jay.aves Omnivore 2015
aves Omnivore 2015
Peregrine aves Falcon Carnivore 2016
aves Carnivore 2016
Iguana rep Carnivore 2020
rep Carnivore 2020
Rep Salamander Carnivore 2019
Rep Carnivore 2019

I've successfully extracted the specific words located at the end using the following code provided by @mozway

out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(\w )$'))]) .sort_index(kind='stable', ignore_index=True) )

but I still have no clue on how to extract the specific words from the middle (refers to Peregrine aves Falcon) and the start (refers to Rep Salamander). I'm planning to use regex as I find it more flexible for my DataFrame, but I just started using Python and have no experience with regex. How should I approach this problem? Thanks in advance.

CodePudding user response:

A variation of my previous answer using a whitelist of words:

import re

words = ['aves', 'rep']

pattern = '|'.join(map(re.escape, words))

out = df.loc[df.index.repeat(2)].reset_index(drop=True)

out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(fr'\b({pattern})\b', flags=re.I, expand=False)

Output:

                  Animals       Type  Year
0            Penguin AVES   Omnivore  2015
1                    AVES   Omnivore  2015
2              Caiman REP  Carnivore  2018
3                     REP  Carnivore  2018
4              Komodo.Rep  Carnivore  2019
5                     Rep  Carnivore  2019
6           Blue Jay.aves   Omnivore  2015
7                    aves   Omnivore  2015
8   Peregrine aves Falcon  Carnivore  2016
9                    aves  Carnivore  2016
10             Iguana rep  Carnivore  2020
11                    rep  Carnivore  2020
12         Rep Salamander  Carnivore  2019
13                    Rep  Carnivore  2019

regex demo

  • Related