Home > Blockchain >  Extract last specific word/value from one column and move it to the next row
Extract last specific word/value from one column and move it to the next row

Time:01-26

I have a DataFrame like the following

|Animals        | Type         | Year |
|Penguin AVES   | Omnivore     | 2015 |
|Caiman REP     | Carnivore    | 2018 |
|Komodo.Rep     | Carnivore    | 2019 |
|Blue Jay.aves  | Omnivore     | 2015 |
|Iguana rep     | Carnivore    | 2020 |

I want to extract the last specific words (e.g. AVES and REP) from the values in column "Animals" and move it to the next row while keeping the values of the entire row. There are several specific words other than AVES and REP. It's not very clean (as shown by the whitespace, dot, and " " operator before the specific words). The expected new DataFrame would be like the following

| Animals        | Type         | Year |
| Penguin AVES   | Omnivore     | 2015 |
| AVES           | Omnivore     | 2015 |
| Caiman REP     | Carnivore    | 2018 |
| REP            | Carnivore    | 2018 |
| Komodo.Rep     | Carnivore    | 2019 |
| Rep            | Carnivore    | 2019 |
| Blue Jay.aves  | Omnivore     | 2015 |
| aves           | Omnivore     | 2015 |
| Iguana rep     | Carnivore    | 2020 |
| rep            | Carnivore    | 2020 |

I was thinking of using a negative indexing to split the string, but I got confused with the lambda function for this particular issue. Any idea how I should approach this problem? Thanks in advance.

CodePudding user response:

You can use str.extract to get the last word ((\w )$ regex, but you can also use a specific list (?i)(aves|rep)$ if needed) and assign it to replace the column, then concat the updated DataFrame to the original one, and sort_index with a stable method to interleave the rows:

out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(\w )$'))])
         .sort_index(kind='stable', ignore_index=True)
      )

Output:

         Animals       Type  Year
0   Penguin AVES   Omnivore  2015
1           AVES   Omnivore  2015
2     Caiman REP  Carnivore  2018
3            REP  Carnivore  2018
4     Komodo.Rep  Carnivore  2019
5            Rep  Carnivore  2019
6  Blue Jay.aves   Omnivore  2015
7           aves   Omnivore  2015
8     Iguana rep  Carnivore  2020
9            rep  Carnivore  2020
alternative using stack:
cols = df.columns.difference(['Animals']).tolist()

out = (df.assign(Word=df['Animals'].str.extract(r'(\w )$'))
         .set_index(cols).stack().reset_index(cols, name='Animals')
         .reset_index(drop=True)[df.columns]
      )
alternative with indexing:

Duplicate all rows, modify the odd rows with the extracted word

out = df.loc[df.index.repeat(2)].reset_index(drop=True)

out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(r'(\w )$', expand=False)
  • Related