Home > Blockchain >  how to extract specific sentence in pandas dataframe?
how to extract specific sentence in pandas dataframe?

Time:11-04

I'm trying to extract specific sentence in string column and copy&paste it to Nan rows in another column

here is the detail:

`

        KEY     WARNSIGN_DTL    EVENT_DTL   
55116   77981   Nan             1. Name: Johnson 2. Date: 09 June 2018 3. Place to be found: Wall street 4. Found in his apartment 5. Method: overdose Pentanil
56545   73565   Nan             1. Name: Hay 2. Date: 04 November 2021 3. Place to be found: Plano 75075 4. Found in her car 5. Method: hanging up his neck
56780   77844   3. Place to be found: Coppell 4. Found in his wife's apartment     1. Name: Beard 2. Date: 05 November 2021 3. Place to be found: Coppell 4. Found in his wife's house 5. shooting gun on his head

`


I want to extract 3. Place to be found: xxx 4. Found in xxx xxx string out of EVENT_DTL and put it in WARNSIGN_DTL column in which the rows with Nan(missin value).

Here is the desired output:

```
`
```
         KEY    WARNSIGN_DTL    EVENT_DTL   
55116   77981   3. Place to be found: Wall street 4. Found in his apartment 1. Name: Johnson 2. Date: 09 June 2018 3. Place to be found: Wall street 4. Found in his apartment 5. Method: overdose Pentanil
56545   73565   3. Place to be found: Plano 75075 4. Found in her car   1. Name: Hay 2. Date: 04 November 2021 3. Place to be found: Plano 75075 4. Found in her car 5. Method: hanging up his neck
56780   77844   3. Place to be found: Coppell 4. Found in his wife's house     1. Name: Beard 2. Date: 05 November 2021 3. Place to be found: Coppell 4. Found in his wife's house 5. shooting gun on his head
```
`
```

CodePudding user response:

Use str.extract():

df = pd.DataFrame({'KEY': [77981, 73565, 77844],
                   'WARNSIGN_DTL': [np.nan, np.nan, "3. Place to be found: Coppell 4. Found in his wife's apartment"],
                   'EVENT_DTL': [
                       "1. Name: Johnson 2. Date: 09 June 2018 3. Place to be found: Wall street 4. Found in his apartment 5. Method: overdose Pentanil",
                       "1. Name: Hay 2. Date: 04 November 2021 3. Place to be found: Plano 75075 4. Found in her car 5. Method: hanging up his neck",
                       "1. Name: Beard 2. Date: 05 November 2021 3. Place to be found: Coppell 4. Found in his wife's house 5. shooting gun on his head"]})

df.loc[idx, 'WARNSIGN_DTL'] = df.loc[(idx := df.WARNSIGN_DTL.isna()), 'EVENT_DTL'].str.extract(r'(3. Place to be found:. 4. Found in. )5', expand=False)
print(df)
     KEY                                       WARNSIGN_DTL                                          EVENT_DTL
0  77981  3. Place to be found: Wall street 4. Found in ...  1. Name: Johnson 2. Date: 09 June 2018 3. Plac...
1  73565  3. Place to be found: Plano 75075 4. Found in ...  1. Name: Hay 2. Date: 04 November 2021 3. Plac...
2  77844  3. Place to be found: Coppell 4. Found in his ...  1. Name: Beard 2. Date: 05 November 2021 3. Pl...

CodePudding user response:

Use the below snippet:

import re

def extract_my_string(my_str):
  """
  Func that uses regex to extract content between 3. and 5.
  """
  result = re.search('3.(.*)5.', my_str)
  return f'3.{result.group(1)}'
  


df['WARNSIGN_DTL'] = df['EVENT_DTL'].apply(lambda x: extract_my_string(x)) 
  • Related