Extracting a value from a pandas DataFrame-CodePudding

This is a follow up from my previous question. I have a pandas dataFrame with value that I would like to extract. I do not know the value since it changes, but I know its aproximate location (between the words 'tax' and 'USD'). The partial code is below

import pandas as pd
d={'75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per 
share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}

df = pd.DataFrame(d.values(), columns=['description'])
df['tax'] = df['description'].str.extract(????)
df[['tax','description']]
print(df)

CodePudding user response：

Your input is not reproducible, but I imagine that you need:

df['tax'] = pd.to_numeric(df['description'].str.extract(r'tax\s*(-?\d (?:\.\d*))\s*USD', expand=False))

Output:

                                         description    tax
0  75 shares ExD 2022-09-14 PD 2022-10-11 dividen... -10.58

CodePudding user response：

We can use Regex to solve this problem

d={'75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}
df = pd.DataFrame(d, columns=['description'])

# pattern which we want to identify
tax_pattern = r'(?<=tax)(.*)(?=USD)'

# extracting pattern
df['tax'] = df['description'].str.extract(tax_pattern)

# removing spaces
df['tax'] = df['tax'].apply(lambda x: x.strip())

# changing datatype
df['tax'] = df['tax'].astype(float)
df[['tax','description']]

Output