This is a follow up from my previous question. I have a pandas dataFrame with value that I would like to extract. I do not know the value since it changes, but I know its aproximate location (between the words 'tax' and 'USD'). The partial code is below
import pandas as pd
d={'75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per
share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}
df = pd.DataFrame(d.values(), columns=['description'])
df['tax'] = df['description'].str.extract(????)
df[['tax','description']]
print(df)
CodePudding user response:
Your input is not reproducible, but I imagine that you need:
df['tax'] = pd.to_numeric(df['description'].str.extract(r'tax\s*(-?\d (?:\.\d*))\s*USD', expand=False))
Output:
description tax
0 75 shares ExD 2022-09-14 PD 2022-10-11 dividen... -10.58
CodePudding user response:
We can use Regex to solve this problem
d={'75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}
df = pd.DataFrame(d, columns=['description'])
# pattern which we want to identify
tax_pattern = r'(?<=tax)(.*)(?=USD)'
# extracting pattern
df['tax'] = df['description'].str.extract(tax_pattern)
# removing spaces
df['tax'] = df['tax'].apply(lambda x: x.strip())
# changing datatype
df['tax'] = df['tax'].astype(float)
df[['tax','description']]
Output