original data frame:
Date | Detail |
---|---|
31/03/22 | I watch Netflix at home with my family 4 hours |
01/04/22 | I walk to the market for 3km and I spent 11.54 dollar |
02/04/22 | my dog bite me, I go to hospital, spend 29.99 dollar |
03/04/22 | I bought a game on steam 7 games spen 19.23 dollar |
result data frame:
Date | Detail | Cost |
---|---|---|
31/03/22 | I watch Netflix at home with my family 4 hours | 0 |
01/04/22 | I walk to the market for 3km and I spent 11.54 dollar | 11.54 |
02/04/22 | my dog bite me, I go to hospital, spend 29.99 dollar | 29.99 |
03/04/22 | I bought a game on steam 7 games spen 19.23 dollar | 19.23 |
My question:
If the Detail column does not contain a substring that begins with "sp..." and ends with "dollar", the value in the Cost column should be zero.
If the Detail column does contain such a substring, the value in the Cost column should be the number in the middle of it, between "sp..." and "dollar".
I tried to use a regex, but it returns the first integer found in the column, like:
| 01/04/22 | I walk to the market for 3km and I spent 11.54 dollar| 3 |
CodePudding user response:
You should be able to use a regex pattern of a form such as:
df['Cost'] = df['Detail'].str.extract(r'sp\D*([\d\.]*)\D*dollar', expand=False)
This will look for the literal string sp and then any non-digit characters after it. The capture group (denoted by the parentheses) looks for any digits or period characters, representing the dollar amount. This is what is returned to the Cost column. The final part of the pattern allows any number of non-digit characters after the dollar amount, followed by the literal string dollar.
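To see how the pattern behaves on a single string, here is a minimal sketch using Python's standard re module instead of pandas. The helper name extract_cost is only for illustration and is not part of the answer above.

```python
import re

# Same pattern as above: "sp", non-digits, captured digits/periods,
# non-digits, then the literal "dollar".
PATTERN = re.compile(r'sp\D*([\d\.]*)\D*dollar')

def extract_cost(detail):
    """Return the captured amount as a string, or None if nothing matches.

    Hypothetical helper, used here only to illustrate the pattern.
    """
    match = PATTERN.search(detail)
    return match.group(1) if match else None

# The "3" in "3km" is skipped because it comes before "sp".
print(extract_cost("I walk to the market for 3km and I spent 11.54 dollar"))
# No "sp ... dollar" substring here, so there is no match.
print(extract_cost("I watch Netflix at home with my family 4 hours"))
```

Because the match has to start at sp, a number that appears earlier in the sentence is never captured, which avoids the "first integer" problem from the question.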
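Rows which don't have a cost come back as missing values (NaN), and the extracted amounts are strings. Both can be handled by converting the column to numbers and replacing the missing values with 0:
df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce').fillna(0)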
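Putting the steps together, a self-contained sketch on the sample data from the question might look like the following. It assumes a numeric Cost column is wanted, so it uses pd.to_numeric and fillna, which is one possible way to handle the missing rows.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["31/03/22", "01/04/22", "02/04/22", "03/04/22"],
    "Detail": [
        "I watch Netflix at home with my family 4 hours",
        "I walk to the market for 3km and I spent 11.54 dollar",
        "my dog bite me, I go to hospital, spend 29.99 dollar",
        "I bought a game on steam 7 games spen 19.23 dollar",
    ],
})

# expand=False returns a Series; rows without a match are missing (NaN).
extracted = df["Detail"].str.extract(r"sp\D*([\d\.]*)\D*dollar", expand=False)

# Convert the extracted strings to floats and use 0 where nothing matched.
df["Cost"] = pd.to_numeric(extracted, errors="coerce").fillna(0)

print(df[["Date", "Cost"]])
```

With errors="coerce", anything that cannot be parsed as a number also ends up as 0 rather than raising an error, which may or may not be what you want.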
If you want to make any enhancements, I used this site to test the regex: https://regexr.com/6ir6o