I have a dataframe and am trying to extract every occurrence of "cash" together with the n characters that follow it (which contain the cash amount). I have tried JSON parsing and regex, but neither worked because the strings in this dataframe are quite inconsistent.
So for example,
sample = pd.DataFrame({'LongString': [
    "I am trying to find out how much cash 15906810\nand this needs to be consistent cash : 69105060",
    "other words that are wrong cash : 11234 and more words cash 1526\n",
]})
The resulting dataframe should then look like this:
sample_resolved = pd.DataFrame({
    'LongString': [
        "I am trying to find out how much cash 15906810\nand this needs to be consistent cash : 69105060",
        "other words that are wrong cash : 11234 and more words cash 1526\n",
    ],
    'cash_string': ["cash 15906810 cash : 69105060", "cash : 11234 cash 1526"],
})
Each row of the dataframe is inconsistent. The goal is to create a new column that holds every instance of "cash" followed by, say, the 8-10 characters after it.
Ideally this would be a single line along the lines of
df['cash_string'] = df['LongString'].str.findall('cash')
(but also includes the n characters after each 'cash' instance)
Thank you!
CodePudding user response:
In general, if there isn't a dataframe method (or combination thereof) that does what you're after, you can write a function that works on a single example and then pass it to the dataframe with series.apply(some_func).
So, a function that does what you're looking for:
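def str_after_substr(s, substr='cash', offset=5):
    # position of the first occurrence of substr
    i = s.index(substr)
    # the slice starts right after substr and spans `offset` characters
    start = i + len(substr)
    return s[start:start + offset]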
# test
str_after_substr('moneymoneycashmoneyhoney')
# create the new column values and add it to the df
df['new_column'] = df['old_column'].apply(str_after_substr)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
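Note that str_after_substr only looks at the first "cash" in the string, and s.index raises ValueError when the substring is missing. Since the question asks for every occurrence, here is a minimal sketch of a variant that collects all of them and keeps the keyword in the result. The name all_after_substr and the default offset of 10 are my own assumptions, not part of the answer above.

```python
def all_after_substr(s, substr="cash", offset=10):
    """Return each occurrence of substr plus up to `offset` characters after it."""
    results = []
    i = s.find(substr)  # str.find returns -1 instead of raising when absent
    while i != -1:
        end = i + len(substr) + offset
        results.append(s[i:end])
        # continue searching right after the current occurrence
        i = s.find(substr, i + len(substr))
    return results


print(all_after_substr("moneymoneycashmoneyhoney", offset=5))  # ['cashmoney']
print(all_after_substr("no keyword here"))                      # []
```

It can be passed to the dataframe the same way, e.g. df['LongString'].apply(all_after_substr), which gives a list per row.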
CodePudding user response:
Example
A minimal, reproducible example:
df = pd.DataFrame(["abc cash : 1590 cde cash : 6910", "fgh cash : 1890 hij cash : 3410 cash : 4510"], columns=['col1'])
df
                                          col1
0              abc cash : 1590 cde cash : 6910
1  fgh cash : 1890 hij cash : 3410 cash : 4510
Code
s = df['col1'].str.extractall(r'(cash : \d+)')[0]
s
   match
0  0        cash : 1590
   1        cash : 6910
1  0        cash : 1890
   1        cash : 3410
   2        cash : 4510
Name: 0, dtype: object
s.groupby(level=0).agg(', '.join)
0                 cash : 1590, cash : 6910
1    cash : 1890, cash : 3410, cash : 4510
Name: 0, dtype: object
Output
df.assign(col2=s.groupby(level=0).agg(', '.join))
                                          col1                                   col2
0              abc cash : 1590 cde cash : 6910               cash : 1590, cash : 6910
1  fgh cash : 1890 hij cash : 3410 cash : 4510  cash : 1890, cash : 3410, cash : 4510
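The pattern above expects exactly "cash : " before the digits, while the sample in the question mixes "cash 15906810" and "cash : 69105060". A rough sketch of a looser pattern, shown with the standard re module so it runs without a dataframe. The pattern and the helper name find_cash are assumptions on my part; adjust them to whatever variations your data really contains.

```python
import re

# Assumed pattern: "cash", optional whitespace, an optional colon,
# optional whitespace, then one or more digits.
CASH_PATTERN = r"cash\s*:?\s*\d+"


def find_cash(text):
    """Join every match of CASH_PATTERN found in text with ', '."""
    return ", ".join(re.findall(CASH_PATTERN, text))


rows = [
    "I am trying to find out how much cash 15906810\nand this needs to be consistent cash : 69105060",
    "other words that are wrong cash : 11234 and more words cash 1526\n",
]
for row in rows:
    print(find_cash(row))
# cash 15906810, cash : 69105060
# cash : 11234, cash 1526
```

On a dataframe, the same pattern should work with df['LongString'].str.findall(CASH_PATTERN).str.join(', '), which avoids the groupby step.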
CodePudding user response:
To add on to @JCThomas's answer, I'd change the str_after_substr function as below:
import re

def cash_finder(s, substr='cash', offset=10):
    # every piece after the first one follows an occurrence of substr
    ss = s.split(substr)
    cashlist = []
    for i in ss[1:]:
        # keep only the digits found in the first `offset` characters
        cashlist.append(int(''.join([x for x in i[:offset].strip() if re.match(r'\d', x)])))
    return cashlist
This gives you every cash amount in a string, and the dataframe operation then looks like this:
ddf['cashstring'] = ddf['LongString'].apply(lambda x: [{'cash':i} for i in cash_finder(x)])
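Two things to watch with cash_finder: int('') raises ValueError when no digit follows a "cash" within the window, and with offset=10 the string " : 69105060" (11 characters) is cut one digit short. A hedged sketch of a variant that skips empty results; the name cash_amounts is hypothetical, and like the original it merges all digits in the window into a single number.

```python
import re


def cash_amounts(s, substr="cash", offset=10):
    """Return the digits found within `offset` characters after each substr, as ints."""
    amounts = []
    for chunk in s.split(substr)[1:]:
        digits = "".join(re.findall(r"\d", chunk[:offset]))
        if digits:  # skip occurrences that have no digits in the window
            amounts.append(int(digits))
    return amounts


print(cash_amounts("consistent cash : 69105060"))             # [6910506], last digit cut off
print(cash_amounts("consistent cash : 69105060", offset=12))  # [69105060]
print(cash_amounts("no amount after cash here"))              # []
```

If your amounts can be preceded by " : ", a larger offset such as 12 is probably the safer default.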