Get both numbers in the brackets of a string with regex in Python

I have my 'cost_money' column like this,

0    According to different hospitals, the charging...
1    According to different hospitals, the charging...
2    According to different conditions, different h...
3    According to different hospitals, the charging...
Name: cost_money, dtype: object

Each string has some important data in brackets, which I need to extract. For example:

"According to different hospitals, the charging standard is inconsistent, the city's three hospitals is about (1000-4000 yuan)"

My attempt was:

import regex as re

full_df['cost_money'] = full_df.cost_money.str.extract('\((.*?)\)')
full_df

But this gives an error, which I guess is about string-to-int conversion. Each value is a whole string, so any single character I print from it is also a string. Besides that, I don't need the word 'yuan' from the brackets, so I tried to extract the numbers directly:

import regex as re
df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
full_df['cost_money']

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <module>
      1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
      3 full_df['cost_money']

File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4323 def apply(
   4324     self,
   4325     func: AggFuncType,
   (...)
   4328     **kwargs,
   4329 ) -> DataFrame | Series:
   4330     """
   4331     Invoke function on values of Series.
   4332 
   (...)
   4431     dtype: float64
   4432     """
-> 4433     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1082, in SeriesApply.apply(self)
   1078 if isinstance(self.f, str):
   1079     # if we are a string, try to dispatch
   1080     return self.apply_str()
-> 1082 return self.apply_standard()

File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1137, in SeriesApply.apply_standard(self)
   1131         values = obj.astype(object)._values
   1132         # error: Argument 2 to "map_infer" has incompatible type
   1133         # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
   1134         # Dict[Hashable, Union[Union[Callable[..., Any], str],
   1135         # List[Union[Callable[..., Any], str]]]]]"; expected
   1136         # "Callable[[Any], Any]"
-> 1137         mapped = lib.map_infer(
   1138             values,
   1139             f,  # type: ignore[arg-type]
   1140             convert=self.convert_dtype,
   1141         )
   1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1144     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1145     #  See also GH#25959 regarding EA support
   1146     return obj._constructor_expanddim(list(mapped), index=obj.index)

File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\_libs\lib.pyx:2870, in pandas._libs.lib.map_infer()

c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <lambda>(x)
      1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
      3 full_df['cost_money']

File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\regex\regex.py:338, in findall(pattern, string, flags, pos, endpos, overlapped, concurrent, timeout, ignore_unused, **kwargs)
    333 """Return a list of all matches in the string. The matches may be overlapped
    334 if overlapped is True. If one or more groups are present in the pattern,
    335 return a list of groups; this will be a list of tuples if the pattern has
    336 more than one group. Empty matches are included in the result."""
    337 pat = _compile(pattern, flags, ignore_unused, kwargs, True)
--> 338 return pat.findall(string, pos, endpos, overlapped, concurrent, timeout)

TypeError: expected string or buffer

I tried the same thing using findall but most posts mentioned using extract so I stuck to that.

My desired output:

[5000, 8000]
[6000, 7990]
... and so on

Can somebody please help me out? Thanks

CodePudding user response：

You can use (\d*-\d*) to match the number part and then split on -.

df['money'] = df['cost_money'].str.extract(r'\((\d*-\d*).*\)')
df['money'] = df['money'].str.split('-')

Or use (\d*)[^\d]*(\d*) to match the two number parts separately:

df['money'] = df['cost_money'].str.extract(r'\((\d*)[^\d]*(\d*).*\)').values.tolist()
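
Both variants above return the numbers as strings, while the requested output looks numeric. A minimal sketch of one way to get numeric pairs, assuming every row contains a bracketed range; the sample rows and the `bounds` name are made up for illustration:

```python
import pandas as pd

# Hypothetical sample rows modelled on the question's column.
df = pd.DataFrame({'cost_money': [
    "the city's three hospitals is about (1000-4000 yuan)",
    "roughly (5000-8000 yuan)",
]})

# Two capture groups: the lower and upper bound inside the brackets.
bounds = df['cost_money'].str.extract(r'\((\d+)\s*-\s*(\d+)')

# Convert each extracted column from digit strings to numbers.
bounds = bounds.apply(pd.to_numeric)

# One [low, high] pair per row.
df['money'] = bounds.values.tolist()
print(df['money'].tolist())
```

If some rows have no bracketed range, `str.extract` yields NaN for them, so the converted values would come back as floats rather than integers.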

CodePudding user response：

I believe your regex was incorrect. Here are alternatives.

Example input:

import pandas as pd

df = pd.DataFrame({'cost_money': ['random text (123-456 yuans)',
                                  'other example (789 yuans)']})

Option A:

df['cost_money'].str.extract(r'\((\d+-\d+)', expand=False)

Option B (allow single cost):

df['cost_money'].str.extract(r'\((\d+(?:-\d+)?)', expand=False)

Option C (all numbers after the first '(' as a list):

df['cost_money'].str.split('[()]').str[1].str.findall(r'(\d+)')

Output (assigned as new columns):

                    cost_money        A        B           C
0  random text (123-456 yuans)  123-456  123-456  [123, 456]
1    other example (789 yuans)      NaN      789       [789]
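
One plausible cause of the "TypeError: expected string or buffer" in the question is a non-string value (such as NaN) reaching re.findall through apply; that is an assumption, since the full column is not shown. The .str methods pass missing values through instead of raising. A minimal sketch along the lines of Option C, with a NaN row added to the example input and the numbers converted to int; the `inside` and `numbers` names are illustrative:

```python
import numpy as np
import pandas as pd

# Example input from above plus a missing value (assumed, for illustration).
df = pd.DataFrame({'cost_money': ['random text (123-456 yuans)',
                                  'other example (789 yuans)',
                                  np.nan]})

# Text between the first '(' and the next ')'; NaN where nothing matches.
inside = df['cost_money'].str.extract(r'\(([^)]*)\)', expand=False)

# All digit runs in that text; missing rows stay NaN instead of raising.
numbers = inside.str.findall(r'\d+')

# Convert each list of digit strings to ints, leaving missing rows untouched.
df['C'] = numbers.apply(
    lambda xs: [int(x) for x in xs] if isinstance(xs, list) else xs
)
print(df['C'].tolist())
```

Rows that still hold NaN afterwards can be dropped or filled, depending on what the later analysis needs.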