Currently I'm struggling to find an elegant way to program a problem I'm facing.
I have a large dataframe, containing a column with Department names:
INPUT
demo = pd.DataFrame(
{'Department':
['AA','AA1','BB team 1','AA but also a bit of nonsense',
'BB','AA','department BB','Complete nonsense']}
)
Department
AA
AA1
BB team 1
AA but also a bit of nonsense
BB
AA
department BB
Complete nonsense
I also have a list of known departments:
known_departments = ['AA','BB']
As one can see, there are three type of departments in the list:
- Departments that are exact matches with known departments, these should stay untouched.
- Departments that are variants of the known departments. That is: it contains the department name, but some other text is present. These should be mapped to the original known departments.
- Complete nonsense departments, with no match whatsoever with the known departments, these should stay untouched as well.
Desired output
Department Department_simplified
AA AA
AA1 AA
BB team 1 BB
AA but also a bit of nonsense AA
BB BB
AA AA
department BB BB
Complete nonsense Complete nonsense
UPDATE
Thanks Chris and sophocles for your answers. While looking more elegant by using str.extract
and str.findall
, performance-wise the apply function outperforms both on my actual df:
Solution %%timeit -n20
Chris 1.65s ± 311 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
sophocles 1.14s ± 294 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
Paul 680 ms ± 174 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
CodePudding user response:
You can use str.extract
here, and build a list of deparments separated by |
(or) as the pattern.
import pandas as pd
known_departments = ['AA','BB']
demo = pd.DataFrame(
{'Department':
['AA','AA1','BB team 1','AA but also a bit of nonsense',
'BB','AA','department BB','Complete nonsense']}
)
demo['Department_simplified'] = demo.Department.str.extract(f"({'|'.join(known_departments)})")
# If you need to fill nulls with the original dept name
demo['Department_simplified'].fillna(demo['Department'], inplace=True)
print(demo)
Output
Department Department_simplified
0 AA AA
1 AA1 AA
2 BB team 1 BB
3 AA but also a bit of nonsense AA
4 BB BB
5 AA AA
6 department BB BB
7 Complete nonsense Complete nonsense
CodePudding user response:
You can first use str.findall
to return the matching substrings of your Department column with your list elements (known_departments). For the ones that nothing is returned, you just use the values in Department, as there's no match whatsoever
demo['Department_simplified'] = demo['Department']\
.str.findall('|'.join(known_departments)).str.join('')
demo['Department_simplified'] = np.where(
demo['Department_simplified'].eq(''),demo['Department'],demo['Department_simplified'])
Prints:
Department Department_simplified
0 AA AA
1 AA1 AA
2 BB team 1 BB
3 AA but also a bit of nonsense AA
4 BB BB
5 AA AA
6 department BB BB
7 Complete nonsense Complete nonsense
CodePudding user response:
I currently use apply in combination with a function to get my result.
Code:
def item_in_string(string, list_of_items):
for item in list_of_items:
if item in string:
return item
return string
demo['Department_simplified'] = demo.Department.apply(
lambda x: item_in_string(x, known_departments) if isinstance(x, str) else x)
However, this does not feel very efficient, nor pythonic.
I was wondering if someone has a better way to program this problem.