Need help extracting multiple numbers from a column in a dataframe and remove duplicates and separate them with a comma.
Col1 |
---|
Abcde 10 hijk20 |
wewrw5 gagdhdh5 |
Mnbjgkh10,20, 30 |
Expected output;
Col2 |
---|
10,20 |
5 |
10,20,30 |
CodePudding user response:
Try this:
punctuations = ['!','(',')','-','[',']','{','}',';',':','"','<','>','.','/','?']
for index, row in dataframe.iterrows():
content = dataframe.iloc[index:index, column_index]
for p in punctuations:
content.replace(p, " ")
only_numbers = re.sub("[^0-9]", " ", content)
only_numbers.strip()
numbers_found = only_numbers.split(" ")
no_duplicates = list(set(numbers_found))
comma_separated = ",".join(no_duplicates)
dataframe.iloc[index:index, column_index] = comma_separated
CodePudding user response:
Does this answer your question? findall()
from the re
module with the regular expression r'\d '
returns a list containing all non-overlapping matches of one or more consecutive decimal digits in the string. The built-in set()
removes any duplicates from that list and applying the sorted()
built-in returns a sorted list of the elements in the set. We also make use of numpy.vectorize
as it is faster than apply()
from Pandas for this particular application (at least on my system) though I have shown how to use apply()
as well.
Method 1
import pandas as pd
import numpy as np
import re
# compile RE - matches one or more decimal digits
p = re.compile(r'\d ')
# data
d = {'col1': ['Abcde 10 hijk20', 'wewrw5 gagdhdh5', 'Mnbjgkh10,20, 30'],
'col2': [''] * 3}
# DataFrame
df = pd.DataFrame(d)
# modify col2 based on col1
df['col2'] = np.vectorize(
lambda y: ','.join(sorted(set(p.findall(y)))),
)(df['col1'])
print(df)
Output
col1 col2
0 Abcde 10 hijk20 10,20
1 wewrw5 gagdhdh5 5
2 Mnbjgkh10,20, 30 10,20,30
If you can only use pandas
and not numpy
, you can do
Method 2
# modify col2 based on col1
df['col2'] = df.apply(
lambda x: ','.join(sorted(set(p.findall(x['col1'])))) , axis=1)
or even Method 3
# modify col2 based on col1
for index, row in df.iterrows():
row['col2'] = ','.join(sorted(set(p.findall(row['col1']))))
Efficiency
On my system, vectorize
(method 1) is fastest, method 3 is second fastest and method 2 is the slowest.
# Method 1
82.9 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Method 2
399 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Method 3
117 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)