Extract number, remove duplicate and separate with a comma in Python-CodePudding

Need help extracting multiple numbers from a column in a dataframe and remove duplicates and separate them with a comma.

Col1
Abcde 10 hijk20
wewrw5 gagdhdh5
Mnbjgkh10,20, 30

Expected output;

Col2
10,20
5
10,20,30

CodePudding user response：

Try this:

punctuations = ['!','(',')','-','[',']','{','}',';',':','"','<','>','.','/','?']
for index, row in dataframe.iterrows():
    content = dataframe.iloc[index:index, column_index]
    for p in punctuations:
        content.replace(p, " ")
    only_numbers = re.sub("[^0-9]", " ", content)
    only_numbers.strip()
    numbers_found = only_numbers.split(" ")
    no_duplicates = list(set(numbers_found))
    comma_separated = ",".join(no_duplicates)
    dataframe.iloc[index:index, column_index] = comma_separated

CodePudding user response：

Does this answer your question? findall() from the re module with the regular expression r'\d ' returns a list containing all non-overlapping matches of one or more consecutive decimal digits in the string. The built-in set() removes any duplicates from that list and applying the sorted() built-in returns a sorted list of the elements in the set. We also make use of numpy.vectorize as it is faster than apply() from Pandas for this particular application (at least on my system) though I have shown how to use apply() as well.

Method 1

import pandas as pd
import numpy as np 
import re

# compile RE - matches one or more decimal digits
p = re.compile(r'\d ')

# data
d = {'col1': ['Abcde 10 hijk20', 'wewrw5 gagdhdh5', 'Mnbjgkh10,20, 30'],
     'col2': [''] * 3}

# DataFrame
df = pd.DataFrame(d)

# modify col2 based on col1
df['col2'] = np.vectorize(
    lambda y:  ','.join(sorted(set(p.findall(y)))),
    )(df['col1'])

print(df)

Output

               col1      col2
0   Abcde 10 hijk20     10,20
1   wewrw5 gagdhdh5         5
2  Mnbjgkh10,20, 30  10,20,30

If you can only use pandas and not numpy, you can do

Method 2

# modify col2 based on col1
df['col2'] = df.apply(
    lambda x: ','.join(sorted(set(p.findall(x['col1'])))) , axis=1)

or even Method 3

# modify col2 based on col1
for index, row in df.iterrows():
    row['col2'] = ','.join(sorted(set(p.findall(row['col1']))))

Efficiency

On my system, vectorize (method 1) is fastest, method 3 is second fastest and method 2 is the slowest.

# Method 1
82.9 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Method 2
399 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Method 3
117 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)