Pandas Split a Column by Multiple delimiters into same column-CodePudding

I have a pandas dataframe of multiple columns. But the Column of interest say is Col A which looks like :

      dfInput

               A 

       12 - ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns); 891- 1o19jsksuisolslskosBN
       205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu
       22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi

So my goal is to split the strings into 2 parts one by the first '-' before/after the semicolon and then by the rest where the first column will have all the numbers on the left of first '-' and then the other column will have rest . So my output dataframe should look like:

        **dfOutDesired**

           B                                          C
         12 891    ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns) 1o19jsksuisolslskosBN
         205 31    lksiosslsoujhs%ysus(0.33mismsn)  jsks,msnu
         22 14 67181  0204 - 2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi

So till now I have tried the following based on earlier posts I saw:

      df['C'] =    df["A"].str.replace(r'\w \s*-\s*', '', regex=True)

      df['A'] = df['A'].str.replace(' ', '')
      df['A'] = df['A'].str.replace('-', ' - ')

      df["B"] = df["A"].str.replace(r'\s*;\s*\S ; \S ; \s*', ' ', regex=True).str.strip()

So I am getting partially correct output for Col C (where say 0204 in third row is not picked up) , but for Col B I am just getting the same Col A as follows:

       **mydfWrong**

                                B     
       12 - ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns); 891- 1o19jsksuisolslskosBN
       205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu
       22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi

                               C

            ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns) 1o19jsksuisolslskosBN
            lksiosslsoujhs%ysus(0.33mismsn)  jsks,msnu
            2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi

Not sure how to use the regex properly in this case. So any help will be immensely appreciated. Thanks in advance.

CodePudding user response：

You can do that with something like the following.

The get_nums function splits on the ;, then removes the integers based on the specified conditions

The get_the_rest function does the opposite, and substitutes those values with empty spaces

def get_nums(x):
    return " ".join([re.search(r'^[0-9] ', y.strip()).group() for y in x.split(";")])

def get_the_rest(x):
    return "".join([re.sub(r'[0-9] \s?\-', '', y.strip()) for y in x.split(";")])

df["B"] = df["A"].apply(lambda x: get_nums(x))
df["C"] = df["A"].apply(lambda x: get_the_rest(x))

CodePudding user response：

You can use a single regex call to findall and then grab only the necessary data:

import pandas as pd
df = pd.DataFrame({'A': ['205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu', '22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi']})

rx = r'(\d )\s*-\s*(.*?)(?=;\s*\d|$)' # Define the regex
m = df['A'].str.findall(rx)         # Get all matches and captures
df["B"] = m.apply(lambda x: " ".join(c[0] for c in x)) # Join Group 1 values into Column B
df["C"] = m.apply(lambda x: " ".join(c[1] for c in x)) # Join Group 1 values into Column C

Output:

>>> print(df.to_string())
                                                                            A            B                                                       C
0                       205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu       205 31               lksiosslsoujhs%ysus(0.33mismsn) jsks,msnu
1  22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi  22 14 67181  0204 - 2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi

See the regex demo that matches

(\d ) - Group 1: one or more digits
\s*-\s* - a - enclosed with zero or more whitespaces
(.*?) - Group 2: any zero or more chars other than line break chars as few as possible
(?=;\s*\d|$) - positive lookahead that matches a location that is immediately followed with ; zero or more whitespaces and then a digit, or end of string.