Home > Mobile >  Pandas Split a Column by Multiple delimiters into same column
Pandas Split a Column by Multiple delimiters into same column

Time:02-17

I have a pandas dataframe of multiple columns. But the Column of interest say is Col A which looks like :

      dfInput

               A 

       12 - ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns); 891- 1o19jsksuisolslskosBN
       205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu
       22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi

So my goal is to split the strings into 2 parts one by the first '-' before/after the semicolon and then by the rest where the first column will have all the numbers on the left of first '-' and then the other column will have rest . So my output dataframe should look like:

        **dfOutDesired**

           B                                          C
         12 891    ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns) 1o19jsksuisolslskosBN
         205 31    lksiosslsoujhs%ysus(0.33mismsn)  jsks,msnu
         22 14 67181  0204 - 2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi

So till now I have tried the following based on earlier posts I saw:

      df['C'] =    df["A"].str.replace(r'\w \s*-\s*', '', regex=True)

      df['A'] = df['A'].str.replace(' ', '')
      df['A'] = df['A'].str.replace('-', ' - ')

      df["B"] = df["A"].str.replace(r'\s*;\s*\S ; \S ; \s*', ' ', regex=True).str.strip()

So I am getting partially correct output for Col C (where say 0204 in third row is not picked up) , but for Col B I am just getting the same Col A as follows:

       **mydfWrong**

                                B     
       12 - ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns); 891- 1o19jsksuisolslskosBN
       205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu
       22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi

                               C

            ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns) 1o19jsksuisolslskosBN
            lksiosslsoujhs%ysus(0.33mismsn)  jsks,msnu
            2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi
          

Not sure how to use the regex properly in this case. So any help will be immensely appreciated. Thanks in advance.

CodePudding user response:

You can do that with something like the following.

The get_nums function splits on the ;, then removes the integers based on the specified conditions

The get_the_rest function does the opposite, and substitutes those values with empty spaces

def get_nums(x):
    return " ".join([re.search(r'^[0-9] ', y.strip()).group() for y in x.split(";")])

def get_the_rest(x):
    return "".join([re.sub(r'[0-9] \s?\-', '', y.strip()) for y in x.split(";")])

df["B"] = df["A"].apply(lambda x: get_nums(x))
df["C"] = df["A"].apply(lambda x: get_the_rest(x))

CodePudding user response:

You can use a single regex call to findall and then grab only the necessary data:

import pandas as pd
df = pd.DataFrame({'A': ['205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu', '22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi']})

rx = r'(\d )\s*-\s*(.*?)(?=;\s*\d|$)' # Define the regex
m = df['A'].str.findall(rx)         # Get all matches and captures
df["B"] = m.apply(lambda x: " ".join(c[0] for c in x)) # Join Group 1 values into Column B
df["C"] = m.apply(lambda x: " ".join(c[1] for c in x)) # Join Group 1 values into Column C

Output:

>>> print(df.to_string())
                                                                            A            B                                                       C
0                       205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu       205 31               lksiosslsoujhs%ysus(0.33mismsn) jsks,msnu
1  22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi  22 14 67181  0204 - 2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi

See the regex demo that matches

  • (\d ) - Group 1: one or more digits
  • \s*-\s* - a - enclosed with zero or more whitespaces
  • (.*?) - Group 2: any zero or more chars other than line break chars as few as possible
  • (?=;\s*\d|$) - positive lookahead that matches a location that is immediately followed with ; zero or more whitespaces and then a digit, or end of string.
  • Related