Home > front end >  Splitting a column in pandas dataframe using regular expression
Splitting a column in pandas dataframe using regular expression

Time:03-02

I have a column in my pandas dataframe that needs to be split using multiple delimiters into multiple columns without the loss of delimiters. The pandas dataframe is as follows:

      Unnamed: 0  Unnamed: 0.1  Sr. No.    Project Code  ... Total Amount Sanctioned (In Lakhs)    Supported By Current Status of Project                                  Component Details
0              0             0        1  28AP2000000002  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE09350.00MT
1              1             1        2  28AP2000000001  ...                              31.93  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE04800.00MT
2              2             2        3  28AP2000000004  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE08480.00MT
3              3             3        4  28AP2000000003  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE08650.00MT
4              4             4        5  28AP2000000005  ...                              34.20  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE03420.00MT
...          ...           ...      ...             ...  ...                                ...             ...                       ...                                                ...
2693          47            47       48    19AWB0004861  ...                               7.29             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694          48            48       49    19AWB0004860  ...                               6.59             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695          49            49       50    19AWB0004864  ...                              12.86             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696          50            50       51    19AWB0004865  ...                               2.39             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697          51            51       52    19AWB0004804  ...                               4.34             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019...

The column to be split is the last column 'Component Details'. The original example of how to split the column is as follows: enter image description here

enter image description here

As you can see, I have to split the last column into 5 different columns. My approach was to use regular expression to separate one column after another. Like for example, the first column could be separated with the rest by using fin = re.split(r"(RE|GE)", str, maxsplit=1) but I am not able to understand how to apply this for the entire pandas column. Can anybody help?

The column in the current form is:

0                 COLD / CA STORAGECOLD STORAGE09350.00MT
1                 COLD / CA STORAGECOLD STORAGE04800.00MT
2                 COLD / CA STORAGECOLD STORAGE08480.00MT
3                 COLD / CA STORAGECOLD STORAGE08650.00MT
4                 COLD / CA STORAGECOLD STORAGE03420.00MT
                              ...
2693    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019..

The output that I'm looking for looks like:

COLD / CA STORAGE           COLD STORAGE    0   200     MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  17700   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  17400   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  21000   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  5       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  5       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  10199   TONNES

CodePudding user response:

In order to mimic reading your data file I do

import re
import pandas as pd
import io

text = """Component Details
              COLD / CA STORAGECOLD STORAGE09350.00MT
              COLD / CA STORAGECOLD STORAGE04800.00MT
              COLD / CA STORAGECOLD STORAGE08480.00MT
              COLD / CA STORAGECOLD STORAGE08650.00MT
              COLD / CA STORAGECOLD STORAGE03420.00MT
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019TONES
"""

Looking at the structure, my guess is that the part with category is determined by a fixed width column, therefore in order to read the data you need to use a fixed field reader:

df = pd.read_fwf(io.StringIO(text), widths=[31, 100], skiprows=1, header=None)
df.columns = ["Category", "Component"]

This yields a dataframe where I have already obtained the component indicated by the first 31 characters. The data frame now looks like:

                      Category                 Component
0            COLD / CA STORAGE    COLD STORAGE09350.00MT
1            COLD / CA STORAGE    COLD STORAGE04800.00MT
2            COLD / CA STORAGE    COLD STORAGE08480.00MT
3            COLD / CA STORAGE    COLD STORAGE08650.00MT
4            COLD / CA STORAGE    COLD STORAGE03420.00MT
5  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA3.00NOS
6  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA3.00NOS
7  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA5.00NOS
8  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA5.00NOS
9  POST HARVEST INFRASTRUCTURE  CONVEYOR BELTNA1019TONES

In order to split on a regular expression you need to have pandas version 1.4 and higher. Moreover, in your case you also want to keep the matched pattern. There is not a direct way to get this, therefore I wrote a small function to retrieve a list of matched patterns:

def get_match_list(regex, dataserie):
    matches = list()
    for line in dataserie:
        if match := re.search(regex, line):
            matches.append(match.group(0))
        else:
            matches.append(None)
    return matches

This function returns a list of matches of regex in a data series.

Start with retrieving the number of components based on a 2 digit number of Na:

df["number"] = get_match_list("NA|\d{2}", df["Component"])

Now you can split your component column (which throws away the matched pattenr):

details = df["Component"].str.split("NA|\d{2}", expand=True, regex=True, n=1)
details.columns = ["Component", "Weight"]

Use the function again to get the units and remove them from the weight column:

details["unit"] = get_match_list("[A-Z] ", details["Weight"])
details["Weight"] = details["Weight"].str.replace("[A-Z]", "", regex=True)

Finally, put everything back together:

total = pd.concat([df[["Category", "number"]], details], axis=1)
total = total[["Category", "Component", "number", "Weight", "unit"]]

The result looks like:

                     Category      Component number  Weight   unit
0            COLD / CA STORAGE   COLD STORAGE     09  350.00     MT
1            COLD / CA STORAGE   COLD STORAGE     04  800.00     MT
2            COLD / CA STORAGE   COLD STORAGE     08  480.00     MT
3            COLD / CA STORAGE   COLD STORAGE     08  650.00     MT
4            COLD / CA STORAGE   COLD STORAGE     03  420.00     MT
5  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    3.00    NOS
6  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    3.00    NOS
7  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    5.00    NOS
8  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    5.00    NOS
9  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    1019  TONES
  • Related