I have a column in my pandas dataframe that needs to be split using multiple delimiters into multiple columns without the loss of delimiters. The pandas dataframe is as follows:
Unnamed: 0 Unnamed: 0.1 Sr. No. Project Code ... Total Amount Sanctioned (In Lakhs) Supported By Current Status of Project Component Details
0 0 0 1 28AP2000000002 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE09350.00MT
1 1 1 2 28AP2000000001 ... 31.93 Through NABARD Completed COLD / CA STORAGECOLD STORAGE04800.00MT
2 2 2 3 28AP2000000004 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE08480.00MT
3 3 3 4 28AP2000000003 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE08650.00MT
4 4 4 5 28AP2000000005 ... 34.20 Through NABARD Completed COLD / CA STORAGECOLD STORAGE03420.00MT
... ... ... ... ... ... ... ... ... ...
2693 47 47 48 19AWB0004861 ... 7.29 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694 48 48 49 19AWB0004860 ... 6.59 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695 49 49 50 19AWB0004864 ... 12.86 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696 50 50 51 19AWB0004865 ... 2.39 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697 51 51 52 19AWB0004804 ... 4.34 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019...
The column to be split is the last column 'Component Details'. The original example of how to split the column is as follows:
As you can see, I have to split the last column into 5 different columns. My approach was to use regular expression to separate one column after another. Like for example, the first column could be separated with the rest by using fin = re.split(r"(RE|GE)", str, maxsplit=1)
but I am not able to understand how to apply this for the entire pandas column. Can anybody help?
The column in the current form is:
0 COLD / CA STORAGECOLD STORAGE09350.00MT
1 COLD / CA STORAGECOLD STORAGE04800.00MT
2 COLD / CA STORAGECOLD STORAGE08480.00MT
3 COLD / CA STORAGECOLD STORAGE08650.00MT
4 COLD / CA STORAGECOLD STORAGE03420.00MT
...
2693 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019..
The output that I'm looking for looks like:
COLD / CA STORAGE COLD STORAGE 0 200 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 17700 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 17400 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 21000 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 10199 TONNES
CodePudding user response:
In order to mimic reading your data file I do
import re
import pandas as pd
import io
text = """Component Details
COLD / CA STORAGECOLD STORAGE09350.00MT
COLD / CA STORAGECOLD STORAGE04800.00MT
COLD / CA STORAGECOLD STORAGE08480.00MT
COLD / CA STORAGECOLD STORAGE08650.00MT
COLD / CA STORAGECOLD STORAGE03420.00MT
POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019TONES
"""
Looking at the structure, my guess is that the part with category is determined by a fixed width column, therefore in order to read the data you need to use a fixed field reader:
df = pd.read_fwf(io.StringIO(text), widths=[31, 100], skiprows=1, header=None)
df.columns = ["Category", "Component"]
This yields a dataframe where I have already obtained the component indicated by the first 31 characters. The data frame now looks like:
Category Component
0 COLD / CA STORAGE COLD STORAGE09350.00MT
1 COLD / CA STORAGE COLD STORAGE04800.00MT
2 COLD / CA STORAGE COLD STORAGE08480.00MT
3 COLD / CA STORAGE COLD STORAGE08650.00MT
4 COLD / CA STORAGE COLD STORAGE03420.00MT
5 POST HARVEST INFRASTRUCTURE CONVEYOR BELTNA3.00NOS
6 POST HARVEST INFRASTRUCTURE CONVEYOR BELTNA3.00NOS
7 POST HARVEST INFRASTRUCTURE CONVEYOR BELTNA5.00NOS
8 POST HARVEST INFRASTRUCTURE CONVEYOR BELTNA5.00NOS
9 POST HARVEST INFRASTRUCTURE CONVEYOR BELTNA1019TONES
In order to split on a regular expression you need to have pandas version 1.4 and higher. Moreover, in your case you also want to keep the matched pattern. There is not a direct way to get this, therefore I wrote a small function to retrieve a list of matched patterns:
def get_match_list(regex, dataserie):
matches = list()
for line in dataserie:
if match := re.search(regex, line):
matches.append(match.group(0))
else:
matches.append(None)
return matches
This function returns a list of matches of regex in a data series.
Start with retrieving the number of components based on a 2 digit number of Na:
df["number"] = get_match_list("NA|\d{2}", df["Component"])
Now you can split your component column (which throws away the matched pattenr):
details = df["Component"].str.split("NA|\d{2}", expand=True, regex=True, n=1)
details.columns = ["Component", "Weight"]
Use the function again to get the units and remove them from the weight column:
details["unit"] = get_match_list("[A-Z] ", details["Weight"])
details["Weight"] = details["Weight"].str.replace("[A-Z]", "", regex=True)
Finally, put everything back together:
total = pd.concat([df[["Category", "number"]], details], axis=1)
total = total[["Category", "Component", "number", "Weight", "unit"]]
The result looks like:
Category Component number Weight unit
0 COLD / CA STORAGE COLD STORAGE 09 350.00 MT
1 COLD / CA STORAGE COLD STORAGE 04 800.00 MT
2 COLD / CA STORAGE COLD STORAGE 08 480.00 MT
3 COLD / CA STORAGE COLD STORAGE 08 650.00 MT
4 COLD / CA STORAGE COLD STORAGE 03 420.00 MT
5 POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3.00 NOS
6 POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3.00 NOS
7 POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5.00 NOS
8 POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5.00 NOS
9 POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 1019 TONES