Pandas .Split If Else-CodePudding

I've created a program so that I can split the dataset's column into 4 columns but some of my datasets only have 2 columns so when that section is reached, an error is thrown. I believe an if else statement can help with this.

Here is the code for my program:

import pandas as pd
import os
   
# reading csv file from url
filepath = "C:/Users/username/folder1/folder2/folder3/b 2 col.csv"
file_encoding = 'cp1252' 
data = pd.read_csv(filepath , header=None, names = list(range(0,4)) , encoding=file_encoding)

data.columns =['ID', 'Name',  'S ID', 'SName']

# new data frame with split value columns
new = data["Name"].str.split(",", n = 1, expand = True)

# making separate first name column from new data frame
data["Last Name"]= new[0]

# making separate last name column from new data frame
data["First Name"]= new[1]

# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)

# making separate first name column from new data frame
data["S Last Name"]= new[0]

# making separate last name column from new data frame
data["S First Name"]= new[1]

# Saving File name as its path
filename = os.path.basename(filepath)   ".xlsx"
data.to_excel(filename, index=False)

data

This section onwards is responsible for the splitting of the second set of data

# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)

Problem is not all of my CSV have four columns, so if I can implement an if else here to check if data is present then proceed else skip and move to the next section:

    # Saving File name as its path
filename = os.path.basename(filepath)   ".xlsx"
data.to_excel(filename, index=False)

data

I believe the program would work with my datasets

Link to an example of my datasets: https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing

Thanks for helping

CodePudding user response：

You can split your 2 column to 4 column like that.


#if there are some missing columns
data['First Name'] = np.nan
data['Last Name'] = np.nan
data['S Last Name'] = np.nan
data['S Last Name'] = np.nan
#if there is not missing values remove above

data[['First_Name', 'Last Name']] = data.SName.str.split(",", expand=True)

data[['S First_Name', 'S Last Name']] = data.SName.str.split(",", expand=True)

CodePudding user response：

IIUC, assuming the (.csv) files are in the same folder, here is a proposition with pandas.concat :

import os
import pandas as pd
    
filepath = "C:/Users/username/folder1/folder2"
file_encoding = "cp1252"
    
list_df = []

for filename in os.listdir(filepath):
    if filename.endswith(".csv"):
        df = pd.read_csv(os.path.join(filepath, filename), header=None,
                        skiprows=1, encoding=file_encoding)
        df = pd.concat([df.iloc[:, i:i 2].pipe(lambda df_: df_.rename(columns={col:i for i, col in enumerate(df_.columns)}))
                        for i in range(0, df.shape[1], 2)], axis=0).set_axis(["ID", "FullName"], axis=1)
        df.insert(0, "filename", filename) #comment this line if you don't want to show the filename as a column
        
        list_df.append(df)

out = (pd.concat(list_df, ignore_index=True)
           .pipe(lambda df_: df_.join(df_["FullName"]
                                        .str.split(", ", expand=True)
                                        .rename(columns={0: "FirstName", 1: "LastName"}))))

Output :

print(out)

       filename  ID            FullName FirstName     LastName
0   a 4 col.csv   1        Bruce, Wayne     Bruce        Wayne
1   a 4 col.csv   2       James, Gordon     James       Gordon
2   a 4 col.csv   3        Fish, Mooney      Fish       Mooney
3   a 4 col.csv   4        Selina, Kyle    Selina         Kyle
4   a 4 col.csv   5     Harvey, Bullock    Harvey      Bullock
5   a 4 col.csv   6  Alfred, Pennyworth    Alfred   Pennyworth
6   a 4 col.csv   7   Oswald, Cobblepot    Oswald    Cobblepot
7   a 4 col.csv   8       Barbara, Kean   Barbara         Kean
8   a 4 col.csv   9       Edward, Nygma    Edward        Nygma
9   a 4 col.csv  10   Leslie, Thompkins    Leslie    Thompkins
10  b 2 col.csv   1        Bruce, Wayne     Bruce        Wayne
11  b 2 col.csv   2       James, Gordon     James       Gordon
12  b 2 col.csv   3        Fish, Mooney      Fish       Mooney
13  b 2 col.csv   4        Selina, Kyle    Selina         Kyle
14  b 2 col.csv   5     Harvey, Bullock    Harvey      Bullock