I've created a program so that I can split the dataset's column into 4 columns but some of my datasets only have 2 columns so when that section is reached, an error is thrown. I believe an if else statement can help with this.
Here is the code for my program:
import pandas as pd
import os
# reading csv file from url
filepath = "C:/Users/username/folder1/folder2/folder3/b 2 col.csv"
file_encoding = 'cp1252'
data = pd.read_csv(filepath , header=None, names = list(range(0,4)) , encoding=file_encoding)
data.columns =['ID', 'Name', 'S ID', 'SName']
# new data frame with split value columns
new = data["Name"].str.split(",", n = 1, expand = True)
# making separate first name column from new data frame
data["Last Name"]= new[0]
# making separate last name column from new data frame
data["First Name"]= new[1]
# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)
# making separate first name column from new data frame
data["S Last Name"]= new[0]
# making separate last name column from new data frame
data["S First Name"]= new[1]
# Saving File name as its path
filename = os.path.basename(filepath) ".xlsx"
data.to_excel(filename, index=False)
data
This section onwards is responsible for the splitting of the second set of data
# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)
Problem is not all of my CSV have four columns, so if I can implement an if else here to check if data is present then proceed else skip and move to the next section:
# Saving File name as its path
filename = os.path.basename(filepath) ".xlsx"
data.to_excel(filename, index=False)
data
I believe the program would work with my datasets
Link to an example of my datasets: https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing
Thanks for helping
CodePudding user response:
You can split your 2 column to 4 column like that.
#if there are some missing columns
data['First Name'] = np.nan
data['Last Name'] = np.nan
data['S Last Name'] = np.nan
data['S Last Name'] = np.nan
#if there is not missing values remove above
data[['First_Name', 'Last Name']] = data.SName.str.split(",", expand=True)
data[['S First_Name', 'S Last Name']] = data.SName.str.split(",", expand=True)
CodePudding user response:
IIUC, assuming the (.csv
) files are in the same folder, here is a proposition with pandas.concat
:
import os
import pandas as pd
filepath = "C:/Users/username/folder1/folder2"
file_encoding = "cp1252"
list_df = []
for filename in os.listdir(filepath):
if filename.endswith(".csv"):
df = pd.read_csv(os.path.join(filepath, filename), header=None,
skiprows=1, encoding=file_encoding)
df = pd.concat([df.iloc[:, i:i 2].pipe(lambda df_: df_.rename(columns={col:i for i, col in enumerate(df_.columns)}))
for i in range(0, df.shape[1], 2)], axis=0).set_axis(["ID", "FullName"], axis=1)
df.insert(0, "filename", filename) #comment this line if you don't want to show the filename as a column
list_df.append(df)
out = (pd.concat(list_df, ignore_index=True)
.pipe(lambda df_: df_.join(df_["FullName"]
.str.split(", ", expand=True)
.rename(columns={0: "FirstName", 1: "LastName"}))))
Output :
print(out)
filename ID FullName FirstName LastName
0 a 4 col.csv 1 Bruce, Wayne Bruce Wayne
1 a 4 col.csv 2 James, Gordon James Gordon
2 a 4 col.csv 3 Fish, Mooney Fish Mooney
3 a 4 col.csv 4 Selina, Kyle Selina Kyle
4 a 4 col.csv 5 Harvey, Bullock Harvey Bullock
5 a 4 col.csv 6 Alfred, Pennyworth Alfred Pennyworth
6 a 4 col.csv 7 Oswald, Cobblepot Oswald Cobblepot
7 a 4 col.csv 8 Barbara, Kean Barbara Kean
8 a 4 col.csv 9 Edward, Nygma Edward Nygma
9 a 4 col.csv 10 Leslie, Thompkins Leslie Thompkins
10 b 2 col.csv 1 Bruce, Wayne Bruce Wayne
11 b 2 col.csv 2 James, Gordon James Gordon
12 b 2 col.csv 3 Fish, Mooney Fish Mooney
13 b 2 col.csv 4 Selina, Kyle Selina Kyle
14 b 2 col.csv 5 Harvey, Bullock Harvey Bullock