I am writing a code that goes through many csv files in a folder(using a for loop), removes bad data from each csv file(where row values are more than number of columns or sometimes lesser than number of columns). After removing, I rearrange the columns and then I write the useful data into a new csv file.
Here in the code below the for loop is for cycling between different files present in a folder. You can assume the df=pd.read_csv
line as the beginning and assume correct indentation.
import pandas as pd
import os
for filename in os.listdir("csv files copy"):
filenames=os.path.join("csv files copy",filename)
print(filename)
df=pd.read_csv(filenames, error_bad_lines=False)
for row in df:
df.columns=["id","FirstName","LastName","UserName","Phone","IsContact","RestrictionReason","Status","IsScam","Date"]
df = df.drop(labels="Status", axis=1)
df = df.reindex(columns=['id', 'Phone', 'FirstName', 'LastName', 'UserName',"IsContact","IsScam","Date","RestrictionReason"])
df.to_csv(filenames,index=False)
While doing so this is the error I recieve.
ValueError: Length mismatch: Expected axis has 9 elements, new values have 10 elements
This is the first 4 values and the header of the dataframe that I am using:
id Phone FirstName LastName UserName IsContact IsScam Date RestrictionReason Status
Forex Pips Fire Free NaN Goldenboy NaN Goldenboyys False False 5/7/2022 8:34:07 AM NaN NaN
Forex Pips Fire Free NaN Abu 3odeh NaN oudah12 False False 5/7/2022 8:38:03 AM NaN NaN
Forex Pips Fire Free NaN Rahman Azar Rahman_Azar False False 5/7/2022 8:41:22 AM NaN NaN
Forex Pips Fire Free NaN HUDLE NaN Hudle1051 False False 5/7/2022 8:41:11 AM NaN NaN
And given below is the header of the destination csv file that the above data needs to be entered into
id Phone FirstName LastName UserName IsContact IsScam Date RestrictionReason
CodePudding user response:
Because you give only 9 columns in this line you missed the 'Status'
column
df = df.reindex(columns=['id', 'Phone', 'FirstName', 'LastName', 'UserName', 'IsContact', 'IsScam', 'Date', 'RestrictionReason'])
df.to_csv(filenames, index=False)
CodePudding user response:
You need to remove the for loop as follows:
import pandas as pd
import os
for filename in os.listdir("csv files copy"):
filenames = os.path.join("csv files copy", filename)
print(filename)
df = pd.read_csv(filenames, error_bad_lines=False)
df.columns = ["id", "FirstName", "LastName", "UserName", "Phone", "IsContact", "RestrictionReason", "Status", "IsScam", "Date"]
df = df.drop(labels="Status", axis=1)
df = df.reindex(columns=["id", "Phone", "FirstName", "LastName", "UserName","IsContact","IsScam","Date","RestrictionReason"])
df.to_csv(filenames, index=False)
This was causing the error and is not needed. The first time through the loop it correctly removes Status
column and saves the CSV file. The second time through the loop (on the same dataframe) it attempts to do df.columns
again but now there is no Status
column, so an incorrect number of columns are given.
The code for row in df:
would actually iterate over the column names in the dataframe,
e.g. id
then FirstName
etc.