How to read csv files with different amounts of columns using Python-CodePudding

import glob
files = glob.glob("Data/*.csv")
df = pd.concat((pd.read_csv(f) for f in files))
print(df)

I get an error that says: "ParserError: Error tokenizing data. C error: Expected 39 fields in line 273, saw 40". Then as per this question: import csv with different number of columns per row using Pandas , I tried passing in the names of the columns, using StringIO and BytesIO, then I got errors like: "TypeError: initial_value must be str or None, not list" or "TypeError: a bytes-like object is required, not 'list'"

CodePudding user response：

it looks like you have not tried all solutions as you actually had an answer in the link you shared: https://stackoverflow.com/a/57824142/8805842 if you inspect the last row/last column cell in your .csv file you will see why you get error.

Solution (simple copy/paste from your question link) with 2 more rows to remove unwanted/empty columns

    ### Loop the data lines
    with open("storm_data_search_results.csv", 'r') as temp_f:
        # get No of columns in each line
        col_count = [ len(l.split(",")) for l in temp_f.readlines() ]
    
    ### Generate column names  (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = [i for i in range(0, max(col_count))]
    
    ### Read csv
    df = pd.read_csv("storm_data_search_results.csv", header=None, delimiter=",", names=column_names)
    
    # my addition
    df.columns = df.iloc[0] # create headers from the first row
    df = df.iloc[:, 0:39] # keeping data frame with named headers only

Update OMG, be careful... the data they give in .csv actually is not structured properly.... just scroll all it down... if you can use any other source, - use it, Unless you do not need "comments" and you can drop them.

CodePudding user response：

Assuming that the problem comes from the text fields that are multiline and can easily get messed up... ...you can remove them using RegEx: re.subn(r'(".*?")',"_______________",xx,xx.count('"'), re.DOTALL)

Also, assuming constant headers in all files, you can process all in text and then parse once.


# Read headers
headers = open(files[0]).read().split('\n',1)[0].split(',')

# Read all files and remove headers
xx = [open(ff).read().split('\n',1)[1] for ff in files]

# Remove the comments fields
dd = [re.sub(r'(".*?")',"__",x,x.count('"'), re.DOTALL) for x in xx]

# Load as CSV
df = pd.read_csv(StringIO(''.join(dd), names = headers)