Python: count headers in a CSV file

I want to know the number of header lines my csv file contains (between 0 and ~50). The file itself is huge (so it is mandatory not to read the complete file for this) and contains numerical data. I know that csv.Sniffer has a has_header() function, but that can only detect 1 header. One idea I had is to recursively call the has_header function (supposing it detects the first header) and then count the recursions. I am sure, though, that there is a much smarter way.

Googling was kind of a pain, since no matter what you search, if it includes "count" and "csv" at some point, you get all the "count rows in csv" results :D

Thanks in advance!

CodePudding user response：

Use re.search to find lines that contain two or more letters in a row. Two letters are required rather than one so that scientific notation (e.g., 1.0e5) is not counted as a header.

# In the shell, create a test file:
# printf 'foo,bar\nbaz,bletch\n1e4,2.0\n2E5,2\n' > in_file.csv

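import re

num_header_lines = 0
with open('in_file.csv') as in_file:
    for line in in_file:
        # A header line contains at least two consecutive letters
        if re.search(r'[A-Za-z]{2,}', line):
            num_header_lines += 1
        else:
            # Stop at the first data line so the rest of the file is never read
            break
print(num_header_lines)
# 2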

CodePudding user response：

Try this:

import pandas as pd

df = pd.read_csv('your_file.csv', index_col=0)

num_rows, num_cols = df.shape

CodePudding user response：

Here's a sketch for finding the first line which matches a particular criterion. For demo purposes, I use the criterion "there are empty fields":

import csv

# filename is the path to your CSV file
with open(filename, "r", encoding="utf-8", newline="") as handle:
    for lineno, fields in enumerate(csv.reader(handle), 1):
        if "" in fields:
            print(lineno-1)
            break

You'd update it to look for something which makes sense for your data, like perhaps "third and eighth fields contain numbers":

        try:
            float(fields[2])
            float(fields[7])
            print(lineno-1)
            break
        except ValueError:
            continue

(notice how the list fields is indexed starting at zero, so the first field is fields[0] and the third is fields[2]), or perhaps a more sophisticated model where the first line contains no empty fields, successive lines contain more and more empty fields, and then the first data line contains fewer empty fields:

    maxempty = 0
    for lineno, fields in enumerate(csv.reader(handle), 1):
        empty = fields.count("")
        if empty > maxempty:
            maxempty = empty
        elif empty < maxempty:
            print(lineno-1)
            break

We simply print the line number of the last header line, since your question asks how many there are. Perhaps printing or returning the number of the first data line would make more sense in some scenarios.

This code doesn't use Pandas at all, just the regular csv module from the Python standard library. It stops reading when you hit break so it doesn't matter for performance how many lines there are after that (though if you need to experiment or debug, maybe create a smaller file with only, say, the first 200 lines of your real file).
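
Putting the pieces above together, a self-contained version might look like the sketch below. The names is_numeric_row and count_header_lines, the max_lines cap, and the criterion "every field parses as a float" are assumptions made for illustration; if your data rows can contain empty or non-numeric fields, you would need a different test.

```python
import csv


def is_numeric_row(fields):
    """Return True if the row is non-empty and every field parses as a float."""
    if not fields:
        return False
    try:
        for field in fields:
            float(field)
    except ValueError:
        return False
    return True


def count_header_lines(path, max_lines=200):
    """Count the rows before the first all-numeric row.

    Reads at most max_lines rows, so the size of the file does not matter.
    Returns None if no data row is found within that limit.
    """
    with open(path, "r", encoding="utf-8", newline="") as handle:
        for lineno, fields in enumerate(csv.reader(handle), 1):
            if is_numeric_row(fields):
                return lineno - 1
            if lineno >= max_lines:
                break
    return None
```

Calling count_header_lines("in_file.csv") on a file whose first two lines are text and whose third line is numeric should return 2. Returning None instead of a number when nothing matches makes it harder to mistake "gave up" for "no headers".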

CodePudding user response：

This only reads the first line of the csv file and counts its fields:

import csv

with open('ornek.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
sizeOfHeader = len(row1)

CodePudding user response：

Well, I think that you could get the first line of the csv file and then split it on ",". That will return a list with all the headers in it. Now you can just count them with len.
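
A minimal sketch of that idea, assuming a comma-delimited file with no quoted fields (the function name count_columns is made up for this example):

```python
def count_columns(path, delimiter=","):
    """Read only the first line and count the delimiter-separated names."""
    with open(path, "r", encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\r\n")
    if not first_line:
        return 0
    return len(first_line.split(delimiter))
```

Note that this counts the columns named in the first header line, not the number of header lines. A plain split also miscounts when a header name itself contains a quoted comma; csv.reader handles that case.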