How to check if a file is a CSV, XML, or JSON without relying on the file extension?

Time:12-16

I have a program that accepts a combination of multiple input files, and also can take directories to scan for those input files.

Normally, I only accept files with valid file extensions for CSV, XML, and JSON - but as I've learned over the years, you can't really trust the user to give the correct files.

For example, if a user has a file with an XML extension, but the actual contents are JSON, I would want to warn the user and treat the file as a JSON file. Another situation is that on Unix-based operating systems, file extensions aren't used to identify file types the way Windows does, which opens up more cases where the file type doesn't match the file contents.

Right now, my program collects the list of files, filters them down to only the ones with acceptable file extensions, and then creates a concurrent.futures.ThreadPoolExecutor to check all the files simultaneously. The actual check looks like so:

with open(the_file, "r") as f:
  check = f.read(1)
  if check == "{" or check == "[":
    file_type = "json"
  elif check == "<":
    file_type = "xml"
  else:
    file_type = "csv"

There are two major issues with this:

  1. If the file is supposed to be a CSV, but its first character is [, {, or <, then the program will assign it the wrong file type.
  2. Conversely, if the file is not any of the acceptable types and does not start with [, {, or <, then the program will simply assume it is supposed to be a CSV.

I have the following idea to remedy this:

import csv
import json
import defusedxml.ElementTree as xml

try:
  # xml.parse accepts a filename directly
  xml.parse(the_file)
  return "XML"
except xml.ParseError:
  try:
    # json.load expects a file object, not a path
    with open(the_file, "r") as json_file:
      json.load(json_file)
    return "JSON"
  except json.JSONDecodeError:
    try:
      with open(the_file, "r") as csv_file:
        # DictReader is lazy; iterate to force actual parsing
        for _ in csv.DictReader(csv_file):
          pass
      return "CSV"
    except csv.Error:
      print("The file is not in any acceptable format")

The problem with this is that since my program tries to run this check on multiple files simultaneously, the memory usage can get dangerously high with so many files opened at the same time. The other minor issue is the I/O cost of opening and closing so many files at once just to check for the file type.

Is there a more efficient alternative for what I'm trying to do?

CodePudding user response:

Here are just some corner-case examples showing that you cannot read only a few characters and still hope for a robust classification:

Starts like JSON but is actually a CSV with a semicolon delimiter:

[12];[13,14]
...

Starts like XML but again is a CSV:

<foo>,<bar>,<fee>
1,2,3
...

Starts with spaces but is valid XML (spaces are shown as _ here):

__<foo>text...</foo>

IMHO, the best you can do is read the first few characters of the file to decide which format to test first, but:

  • you will have to read each file at least once
  • you will have to read corner cases and invalid files more than once
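That two-step idea (peek to pick an order, then verify with a real parse) might be sketched like this. The function names, the 64-byte peek size, and the use of csv.Sniffer as the CSV check are my own assumptions, and I use the stdlib xml.etree here; you can swap in defusedxml.ElementTree for untrusted input:

```python
import csv
import json
import xml.etree.ElementTree as ET


def detect_format(path, peek_size=64):
    """Guess which parser to try first from a short peek,
    then confirm by actually parsing the whole file."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        head = f.read(peek_size).lstrip()

    # Order the full-parse attempts by what the peek suggests,
    # so well-named files are usually parsed exactly once.
    if head.startswith("<"):
        order = ("xml", "json", "csv")
    elif head[:1] in ("{", "["):
        order = ("json", "xml", "csv")
    else:
        order = ("csv", "json", "xml")

    for fmt in order:
        if _parses_as(path, fmt):
            return fmt
    return None


def _parses_as(path, fmt):
    """Return True if the file fully parses as the given format."""
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            if fmt == "xml":
                ET.parse(f)
            elif fmt == "json":
                json.load(f)
            else:
                # Sniffer raises csv.Error when no delimiter can be found
                csv.Sniffer().sniff(f.read(4096))
        return True
    except (ET.ParseError, json.JSONDecodeError, csv.Error):
        return False
```

Because each file is opened, checked, and closed inside the function, nothing stays loaded in memory across the ThreadPoolExecutor workers; the peek only biases the order of the (unavoidable) full-parse attempts.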

CodePudding user response:

XML, JSON, and CSV are presented differently and can often be differentiated by the first line.
JSON, for example, will probably start with { or [{.
XML, for example, should start with <?xml version="1.0" encoding="UTF-8"?>
CSV is a bit trickier because the first line can have much more variety, but depending on the expected data, you can craft additional tests.

This method clearly does not validate the actual data but can help detect which format you are dealing with.
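A first-line check along these lines might look like the following sketch. The function name and the naive comma test are my own; the CSV branch in particular should be tailored to the data you actually expect:

```python
def guess_from_first_line(path):
    """Cheap format guess from the first line only.
    Does not validate the data; returns None when unsure."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        line = f.readline().lstrip()

    if line.startswith("<"):          # covers <?xml ...?> and bare root tags
        return "xml"
    if line[:1] in ("{", "["):        # JSON object or array
        return "json"
    if "," in line:                   # naive: assumes comma-delimited CSV
        return "csv"
    return None
```

This reads at most one line per file, so it is cheap enough to run on large directory scans before any full parse.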
