What is the best way to find the first blank line in a file when the input is sometimes a .csv and sometimes a .xls? A blank line is guaranteed to exist, but its row number varies from file to file. Every file starts with a block of report information whose length varies by a line or two, so I may need to skip the first 4, 5, or 6 rows and cannot hard-code the number. My goal is to skip that block and read everything after it into a DataFrame; the line right after the first blank line is where the data I want in the df begins.
What I am missing is something that skips this variable number of rows. I have a small function that identifies the file type: if it returns True the file is an xls file, and if it returns False the file is a CSV file. In my example file below, the first blank row is at row 7.
1: CSV
This reads forever and I have to interrupt execution for the program to quit. A key point: when I run f.readline() and look at the output line by line, the loop goes straight past the blank line because that line is not '\n' as expected. Instead it is always something like ',,,,,,,,,,\n', and the number of commas is not consistent across my many CSV files. How can I identify this as a blank line without tweaking the code every time the number of commas in the first blank row changes?
import pandas as pd
file = 'input_file.csv'
f = open(file)
while f.readline() not in ('\n'):
    pass
final_df = pd.read_csv(f, header=None)
Example file:
| report | | | |
|---|---|---|---|
| random info | | | |
| more info | | | |
| Project number | 111111 | | |
| Order number | | | |
| Plates | Plate1 | Plate2 | Plate3 |
| | | | |
| DNA \ Assay | id1 | id2 | id3 |
| Name1 | C:C | G:G | T:C |
| Name2 | C:C | G:G | C:C |
| Name3 | C:C | G:G | T:C |
Current output for the readline function that is looking for the newline, at the newline:
',,,,,,,,,,\n'
Expected output (final_df):
| DNA \ Assay | id1 | id2 | id3 |
|---|---|---|---|
| Name1 | C:C | G:G | T:C |
| Name2 | C:C | G:G | C:C |
| Name3 | C:C | G:G | T:C |
2: XLS
When the files are in xls format, they look exactly the same as the example file above, so that example covers this case as well with no changes needed.
My idea for reading the files when they come in as xls is:
import tempfile
import pandas as pd
df = pd.read_excel(file)
f = tempfile.NamedTemporaryFile()
df.to_csv(f)
f.seek(0)
line = str(f.readline()).strip()
The current output of print(line) is:
b',report,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46\n'
I would rather not keep reading the file this way if there is another way to find the first blank line with pd.read_excel.
Ideally I would use something like final_df = pd.read_csv(line) to produce final_df, but that does not work.
The expected output is the same as final_df listed above:
| DNA \ Assay | id1 | id2 | id3 |
|---|---|---|---|
| Name1 | C:C | G:G | T:C |
| Name2 | C:C | G:G | C:C |
| Name3 | C:C | G:G | T:C |
CodePudding user response:
I think the easiest way to handle this, especially since you might have either csv or xls files, is to read the data and clean it up afterwards. Something like this might help, and the same clean-up works on both formats:
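import pandas as pd
df = pd.read_excel(file)
# First row whose first column is empty, i.e. the blank line
new_line = min(df[df.iloc[:, 0].isnull()].index)
# The row after the blank line holds the column names; the data follows it
df.columns = df.iloc[new_line + 1]
df = df.iloc[new_line + 2:, :]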
Essentially you read the whole file, find the first empty line, and reconstruct the dataframe starting from the "new_line".
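A variation on the same idea, as a minimal sketch rather than a definitive implementation: read the file with header=None so every row stays in the data, treat a row as blank when all of its cells are NaN (which does not depend on how many commas the CSV row has), and promote the following row to the header. The function name read_after_first_blank_row and the is_xls flag are hypothetical; the flag stands in for the result of the file-type check mentioned in the question. The sample text is an assumed stand-in for the example file, and the sketch assumes every CSV row has the same number of fields. Reading .xls files with pd.read_excel also needs an Excel engine such as xlrd installed.

```python
import io

import pandas as pd


def read_after_first_blank_row(source, is_xls=False):
    """Return the table whose header sits on the row after the first blank row.

    `source` is a path or file-like object. `is_xls` is a hypothetical flag
    standing in for the result of the question's file-type check.
    """
    if is_xls:
        raw = pd.read_excel(source, header=None)
    else:
        # skip_blank_lines=False keeps a truly empty line as an all-NaN row too.
        raw = pd.read_csv(source, header=None, skip_blank_lines=False)

    # A blank row is one where every cell is NaN, whatever the column count.
    blank = raw.isna().all(axis=1).to_numpy()
    if not blank.any():
        raise ValueError("no blank row found")
    first_blank = int(blank.argmax())  # position of the first True

    # Keep the rows after the blank one and drop columns that are entirely empty.
    table = raw.iloc[first_blank + 1:].dropna(axis=1, how="all")
    body = table.iloc[1:].reset_index(drop=True)
    body.columns = table.iloc[0].tolist()
    return body


# Small in-memory stand-in for the example file (assumed layout).
sample = io.StringIO(
    "report,,,\n"
    "random info,,,\n"
    "more info,,,\n"
    "Project number,111111,,\n"
    "Order number,,,\n"
    "Plates,Plate1,Plate2,Plate3\n"
    ",,,\n"
    "DNA \\ Assay,id1,id2,id3\n"
    "Name1,C:C,G:G,T:C\n"
    "Name2,C:C,G:G,C:C\n"
    "Name3,C:C,G:G,T:C\n"
)
final_df = read_after_first_blank_row(sample)
print(final_df)
```

For a real file the call would look like read_after_first_blank_row('input_file.csv') or read_after_first_blank_row(path, is_xls=True). Checking every column instead of only the first one avoids stopping early on a row that merely has an empty first cell.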