how to delete unwanted data from a csv file in python


import pandas as pd
sea_level_df = pd.read_csv(r"C:\Users\slaye\OneDrive\Desktop\SeaLevel.csv")
display(sea_level_df)

I'm trying to delete the first 3 rows of this file without literally opening the file, highlighting the unwanted text, and pressing backspace. Is there a way I can do this in Python?

This is the top of the CSV file:

#title = mean sea level anomaly global ocean (66S to 66N) (Annual signals retained) 
#institution = NOAA/Laboratory for Satellite Altimetry 
#references = NOAA Sea Level Rise 
year,TOPEX/Poseidon,Jason-1,Jason-2,Jason-3
1992.9614,-16.27000,
1992.9865,-17.97000,
1993.0123,-14.87000,
1993.0407,-19.87000,
1993.0660,-25.27000,
1993.0974,-29.37000,

I want to delete the first 3 `#`-prefixed rows of text so I can parse the rest into a table in pandas. I'm getting the following error:

ParserError                               Traceback (most recent call last)
Input In [14], in <cell line: 2>()
      1 import pandas as pd
----> 2 sea_level_df = pd.read_csv(r"C:\Users\slaye\OneDrive\Desktop\SeaLevel.csv")
      3 display(sea_level_df)

File ~\anaconda3\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py:680, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    665 kwds_defaults = _refine_defaults_read(
    666     dialect,
    667     delimiter,
   (...)
    676     defaults={"delimiter": ","},
    677 )
    678 kwds.update(kwds_defaults)
--> 680 return _read(filepath_or_buffer, kwds)

File ~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py:581, in _read(filepath_or_buffer, kwds)
    578     return parser
    580 with parser:
--> 581     return parser.read(nrows)

File ~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py:1254, in TextFileReader.read(self, nrows)
   1252 nrows = validate_integer("nrows", nrows)
   1253 try:
-> 1254     index, columns, col_dict = self._engine.read(nrows)
   1255 except Exception:
   1256     self.close()

File ~\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:225, in CParserWrapper.read(self, nrows)
    223 try:
    224     if self.low_memory:
--> 225         chunks = self._reader.read_low_memory(nrows)
    226         # destructive to chunks
    227         data = _concatenate_chunks(chunks)

File ~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx:805, in pandas._libs.parsers.TextReader.read_low_memory()

File ~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx:861, in pandas._libs.parsers.TextReader._read_rows()

File ~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx:847, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx:1960, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 5

CodePudding user response:

From the `read_csv` documentation, you can pass `skiprows=3` to ignore the first 3 rows of the file.

Otherwise, pandas just reads your CSV from the top down and assumes that every row will follow the pattern of the first row. It doesn't see any delimiters (comma, tab, etc.) in the first row, so it assumes your data has only one column. The next two rows follow the same pattern (no delimiters = 1 column), but then commas appear in the 4th row. Pandas sees these as delimiters (which would indicate more than one column), but since there were none in the first rows, it expects only one column for the whole CSV, so it raises `ParserError: Expected 1 fields in line 4, saw 5`.
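Here's a small sketch of the fix, using an in-memory stand-in for the top of your `SeaLevel.csv` (the sample data below is copied from your question, so the file path and full contents are assumed):

```python
import io
import pandas as pd

# Stand-in for the top of SeaLevel.csv; in your code you'd pass the
# file path to read_csv instead of this StringIO buffer.
raw = """\
#title = mean sea level anomaly global ocean (66S to 66N) (Annual signals retained)
#institution = NOAA/Laboratory for Satellite Altimetry
#references = NOAA Sea Level Rise
year,TOPEX/Poseidon,Jason-1,Jason-2,Jason-3
1992.9614,-16.27000,
1992.9865,-17.97000,
1993.0123,-14.87000,
"""

# skiprows=3 tells pandas to ignore the three '#' metadata lines,
# so the fourth line ("year,...") becomes the header row. Data rows
# with fewer fields than the header are padded with NaN.
sea_level_df = pd.read_csv(io.StringIO(raw), skiprows=3)
print(sea_level_df.columns.tolist())
# ['year', 'TOPEX/Poseidon', 'Jason-1', 'Jason-2', 'Jason-3']
```

If the number of metadata lines might change between files, `read_csv` also accepts `comment="#"`, which skips any line starting with that character rather than a fixed count.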
