I am reading in a CSV file to calculate some stats through Python.
I know that I can pass converters to read_csv at the start of the program to adjust for some of the potential data issues, but for some reason, when I try to do that, the results come out inflated.
It's a 20-column CSV with over 1000 rows of data. Public domain datalink is here:
Obviously, this is no good either. How can I either set the header to bypass the word 'Episodes' and do the calculations, or rewrite the following call to correct for this?
df = pd.read_csv(r'dataanime.csv', encoding='utf-8', header=None, skiprows=1, converters = {2 : lambda s: float(s.replace('Episodes','').join(s.replace('-','0')))})
CodePudding user response:
It's easier to read CSV files by letting pandas figure out how to handle the headers. By not passing anything into header and skiprows, pandas will infer that the first line in the CSV is the header line and name your columns appropriately. To deal with the "-" Episodes values, you can set na_values to indicate that "-" in that column is a NaN value, and use dropna() to remove those rows when calculating statistics.
import pandas as pd

# treat "-" in the Episodes column as NaN while parsing
df = pd.read_csv("dataanime.csv", encoding="utf-8", na_values={"Episodes": "-"})
# calculate stats on the Episodes column
episode_values = df["Episodes"].dropna()
mean1 = episode_values.mean()
sum1 = episode_values.sum()
...
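As for why the original converter gave inflated numbers, the likely culprit is str.join: it inserts the string it is called on between every character of its argument, so the digits get duplicated instead of cleaned. Below is a minimal sketch of that effect, followed by the na_values approach run on a tiny in-memory CSV. The three sample rows and the Title column are made up for illustration and stand in for dataanime.csv; only the Episodes column name is taken from the question.

```python
import io

import pandas as pd

# str.join puts the separator between each character of its argument,
# so the original lambda builds a longer digit string rather than a clean one.
s = "12"
inflated = s.replace("Episodes", "").join(s.replace("-", "0"))
print(inflated)  # "12".join("12") is "1" + "12" + "2"

# Hypothetical sample data standing in for dataanime.csv.
sample_csv = io.StringIO(
    "Title,Episodes\n"
    "Show A,12\n"
    "Show B,-\n"
    "Show C,24\n"
)

# The header row is inferred, and "-" in Episodes is parsed as NaN.
df = pd.read_csv(sample_csv, na_values={"Episodes": "-"})

episode_values = df["Episodes"].dropna()
mean1 = episode_values.mean()
sum1 = episode_values.sum()
print(len(episode_values), mean1, sum1)
```

Note that mean() and sum() skip NaN by default, so the dropna() call mainly matters if you also need the count of valid rows.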