This is my first time asking a question on this forum, so hopefully I won't make a fool of myself. I am a student in an IT program, and I was briefly introduced to the csv and Matplotlib libraries today. One assignment was to make a graph/diagram of the maximum and minimum temperatures and the corresponding dates in this CSV file. I need the row numbers and I need the program to understand the right format/syntax of the cells, but I am really not sure how to do that.
Example of CSV file here:
"STATION","NAME","DATE","PRCP","TMAX","TMIN","TOBS"
"USC00042319","DEATH VALLEY, CA US","2018-01-01","0.00","65","34","42"
"USC00042319","DEATH VALLEY, CA US","2018-01-02","0.00","61","38","46"
"USC00042319","DEATH VALLEY, CA US","2018-01-03","0.00","69","34","54"
"USC00042319","DEATH VALLEY, CA US","2018-01-04","0.00","69","39","48"
"USC00042319","DEATH VALLEY, CA US","2018-01-05","0.00","74","40","57"
"USC00042319","DEATH VALLEY, CA US","2018-01-06","0.00","74","47","65"
"USC00042319","DEATH VALLEY, CA US","2018-01-07","0.00","77","54","60"
"USC00042319","DEATH VALLEY, CA US","2018-01-08","0.07","62","52","52"
"USC00042319","DEATH VALLEY, CA US","2018-01-09","0.40","60","51","51"
"USC00042319","DEATH VALLEY, CA US","2018-01-10","0.00","64","49","50"
This is what I got:
import csv
import matplotlib.pyplot as plt
filename = 'death_valley_2018_simple.csv'
with open(filename) as f:
    csv_reader = csv.reader(f, delimiter=',')
    line_count = 0
    for row in f:
        x=(row[4], row[5])
        y=(row[2])
        print(row[2])
        print(row[4])
        print(row[5])
    plt.bar(x,y)
    plt.xticks(y)
    plt.ylabel('Dates')
    plt.title('Plot')
    plt.show()
The result is this "bar graph". I read other forum posts here, asked around on Discord, and read the documentation for the csv module. Maybe the answer is there, but if so, I don't understand it. I hope someone will explain this to me like I'm 5 years old.
CodePudding user response:
Personal Advice
Don't worry; I got you. But first, some advice. I remember when I posted my first question on this forum, I didn't know the proper way to ask a question (and my English wasn't that good at the time). The key to asking a good question is to search first (which you did), and then, if you don't find an answer, to ask your question as clearly and as briefly as possible. I'm not saying you should leave out information, but if you can ask your question in fewer words and it is still clear, you should. Why? Because the truth is that many people will skip a question if it is long. Just now, when I opened your question and saw all the lines, I was a little intimidated and wanted to skip it :D, but I solved it in a few minutes, and it wasn't scary at all. I am less concerned about writing long answers, because people who have the problem will read the answer if they have to. Please note that all of this is just my personal experience. You should also look for better beginner guides on asking questions on this forum and similar platforms. My suggestion: http://www.catb.org/~esr/faqs/smart-questions.html
Now the Answer
Instead of the csv library, which is part of Python's standard library (meaning it comes with Python and doesn't need to be installed separately), I prefer using pandas. pandas will make your life much easier. But you have to install it first:
pip install pandas
Now it's quite simple. Let's import everything and load the csv file.
import pandas as pd
import matplotlib.pyplot as plt
filename = 'death_valley_2018_simple.csv'
dataframe = pd.read_csv(filename)
dataframe contains your csv file's rows and columns. We need to convert the DATE column from str to datetime.
dataframe["DATE"] = pd.to_datetime(dataframe['DATE'], format="%Y-%m-%d")
So we are just telling pandas to convert the DATE column to datetime, and the format argument tells it where the year, month, and day are. %Y represents the year, then there is a dash, %m represents the month, then another dash, and %d represents the day. We use a capital Y because %y represents a year written with only its last two digits. In this case, since the format is pretty straightforward, pandas will understand how to convert this column to datetime even if we don't specify the format.
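To make the difference between %Y and %y concrete, here is a small sketch using the standard library's datetime.strptime, which uses the same format codes. The sample strings are made up for illustration and are not read from the file.

```python
from datetime import datetime

# %Y expects a four-digit year, as in the DATE column of the CSV file.
full_year = datetime.strptime("2018-01-03", "%Y-%m-%d")

# %y expects a two-digit year; "18" is interpreted as 2018.
short_year = datetime.strptime("18-01-03", "%y-%m-%d")

print(full_year)
print(full_year == short_year)

# Using the wrong code fails: %y cannot consume all four digits of "2018".
try:
    datetime.strptime("2018-01-03", "%y-%m-%d")
    wrong_code_failed = False
except ValueError as error:
    wrong_code_failed = True
    print("Could not parse:", error)
```

Both calls should give the same date, while the last one raises a ValueError because the format does not match the text.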
Now we just have to plot our diagram/graph:
fig, ax = plt.subplots()
ax.plot(dataframe["DATE"], dataframe["TMAX"])
ax.plot(dataframe["DATE"], dataframe["TMIN"])
fig.autofmt_xdate()
plt.show()  # keeps the window open; fig.show() can return immediately in a script
So after doing everything, your code should look like this:
import pandas as pd
import matplotlib.pyplot as plt
filename = 'death_valley_2018_simple.csv'
dataframe = pd.read_csv(filename)
dataframe["DATE"] = pd.to_datetime(dataframe['DATE'], format="%Y-%m-%d")
fig, ax = plt.subplots()
ax.plot(dataframe["DATE"], dataframe["TMAX"])
ax.plot(dataframe["DATE"], dataframe["TMIN"])
fig.autofmt_xdate()
plt.show()  # keeps the window open; fig.show() can return immediately in a script
Without pandas
You can do the exact same thing without the pandas library; you just have to do some things manually.
Importing the libraries (no pandas this time):
import csv
import datetime
import matplotlib.pyplot as plt
This will create a Python dictionary similar to a pandas data frame:
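filename = "death_valley_2018_simple.csv"
# newline="" is what the csv module documentation recommends when opening files
with open(filename, "r", newline="") as file:
    csv_reader = csv.reader(file)
    # The first row holds the column titles
    headers = next(csv_reader)
    data = {}
    for title in headers:
        data[title] = []
    # Every remaining row is a list of cells; store each cell under its column title
    for row in csv_reader:
        for i, title in enumerate(headers):
            data[title].append(row[i])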
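Note that the loop above iterates over csv_reader, while the code in the question iterated over the file object f. The sketch below shows the difference on two lines copied from the sample file; io.StringIO is only a stand-in for the open file, so the example runs without the real CSV.

```python
import csv
import io

# Two lines copied from the sample file; io.StringIO stands in for the open file.
sample = (
    '"STATION","NAME","DATE","PRCP","TMAX","TMIN","TOBS"\n'
    '"USC00042319","DEATH VALLEY, CA US","2018-01-01","0.00","65","34","42"\n'
)

# Looping over the file itself yields raw text lines (strings),
# so indexing gives single characters.
raw_lines = list(io.StringIO(sample))
raw_line = raw_lines[1]
print(raw_line[2])

# Looping over the csv reader yields one list of cell values per line,
# so indexing gives whole cells.
rows = list(csv.reader(io.StringIO(sample)))
row = rows[1]
print(row[2], row[4], row[5])
```

With the raw line, index 2 is just one character of the station code; with the parsed row, indexes 2, 4 and 5 are the DATE, TMAX and TMIN cells. The reader also keeps "DEATH VALLEY, CA US" as one cell even though it contains a comma.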
Same as before, we should convert the DATE column from str to datetime. We also have to convert the TMAX and TMIN columns to int; pandas did this automatically for us. The first loop takes care of the DATE column, and the second and third loops handle the TMAX and TMIN columns.
for i in range(len(data["DATE"])):
    data["DATE"][i] = datetime.datetime.strptime(data["DATE"][i], "%Y-%m-%d")
for i in range(len(data["TMAX"])):
    data["TMAX"][i] = int(data["TMAX"][i])
for i in range(len(data["TMIN"])):
    data["TMIN"][i] = int(data["TMIN"][i])
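If you prefer, the same conversions can be written with list comprehensions instead of index-based loops. This is only an alternative sketch; the small data dictionary below is a made-up stand-in for the one built from the CSV file.

```python
import datetime

# Hypothetical stand-in for the `data` dictionary built from the CSV file.
data = {
    "DATE": ["2018-01-01", "2018-01-02"],
    "TMAX": ["65", "61"],
    "TMIN": ["34", "38"],
}

# Replace each list of strings with a list of converted values.
data["DATE"] = [datetime.datetime.strptime(d, "%Y-%m-%d") for d in data["DATE"]]
for column in ("TMAX", "TMIN"):
    data[column] = [int(value) for value in data[column]]

print(data["DATE"][0], data["TMAX"], data["TMIN"])
```

Either way, int() raises a ValueError on an empty cell, so a file with missing temperatures would need extra handling.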
Now, we can plot our diagram/graph:
fig, ax = plt.subplots()
ax.plot(data["DATE"], data["TMAX"])
ax.plot(data["DATE"], data["TMIN"])
fig.autofmt_xdate()
plt.show()  # keeps the window open; fig.show() can return immediately in a script
So after doing everything, your code should look like this:
import csv
import datetime
import matplotlib.pyplot as plt
filename = "death_valley_2018_simple.csv"
# Read the file into a dictionary that maps each column title to a list of cells
with open(filename, "r", newline="") as file:
    csv_reader = csv.reader(file)
    headers = next(csv_reader)
    data = {}
    for title in headers:
        data[title] = []
    for row in csv_reader:
        for i, title in enumerate(headers):
            data[title].append(row[i])
# Convert the text cells to dates and integers
for i in range(len(data["DATE"])):
    data["DATE"][i] = datetime.datetime.strptime(data["DATE"][i], "%Y-%m-%d")
for i in range(len(data["TMAX"])):
    data["TMAX"][i] = int(data["TMAX"][i])
for i in range(len(data["TMIN"])):
    data["TMIN"][i] = int(data["TMIN"][i])
# Plot both temperature series against the dates
fig, ax = plt.subplots()
ax.plot(data["DATE"], data["TMAX"])
ax.plot(data["DATE"], data["TMIN"])
fig.autofmt_xdate()
plt.show()  # keeps the window open; fig.show() can return immediately in a script
Hard Coding, a Rookie Mistake
You said:
There is 365 lines in the file, so maybe it would be nice to limit the program to taking maybe the first 10 lines
Search for "hard coding" and read about it. Hard coding is a common rookie mistake; I've done it a thousand times, but you have to be aware of it. The code above doesn't care whether the csv file has 10 rows or 10,000 rows. Hard coding means embedding data in your program that doesn't need to be there, so that the program only works for certain inputs. You shouldn't write a program that only works if there are 10 rows or 100 rows; you should write it so that it works without knowing the number of rows in advance.
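If you still want to look at only the first few rows while experimenting, one option is to make the limit a parameter instead of building it into the code. The sketch below is one way to do that with the standard library; read_rows and limit are hypothetical names chosen for this example, and the sample text stands in for the real file.

```python
import csv
import io
from itertools import islice


def read_rows(file, limit=None):
    """Return the header and up to `limit` data rows; limit=None reads all rows."""
    reader = csv.reader(file)
    headers = next(reader)
    # islice stops after `limit` rows, or reads to the end when limit is None.
    rows = list(islice(reader, limit))
    return headers, rows


# Made-up sample with three data rows, standing in for the real file.
sample = (
    '"DATE","TMAX","TMIN"\n'
    '"2018-01-01","65","34"\n'
    '"2018-01-02","61","38"\n'
    '"2018-01-03","69","34"\n'
)

headers, first_two = read_rows(io.StringIO(sample), limit=2)
_, all_rows = read_rows(io.StringIO(sample))
print(len(first_two), len(all_rows))
```

The same function works for a file of any length, and the caller decides whether to limit it. With pandas, the nrows parameter of pd.read_csv plays a similar role.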