I have a csv file as following:
0 2 1 1 464 385 171 0:44:4
1 1 2 26 254 444 525 0:56:2
2 3 1 90 525 785 522 0:52:8
3 8 2 3 525 233 555 0:52:8
4 7 1 10 525 433 522 1:52:8
5 9 2 55 525 555 522 1:52:8
6 6 3 3 392 111 232 1:43:4
7 1 4 23 322 191 112 1:43:4
8 1 3 30 322 191 112 1:43:4
9 1 5 2 322 191 112 1:43:4
10 1 3 22 322 191 112 1:43:4
11 1 4 44 322 191 112 1:43:4
12 1 5 1 322 191 112 1:43:4
12 1 4 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
12 1 6 1 322 191 112 1:43:4
12 1 5 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
.
.
Third column has numbers between 1 to 6. I want to read information of columns #4 and #5 for all the rows that have number 1 to 6 in the third columns and find the maximum and minmum amount for each row that has number 1 to 6 seprately. For example output like this:
Mix for row with 1: 1
Max for row with 1: 90
Min for row with 2: 3
Max for row with 2: 55
and so on
I can plot the figure using following code. How to get summary statistics by group? What I'm looking for is to get multiple statistics for the same group like mean, min, max, number of each group in one call, is that doable?
import matplotlib.pyplot as plt
import csv
x= []
y= []
with open('mydata.csv','r') as csvfile:
ap = csv.reader(csvfile, delimiter=',')
for row in ap:
x.append(int(row[2]))
y.append(int(row[7]))
plt.scatter(x, y, color = 'g',s = 4, marker='o')
plt.show()
CodePudding user response:
One easy way would be to use Pandas with read_csv()
, .groupby()
and .agg()
:
import pandas as pd
df = pd.read_csv("mydata.csv", header=None)
def min_max_avg(col):
return (col.min() col.max()) / 2
result = df[[2, 3, 4]].groupby(2).agg(["min", "max", "mean", min_max_avg])
Result:
3 4
min max mean min_max_avg min max mean min_max_avg
2
1 1 90 33.666667 45.5 464 525 504.666667 494.5
2 3 55 28.000000 29.0 254 525 434.666667 389.5
3 3 30 18.333333 16.5 322 392 345.333333 357.0
4 3 44 23.333333 23.5 322 322 322.000000 322.0
5 1 3 2.000000 2.0 322 322 322.000000 322.0
6 1 33 22.333333 17.0 322 322 322.000000 322.0
If you don't like that you could do it with pure Python, it's only a little bit more work:
import csv
data = {}
with open("mydata.csv", "r") as file:
for row in csv.reader(file):
dct = data.setdefault(row[2], {})
for col in (3, 4):
dct.setdefault(col, []).append(row[col])
min_str = "Min for group {} - column {}: {}"
max_str = "Max for group {} - column {}: {}"
for row in data:
for col in (3, 4):
print(min_str.format(row, col, min(data[row][col])))
print(max_str.format(row, col, max(data[row][col])))
Result:
Min for group 1 - column 3: 1
Max for group 1 - column 3: 90
Min for group 1 - column 4: 464
Max for group 1 - column 4: 525
Min for group 2 - column 3: 26
Max for group 2 - column 3: 55
Min for group 2 - column 4: 254
Max for group 2 - column 4: 525
Min for group 3 - column 3: 22
Max for group 3 - column 3: 30
Min for group 3 - column 4: 322
Max for group 3 - column 4: 392
...
mydata.csv
:
0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
6,6,3,3,392,111,232,1:43:4
7,1,4,23,322,191,112,1:43:4
8,1,3,30,322,191,112,1:43:4
9,1,5,2,322,191,112,1:43:4
10,1,3,22,322,191,112,1:43:4
11,1,4,44,322,191,112,1:43:4
12,1,5,1,322,191,112,1:43:4
12,1,4,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4
12,1,6,1,322,191,112,1:43:4
12,1,5,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4