I have a bunch of different csv files in a directory, the first column contains names, the second and third columns contain numbers. I would like to combine all of the rows with same names, and then average the value in the 2nd column and add the value in the 3rd.
example csv 1:
John 5 5
Mark 10 5
Peter 20 5
example csv 2:
John 10 5
Mark 20 5
Peter 30 5
Expected combined file:
John 7.5 10
Mark 15 10
Peter 25 10
I have this so far, which combines all the files into one:
os.chdir(r'\Pathtofiles')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
I get this result:
John 5 5
Mark 10 5
Peter 20 5
John 10 5
Mark 20 5
Peter 30 5
but not sure how to get my expected result
CodePudding user response:
You created DataFrame
with all data so now you can use groupby()
to grup by name
and next you can use .agg()
to run different (aggregation) functions on differen columns in groups.
new_df = df.groupby('name').agg({'first':'mean', 'second':'sum'}).reset_index()
Full working example with data directly in code
import pandas as pd
df = pd.DataFrame([
['John', 5, 5],
['Mark', 10, 5],
['Peter', 20, 5],
['John', 10, 5],
['Mark', 20, 5],
['Peter', 30, 5]
], columns=['name', 'first', 'second'])
print(df)
new_df = df.groupby('name').agg({'first':'mean', 'second':'sum'}).reset_index()
print(new_df)
Result:
name first second
0 John 5 5
1 Mark 10 5
2 Peter 20 5
3 John 10 5
4 Mark 20 5
5 Peter 30 5
name first second
0 John 7.5 10
1 Mark 15.0 10
2 Peter 25.0 10
CodePudding user response:
Assuming that the first csv file is named D1 and the second one is D2, and assuming that the column names in the csv files are ['name','num1','num2']:
New_Date=D1.merge(D2,left_on=("name"),right_on=("name"),suffixes=("_left","_right"))
New_Date["Average"]=(New_Date["num1_left"] New_Date["num1_right"])/2
New_Date["Sum"]=New_Date["num2_left"] New_Date["num2_right"]
New_Date.drop(columns=['num1_left', 'num2_left', 'num2_right', 'num1_right'],inplace=True)
CodePudding user response:
Use csv
module which has csv.reader
and csv.writer
. os.listdir(path)
is used to iterate over files in a particular directory you also need to perform checks weather the file is a directory or not and read permission is granted or not and also ensure that the file is actually a csv
file.
person = dict()
is a dictionary
that holds name
(str) as key a truple of 2 values as value. We initialize the person dictionary while reading from csv
file.
import csv
import os
path = "/home/devp/Documents/code/so/python/files"
person = dict()
for file_name in os.listdir(path):
file_path = os.path.join(path, file_name)
print(file_path)
with open(file_path, 'r') as csvfile:
# creating a csv reader object
csvreader = csv.reader(csvfile)
rows = list()
# extracting each data row one by one
for row in csvreader:
if len(row) == 1:
row = row[0].split(' ')
if len(row) >2:
name = row[0]
print(f"{name=}")
n1 = float(row[1])
n2 = float(row[2])
if name in person:
person[name] = ((person[name][0] n1)/2, (person[name][1] n2)/2)
else:
person[name] = (n1, n2)
print(f"{person=}")
with open(os.path.join(path, 'all.csv'), 'w') as f:
csvwriter = csv.writer(f)
for name in person:
csvwriter.writerow([name, person[name][0], person[name][1]])
Output:
$ cat files/all.csv
John,7.5,5.0
Mark,15.0,5.0
Peter,25.0,5.0