Home > Software engineering >  Combine multiple csv files and add/average columns based on matching data
Combine multiple csv files and add/average columns based on matching data

Time:05-02

I have a bunch of different csv files in a directory, the first column contains names, the second and third columns contain numbers. I would like to combine all of the rows with same names, and then average the value in the 2nd column and add the value in the 3rd.

example csv 1:

John 5 5
Mark 10 5
Peter 20 5

example csv 2:

John 10 5
Mark 20 5
Peter 30 5

Expected combined file:

John 7.5 10
Mark 15 10
Peter 25 10

I have this so far, which combines all the files into one:

    os.chdir(r'\Pathtofiles')
    extension = 'csv'
    all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])

I get this result:

John 5 5
Mark 10 5
Peter 20 5
John 10 5
Mark 20 5
Peter 30 5

but not sure how to get my expected result

CodePudding user response:

You created DataFrame with all data so now you can use groupby() to grup by name and next you can use .agg() to run different (aggregation) functions on differen columns in groups.

new_df = df.groupby('name').agg({'first':'mean', 'second':'sum'}).reset_index()

Full working example with data directly in code

import pandas as pd

df = pd.DataFrame([
    ['John', 5, 5],
    ['Mark', 10, 5],
    ['Peter', 20, 5],
    ['John', 10, 5],
    ['Mark', 20, 5],
    ['Peter', 30, 5]
], columns=['name', 'first', 'second'])
print(df)

new_df = df.groupby('name').agg({'first':'mean', 'second':'sum'}).reset_index()

print(new_df)

Result:

    name  first  second
0   John      5       5
1   Mark     10       5
2  Peter     20       5
3   John     10       5
4   Mark     20       5
5  Peter     30       5

    name  first  second
0   John    7.5      10
1   Mark   15.0      10
2  Peter   25.0      10

CodePudding user response:

Assuming that the first csv file is named D1 and the second one is D2, and assuming that the column names in the csv files are ['name','num1','num2']:

New_Date=D1.merge(D2,left_on=("name"),right_on=("name"),suffixes=("_left","_right"))
New_Date["Average"]=(New_Date["num1_left"] New_Date["num1_right"])/2
New_Date["Sum"]=New_Date["num2_left"] New_Date["num2_right"]
New_Date.drop(columns=['num1_left', 'num2_left', 'num2_right', 'num1_right'],inplace=True)

CodePudding user response:

Use csv module which has csv.reader and csv.writer. os.listdir(path) is used to iterate over files in a particular directory you also need to perform checks weather the file is a directory or not and read permission is granted or not and also ensure that the file is actually a csv file.

person = dict() is a dictionary that holds name(str) as key a truple of 2 values as value. We initialize the person dictionary while reading from csv file.

import csv
import os

path = "/home/devp/Documents/code/so/python/files"

person = dict()

for file_name in os.listdir(path):
    file_path = os.path.join(path, file_name)
    print(file_path)

    with open(file_path, 'r') as csvfile:
    # creating a csv reader object
        csvreader = csv.reader(csvfile)

        rows = list()
        # extracting each data row one by one

        for row in csvreader:
            if len(row) == 1:
                row = row[0].split(' ')
                if len(row) >2:
                    name = row[0]
                    print(f"{name=}")
                    n1 = float(row[1])
                    n2 = float(row[2])
                    if name in person:
                        person[name] = ((person[name][0] n1)/2, (person[name][1] n2)/2)
                    else:
                        person[name] = (n1, n2) 

print(f"{person=}")

with open(os.path.join(path, 'all.csv'), 'w') as f:
    csvwriter = csv.writer(f)
    for name in person:
        csvwriter.writerow([name, person[name][0], person[name][1]])

Output:

$ cat files/all.csv 
John,7.5,5.0
Mark,15.0,5.0
Peter,25.0,5.0
  • Related