Merge all .csv files in folder by a common field present in each file


So, I have a directory with .csv files. For example:

a.csv

id,name
1,john
2,mary
3,alex

b.csv

id,birth
1,01.01.2001
2,05.06.1990

c.csv

id,death
2,01.02.2020
1,-

The result should be one dict where the key is the id (an int) and the value is a dict of all the different values across the files (a dict of dicts). Something like this:

{
        1: {"id": 1, "name": "john", "birth": "01.01.2001", "death": -},
        2: {"id": 2, "name": "mary", "birth": "05.06.1990",
            "death": "01.02.2020"},
        3: {"id": 3, "name": "alex", "birth": None, "death": None},
}

So far I've tried to merge all files into one dataframe:

from pathlib import Path
import pandas as pd

files = Path(r'path').rglob('*.csv')

# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]

# lump into one table
all_csvs = pd.concat(all_csvs, axis=1)

But as a result I get a dataframe where 'id' is repeated in three columns.

Any help will be appreciated!

CodePudding user response:

You want merge, not concat. Since you need to merge multiple DataFrames on the same key, you can fold them together with functools.reduce. Note that "-" is treated as a missing value here, which is why it shows up as NaN/None below:

import os
from functools import reduce

import pandas as pd

# Read every .csv in the current working directory; treat "-" as a missing value
all_csvs = [pd.read_csv(file, na_values="-") for file in os.listdir() if file.endswith(".csv")]

# Fold the list into one frame via repeated pairwise outer merges on "id"
df = reduce(lambda left, right: pd.merge(left, right, how="outer", on="id"), all_csvs)

>>> df

   id  name       birth       death
0   1  john  01.01.2001         NaN
1   2  mary  05.06.1990  01.02.2020
2   3  alex         NaN         NaN
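
If the reduce call looks opaque: for three files it simply unrolls into two pairwise outer merges. A minimal sketch, with hypothetical frames a_df, b_df and c_df standing in for the three files:

import pandas as pd

# Hypothetical frames mirroring a.csv, b.csv and c.csv
a_df = pd.DataFrame({"id": [1, 2, 3], "name": ["john", "mary", "alex"]})
b_df = pd.DataFrame({"id": [1, 2], "birth": ["01.01.2001", "05.06.1990"]})
c_df = pd.DataFrame({"id": [2, 1], "death": ["01.02.2020", "-"]})

# reduce(...) over [a_df, b_df, c_df] is equivalent to:
merged = pd.merge(pd.merge(a_df, b_df, how="outer", on="id"),
                  c_df, how="outer", on="id")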

# For a dict-of-dicts output: replace NaN with None, then index by "id"
# (drop=False keeps "id" as a column, so it also appears inside each row dict)
my_dict = df.where(df.notnull(), None).set_index("id", drop=False).to_dict(orient="index")
>>> my_dict

{1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': None},
 2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
 3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}}

CodePudding user response:

You can even do it without pandas if you're so inclined. First, create a defaultdict that'll hold all your csv data. Let the default element of this dict be a dictionary representing a "default" person, i.e. with all keys having a value of None.

import collections

def default_person():
    return {'id': None, 'name': None, 'birth': None, 'death': None}
all_csvs = collections.defaultdict(default_person)

The keys in this dict will be the id field, and values will be the dict containing all the information you want.
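
One side effect worth knowing: merely looking up an id inserts the default entry, which is exactly what lets the update loop below fill in ids that appear in only some of the files. A quick illustration (the id 99 is arbitrary):

person = all_csvs[99]  # key 99 did not exist, so it is created on access
print(person)          # {'id': None, 'name': None, 'birth': None, 'death': None}
del all_csvs[99]       # remove the demo entry again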

Next, read each file using csv.DictReader. DictReader reads each row of your csv file as a dictionary, with keys coming from the file's header. Then for each row in each file, update the values of the dictionary at the correct id in the defaultdict we just created:

import csv
from pathlib import Path

files = Path(r'path').rglob('*.csv')

for file in files:
    with open(file, "r") as f_in:
        reader = csv.DictReader(f_in)
        for row_dict in reader:
            p_id = row_dict['id'] = int(row_dict['id']) # Convert `id` to integer
            all_csvs[p_id].update(row_dict)

Now, all_csvs looks like this:

defaultdict(<function __main__.default_person()>,
            {
             1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': '-'},
             2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
             3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}
            })
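
Since all_csvs is a defaultdict, any later lookup of a missing id would silently insert another default person. If that is undesirable, one option is to convert it to a plain dict once all files have been read:

all_csvs = dict(all_csvs)  # missing ids now raise KeyError instead of inserting defaults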