So, I have a directory with .csv files. For example:
a.csv
id,name
1,john
2,mary
3,alex
b.csv
id,birth
1,01.01.2001
2,05.06.1990
c.csv
id,death
2,01.02.2020
1,-
The result should be one dict where the key is the id (int) and the value is a dict of all the different values across the files (a dict of dicts). Something like this:
{
1: {"id": 1, "name": "john", "birth": "01.01.2001", "death": -},
2: {"id": 2, "name": "mary", "birth": "05.06.1990",
"death": "01.02.2020"},
3: {"id": 3, "name": "alex", "birth": None, "death": None},
}
So far I've tried to merge all files into one dataframe:
from pathlib import Path
import pandas as pd
files = Path(r'path').rglob('*.csv')
# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]
# lump into one table
all_csvs = pd.concat(all_csvs, axis=1)
But as a result I get a dataframe where 'id' is repeated in three columns.
Any help will be appreciated!
CodePudding user response:
You want merge, not concat. Since you need to merge multiple DataFrames, you can do:
import os
from functools import reduce
import pandas as pd

# Read every csv in the current directory into its own DataFrame
all_csvs = [pd.read_csv(file) for file in os.listdir() if file.endswith(".csv")]
# Outer-merge them all on the shared "id" column so every id from every file is kept
df = reduce(lambda left, right: pd.merge(left, right, how="outer", on="id"), all_csvs)
>>> df
id name birth death
0 1 john 01.01.2001 NaN
1 2 mary 05.06.1990 01.02.2020
2 3 alex NaN NaN
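If you'd rather keep the recursive pathlib lookup from your original attempt instead of os.listdir, the same reduce pattern works; a minimal sketch, assuming 'path' is the directory from the question:
from functools import reduce
from pathlib import Path
import pandas as pd

# Recurse through the directory like the original attempt, one DataFrame per file
files = Path(r'path').rglob('*.csv')
all_csvs = [pd.read_csv(f) for f in files]
# Outer-merge on the shared "id" column
df = reduce(lambda left, right: pd.merge(left, right, how="outer", on="id"), all_csvs)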
# For the dictionary output, replace NaN with None
my_dict = df.where(df.notnull(), None).set_index("id", drop=False).to_dict(orient="index")
>>> my_dict
{1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': None},
2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}}
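One detail worth checking: pandas does not treat "-" as a missing value by default, so depending on your data it may survive as the literal string "-" rather than NaN. If you want it treated as missing (as in the dict above), tell read_csv explicitly; a minimal tweak of the read step:
# Treat "-" as NA in addition to the pandas defaults
all_csvs = [pd.read_csv(file, na_values=["-"]) for file in os.listdir() if file.endswith(".csv")]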
CodePudding user response:
You can even do it without pandas if you're so inclined. First, create a defaultdict that will hold all your csv data. Let the default element of this dict be a dictionary representing a "default" person, i.e. with all keys set to None.
import collections

# Default entry: a "person" with every expected field set to None
def default_person():
    return {'id': None, 'name': None, 'birth': None, 'death': None}

all_csvs = collections.defaultdict(default_person)
The keys in this dict will be the id field, and the values will be the dict containing all the information you want.
Next, read each file using csv.DictReader. DictReader reads each row of your csv file as a dictionary, with keys coming from the file's header. Then, for each row in each file, update the values of the dictionary at the correct id in the defaultdict we just created:
import csv
from pathlib import Path

files = Path(r'path').rglob('*.csv')
for file in files:
    with open(file, "r") as f_in:
        reader = csv.DictReader(f_in)
        for row_dict in reader:
            p_id = row_dict['id'] = int(row_dict['id'])  # Convert `id` to integer
            all_csvs[p_id].update(row_dict)
Now, all_csvs looks like so:
defaultdict(<function __main__.default_person()>,
{
1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': '-'},
2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}
})
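If you want a plain dict at the end, as in the desired output, you can convert the defaultdict once you're done reading. Normalizing the "-" placeholder to None is an optional extra choice here, not something the code above does; a minimal sketch:
# Plain dict with int keys and the same inner dicts
result = dict(all_csvs)

# Optional: treat the "-" placeholder from c.csv as missing instead of a literal string
for person in result.values():
    if person.get("death") == "-":
        person["death"] = None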