Getting the sum of a csv column without pandas in python

Time:08-18

I have a csv file passed into a function as a string:

csv_input = """
            quiz_date,location,size
            2022-01-01,london_uk,134
            2022-01-02,edingburgh_uk,65
            2022-01-01,madrid_es,124
            2022-01-02,london_uk,125
            2022-01-01,edinburgh_uk,89
            2022-01-02,madric_es,143
            2022-01-02,london_uk,352
            2022-01-01,edinburgh_uk,125
            2022-01-01,madrid_es,431
            2022-01-02,london_uk,151"""

I want to print the sum of how many people were surveyed in each city by date, so something like:

Date.         City.       Pop-Surveyed
2022-01-01.   London.     134
2022-01-01.   Edinburgh.  214
2022-01-01.   Madrid.     555
2022-01-02.   London.     628
2022-01-02.   Edinburgh.  65
2022-01-02.   Madrid.     143

As I can't import pandas on my machine (I can't install it without internet access), I thought I could use a defaultdict to store the total for each city by date:

from collections import defaultdict

survery_data = csv_input.split()[1:]
survery_data = [survey.split(',') for survey in survery_data]

survey_sum = defaultdict(dict)

for survey in survery_data:
    date = survey[0]
    city = survey[1].split("_")[0]
    quantity = survey[-1]

    survey_sum[date][city] += quantity

print(survey_sum)

But doing this raises a KeyError:

KeyError: 'london'

When I was hoping to have a defaultdict of

{'2022-01-01': {'london': 134, 'edinburgh': 214, 'madrid': 555},
 '2022-01-02': {'london': 628, 'edinburgh': 65, 'madrid': 143}}

Is there a way to create a defaultdict with a structure like this, so that I can then iterate over it and print each row as shown above?

CodePudding user response:

Try:

csv_input = """\
            quiz_date,location,size
            2022-01-01,london_uk,134
            2022-01-02,edingburgh_uk,65
            2022-01-01,madrid_es,124
            2022-01-02,london_uk,125
            2022-01-01,edinburgh_uk,89
            2022-01-02,madric_es,143
            2022-01-02,london_uk,352
            2022-01-01,edinburgh_uk,125
            2022-01-01,madrid_es,431
            2022-01-02,london_uk,151"""


header, *rows = (
    tuple(map(str.strip, line.split(",")))
    for line in map(str.strip, csv_input.splitlines())
)

tmp = {}
for date, city, size in rows:
    key = (date, city.split("_")[0])
    tmp[key] = tmp.get(key, 0) + int(size)

out = {}
for (date, city), size in tmp.items():
    out.setdefault(date, []).append({city: size})

print(out)

Prints:

{
    "2022-01-01": [{"london": 134}, {"madrid": 555}, {"edinburgh": 214}],
    "2022-01-02": [{"edingburgh": 65}, {"london": 628}, {"madric": 143}],
}
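A variation on the same idea, offered only as a sketch: the standard-library csv module can do the parsing, and a Counter keyed by (date, city) tuples can do the summing. The name totals is an illustrative choice, and the sample data is copied from the question unchanged, including its misspelled city names, which therefore show up as separate cities.

```python
import csv
from collections import Counter

csv_input = """
            quiz_date,location,size
            2022-01-01,london_uk,134
            2022-01-02,edingburgh_uk,65
            2022-01-01,madrid_es,124
            2022-01-02,london_uk,125
            2022-01-01,edinburgh_uk,89
            2022-01-02,madric_es,143
            2022-01-02,london_uk,352
            2022-01-01,edinburgh_uk,125
            2022-01-01,madrid_es,431
            2022-01-02,london_uk,151"""

# Strip the indentation from every line so the header names come out clean.
lines = [line.strip() for line in csv_input.strip().splitlines()]

# Counter returns 0 for a missing key, so += works on first sight of a key.
totals = Counter()
for row in csv.DictReader(lines):
    city = row["location"].split("_")[0]
    totals[(row["quiz_date"], city)] += int(row["size"])

# Sorting the (date, city) keys groups the rows by date.
for (date, city), total in sorted(totals.items()):
    print(date, city.capitalize(), total, sep="\t")
```

Keeping the totals flat, with one tuple key per row of output, avoids the second regrouping loop when all that is needed is the printed table.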

CodePudding user response:

Changing

survey_sum = defaultdict(dict)

to

survey_sum = defaultdict(lambda: defaultdict(int))

returns

defaultdict(<function survey_sum.<locals>.<lambda> at 0x100edd8b0>, {'2022-01-01': defaultdict(<class 'int'>, {'london': 134, 'madrid': 555, 'edinburgh': 214}), '2022-01-02': defaultdict(<class 'int'>, {'edingburgh': 65, 'london': 628, 'madrid': 143})})

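The result can then be iterated over to build the printed table. Because the inner defaultdict(int) supplies 0 for a missing city, quantity also has to be converted with int(survey[-1]) before it is added; otherwise adding a string to 0 raises a TypeError.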
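Putting this together, a minimal sketch of the nested-defaultdict approach might look like the following. The function name sum_by_date_and_city and the output formatting are illustrative assumptions, not part of the question; the sample data is copied from the question as-is, so its misspelled city names are counted as separate cities.

```python
from collections import defaultdict

csv_input = """
            quiz_date,location,size
            2022-01-01,london_uk,134
            2022-01-02,edingburgh_uk,65
            2022-01-01,madrid_es,124
            2022-01-02,london_uk,125
            2022-01-01,edinburgh_uk,89
            2022-01-02,madric_es,143
            2022-01-02,london_uk,352
            2022-01-01,edinburgh_uk,125
            2022-01-01,madrid_es,431
            2022-01-02,london_uk,151"""


def sum_by_date_and_city(text):
    """Return {date: {city: total}} for CSV text that starts with a header row."""
    # split() breaks on any whitespace, so each row becomes one token;
    # [1:] skips the header row.
    rows = [line.split(",") for line in text.split()[1:]]

    # Missing dates get a fresh inner defaultdict; missing cities start at 0.
    totals = defaultdict(lambda: defaultdict(int))
    for date, location, size in rows:
        city = location.split("_")[0]
        # The CSV values are strings, so convert before adding to the int total.
        totals[date][city] += int(size)
    return totals


survey_sum = sum_by_date_and_city(csv_input)

# Print one row per (date, city) pair, with dates in ascending order.
for date in sorted(survey_sum):
    for city, total in survey_sum[date].items():
        print(f"{date}  {city.capitalize():<12}{total}")

# Optional: convert to plain dicts for a cleaner repr when printing the structure.
plain = {date: dict(cities) for date, cities in survey_sum.items()}
```

The plain-dict conversion at the end is only cosmetic; the defaultdict itself can be iterated in exactly the same way.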
