Home > OS >  How to convert "event" data into country-year data by summating information in columns? Us
How to convert "event" data into country-year data by summating information in columns? Us

Time:01-27

I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year with information about the number and characteristics about the events in the given year.In this data set, each event is an occurrence of terrorism, and I want to summate the columns nkill, nhostage, and nwounded per year. This data set has 16 countries in West Africa and is looking at years 2000-2020 with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).

Right now my data looks like this (there are a ton of other columns but they aren't important for this):

eventID iyear country_txt nkill nwounded nhostages
10000102 2000 Nigeria 3 10 0
10000103 2000 Mali 1 3 15
10000103 2000 Nigeria 15 0 0
10000103 2001 Benin 1 0 0
10000103 2001 Nigeria 1 3 15

. . .

And I would like it to look like this:

country_txt iyear total_nkill total_nwounded total_nhostages
Nigeria 2000 200 300 300
Nigeria 2001 250 450 15

So basically, I want to add up the number of nkill, nwounded, and nhostages for each country-year group. So then I can have a list of all the countries and years with information about the number of deaths, injuries, and hostages taken per year in total. The countries also have an associated number if it is easier to write the code with a number instead of country_txt, the column with the country's number is just "country".

For a solution, I've been looking at the pandas "groupby" function, but I'm really new to coding so I'm having trouble understanding the documentation. It also seems like melt or pivot functions could be helpful.

CodePudding user response:

This simplified example shows how you could use groupby -

import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
        'year': [2000, 2000, 2001, 2000],
        'events1': [ 3, 4, 5, 2],
        'events2': [1, 6, 3, 4]
       })


df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()

print(df2)

which gives the total of each type of event by country and by year

              events1  events2
country year                  
Mali    2000        2        4
Nigeria 2000        7        7
        2001        5        3
  • Related