How to convert "event" data into country-year data by summating information in columns? Us-CodePudding

I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year with information about the number and characteristics about the events in the given year.In this data set, each event is an occurrence of terrorism, and I want to summate the columns nkill, nhostage, and nwounded per year. This data set has 16 countries in West Africa and is looking at years 2000-2020 with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).

Right now my data looks like this (there are a ton of other columns but they aren't important for this):

eventID	iyear	country_txt	nkill	nwounded	nhostages
10000102	2000	Nigeria	3	10	0
10000103	2000	Mali	1	3	15
10000103	2000	Nigeria	15	0	0
10000103	2001	Benin	1	0	0
10000103	2001	Nigeria	1	3	15

. . .

And I would like it to look like this:

country_txt	iyear	total_nkill	total_nwounded	total_nhostages
Nigeria	2000	200	300	300
Nigeria	2001	250	450	15

So basically, I want to add up the number of nkill, nwounded, and nhostages for each country-year group. So then I can have a list of all the countries and years with information about the number of deaths, injuries, and hostages taken per year in total. The countries also have an associated number if it is easier to write the code with a number instead of country_txt, the column with the country's number is just "country".

For a solution, I've been looking at the pandas "groupby" function, but I'm really new to coding so I'm having trouble understanding the documentation. It also seems like melt or pivot functions could be helpful.

CodePudding user response：

This simplified example shows how you could use groupby -

import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
        'year': [2000, 2000, 2001, 2000],
        'events1': [ 3, 4, 5, 2],
        'events2': [1, 6, 3, 4]
       })


df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()

print(df2)

which gives the total of each type of event by country and by year

              events1  events2
country year                  
Mali    2000        2        4
Nigeria 2000        7        7
        2001        5        3