I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year with information about the number and characteristics about the events in the given year.In this data set, each event is an occurrence of terrorism, and I want to summate the columns nkill, nhostage, and nwounded per year. This data set has 16 countries in West Africa and is looking at years 2000-2020 with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID | iyear | country_txt | nkill | nwounded | nhostages |
---|---|---|---|---|---|
10000102 | 2000 | Nigeria | 3 | 10 | 0 |
10000103 | 2000 | Mali | 1 | 3 | 15 |
10000103 | 2000 | Nigeria | 15 | 0 | 0 |
10000103 | 2001 | Benin | 1 | 0 | 0 |
10000103 | 2001 | Nigeria | 1 | 3 | 15 |
. . .
And I would like it to look like this:
country_txt | iyear | total_nkill | total_nwounded | total_nhostages |
---|---|---|---|---|
Nigeria | 2000 | 200 | 300 | 300 |
Nigeria | 2001 | 250 | 450 | 15 |
So basically, I want to add up the number of nkill, nwounded, and nhostages for each country-year group. So then I can have a list of all the countries and years with information about the number of deaths, injuries, and hostages taken per year in total. The countries also have an associated number if it is easier to write the code with a number instead of country_txt, the column with the country's number is just "country".
For a solution, I've been looking at the pandas "groupby" function, but I'm really new to coding so I'm having trouble understanding the documentation. It also seems like melt or pivot functions could be helpful.
CodePudding user response:
This simplified example shows how you could use groupby -
import pandas as pd
df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
'year': [2000, 2000, 2001, 2000],
'events1': [ 3, 4, 5, 2],
'events2': [1, 6, 3, 4]
})
df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()
print(df2)
which gives the total of each type of event by country and by year
events1 events2
country year
Mali 2000 2 4
Nigeria 2000 7 7
2001 5 3