Home > Mobile >  Pandas removing rows with incomplete time series in panel data
Pandas removing rows with incomplete time series in panel data

Time:11-05

I have a dataframe along the lines of the below:

    Country1  Country2    Year
1    Italy       Greece   2000
2    Italy       Greece   2001  
3    Italy       Greece   2002
4    Germany      Italy   2000
5    Germany      Italy   2002
6    Mexico      Canada   2000
7    Mexico      Canada   2001
8    Mexico      Canada   2002
9    US          France   2000
10   US          France   2001
11   Greece      Italy    2000
12   Greece      Italy    2001

I want to keep only the rows in which there are observations for the entire time series (2000-2002). So, the end result would be:

    Country1  Country2    Year
1    Italy       Greece   2000
2    Italy       Greece   2001  
3    Italy       Greece   2002
4    Mexico      Canada   2000
5    Mexico      Canada   2001
6    Mexico      Canada   2002

CodePudding user response:

One idea is reshape by crosstab and test if rows has not 0 values by DataFrame.ne with DataFrame.all, convert index to DataFrame by MultiIndex.to_frame and last get filtered rows in DataFrame.merge:

df1 = pd.crosstab([df['Country1'], df['Country2']], df['Year'])

df = df.merge(df1.index[df1.ne(0).all(axis=1)].to_frame(index=False))
print (df)
  Country1 Country2  Year
0    Italy   Greece  2000
1    Italy   Greece  2001
2    Italy   Greece  2002
3   Mexico   Canada  2000
4   Mexico   Canada  2001
5   Mexico   Canada  2002

Or if need test some specific range is possible compare sets in GroupBy.transform:

r = set(range(2000, 2003))

df = df[df.groupby(['Country1', 'Country2'])['Year'].transform(lambda x: set(x) == r)]
print (df)
  Country1 Country2  Year
1    Italy   Greece  2000
2    Italy   Greece  2001
3    Italy   Greece  2002
6   Mexico   Canada  2000
7   Mexico   Canada  2001
8   Mexico   Canada  2002

CodePudding user response:

One option is to pivot the data, drop null rows and reshape back; this only works if the combination of Country* and Year is unique (in the sample data it is ):

(df.assign(dummy = 1)
   .pivot(('Country1', 'Country2'), 'Year')
   .dropna()
   .stack()
   .drop(columns='dummy')
   .reset_index()
)
 
  Country1 Country2  Year
0    Italy   Greece  2000
1    Italy   Greece  2001
2    Italy   Greece  2002
3   Mexico   Canada  2000
4   Mexico   Canada  2001
5   Mexico   Canada  2002
  • Related