Home > Software engineering >  Pandas Pivot/Reshape/... GroupyBy rows to Columns [duplicate]
Pandas Pivot/Reshape/... GroupyBy rows to Columns [duplicate]

Time:10-03

[First off, these are my first "real" experiments with pandas, so the terminology in this question might be off.]

I am working with the GHCN weather data (https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/). The data consists of a CSV file that I load into a data frame (simplified here):

data = {
  'station': {0: 'AE000041196', 1: 'AE000041196', 2: 'AE000041196', 3: 'AEM00041194', 4: 'AEM00041194', 5: 'AEM00041194', 6: 'AEM00041194', 7: 'AEM00041217', 8: 'AEM00041217', 9: 'AEM00041217'},
  'date': {0: 20210101, 1: 20210101, 2: 20210102, 3: 20210103, 4: 20210101, 5: 20210101, 6: 20210101, 7: 20210101, 8: 20210101, 9: 20210101},
  'measurement': {0: 'TMAX', 1: 'PRCP', 2: 'TAVG', 3: 'TMAX', 4: 'TMIN', 5: 'PRCP', 6: 'TAVG', 7: 'TMAX', 8: 'TMIN', 9: 'TAVG'},
  'value': {0: 278, 1: 0, 2: 214, 3: 266, 4: 178, 5: 0, 6: 217, 7: 262, 8: 155, 9: 202}
}
df = pd.DataFrame(data)

Each row specifies a station and a date, as well as a measurement type and its value. In the actual data, there are around 50 different measurement types. I need to turn this data into a more "traditional" format where each column is one measurement type, and each row contains the data for a given station and date.

So far I can only come up with a manual approach that is terribly slow:

result = pd.DataFrame()
for key, item in df.groupby(['station', 'date']):
    group = input_df.get_group(key)
    vals = {}
    for idx, row in group.iterrows():
        vals["station"] = row[0]
        vals["date"] = row[1]
        vals[row[2]] = row[3]
    result = result.append(vals, ignore_index=True)

It works, but surely there must be a more "pandas" way of doing this, and ideally also allowing parallel processing using multiple CPU cores?

CodePudding user response:

You have the right terminology, and looking for pivot in the docs would probably have led you straight to using:

>>> df.pivot(index=['station', 'date'], columns='measurement', values='value')
measurement           PRCP   TAVG   TMAX   TMIN
station     date                               
AE000041196 20210101   0.0    NaN  278.0    NaN
            20210102   NaN  214.0    NaN    NaN
AEM00041194 20210101   0.0  217.0    NaN  178.0
            20210103   NaN    NaN  266.0    NaN
AEM00041217 20210101   NaN  202.0  262.0  155.0

You can possibly add .reset_index() at the end to transform the index columns back to normal columns:

>>> _.reset_index()
measurement      station      date  PRCP   TAVG   TMAX   TMIN
0            AE000041196  20210101   0.0    NaN  278.0    NaN
1            AE000041196  20210102   NaN  214.0    NaN    NaN
2            AEM00041194  20210101   0.0  217.0    NaN  178.0
3            AEM00041194  20210103   NaN    NaN  266.0    NaN
4            AEM00041217  20210101   NaN  202.0  262.0  155.0

There’s also a reshaping user guide in the documentation that’s very well done.

  • Related