Home > Back-end >  python: dataframe restructuring
python: dataframe restructuring

Time:04-18

please i need help on the restructuring of the dataframe below, i would like to group by batch of 7 day(week) so by summing the column values for each 7 days and the 7th days represents the final index. so for example for the first 7 lines I will have

         Uruguay Uzbekistan 

2020-01-28   0         0

sum of value in colone from 2020-01-22 to 2020-01-28

you can see the dataframe in the snippet. thank

<table border="0" >
  <thead>
    <tr style="text-align: right;">
      <th>Country/Region</th>
      <th>Uruguay</th>
      <th>Uzbekistan</th>
      <th>Venezuela</th>
      <th>Vietnam</th>
      <th>West Bank and Gaza</th>
      <th>Western Sahara</th>
      <th>Yemen</th>
      <th>Zambia</th>
      <th>Zimbabwe</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2020-01-22</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2020-01-23</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2020-01-24</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2020-01-25</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2020-01-26</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>2020-11-03</th>
      <td>2788</td>
      <td>65389</td>
      <td>88751</td>
      <td>1104</td>
      <td>48245</td>
      <td>9</td>
      <td>1976</td>
      <td>16112</td>
      <td>8188</td>
    </tr>
    <tr>
      <th>2020-11-04</th>
      <td>2831</td>
      <td>65578</td>
      <td>89140</td>
      <td>1104</td>
      <td>48728</td>
      <td>9</td>
      <td>1976</td>
      <td>16168</td>
      <td>8215</td>
    </tr>
    <tr>
      <th>2020-11-05</th>
      <td>2875</td>
      <td>65833</td>
      <td>89520</td>
      <td>1104</td>
      <td>49188</td>
      <td>9</td>
      <td>1976</td>
      <td>16176</td>
      <td>8223</td>
    </tr>
    <tr>
      <th>2020-11-06</th>
      <td>2914</td>
      <td>65912</td>
      <td>89920</td>
      <td>1105</td>
      <td>50048</td>
      <td>9</td>
      <td>1977</td>
      <td>16211</td>
      <td>8233</td>
    </tr>
    <tr>
      <th>2020-11-07</th>
      <td>2964</td>
      <td>66131</td>
      <td>90326</td>
      <td>1105</td>
      <td>50487</td>
      <td>9</td>
      <td>1977</td>
      <td>16222</td>
      <td>8246</td>
    </tr>
  </tbody>
</table>

CodePudding user response:

This should give you some place to start. Please come back with a solid question in mind if you run into any specific issues:

pip install lxml

import pandas as pd

df = pd.read_html(your_html)[0]
print(df)

Output:

   Country/Region Uruguay Uzbekistan Venezuela Vietnam West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
0      2020-01-22       0          0         0       0                  0              0     0      0        0
1      2020-01-23       0          0         0       0                  0              0     0      0        0
2      2020-01-24       0          0         0       0                  0              0     0      0        0
3      2020-01-25       0          0         0       0                  0              0     0      0        0
4      2020-01-26       0          0         0       0                  0              0     0      0        0
5             ...     ...        ...       ...     ...                ...            ...   ...    ...      ...
6      2020-11-03    2788      65389     88751    1104              48245              9  1976  16112     8188
7      2020-11-04    2831      65578     89140    1104              48728              9  1976  16168     8215
8      2020-11-05    2875      65833     89520    1104              49188              9  1976  16176     8223
9      2020-11-06    2914      65912     89920    1105              50048              9  1977  16211     8233
10     2020-11-07    2964      66131     90326    1105              50487              9  1977  16222     8246

CodePudding user response:

If your each 7 days are all in the same week, you can try sum after group by then set the index by last row in each group.

g = df.groupby(df['Country/Region'].dt.isocalendar().week)
out = g.sum().set_index(g.tail(1)['Country/Region'])
print(out)

                Uruguay  Uzbekistan  Venezuela  Vietnam  West Bank and Gaza  Western Sahara  Yemen  Zambia  Zimbabwe
Country/Region
2020-01-26            0           0          0        0                   0               0      0       0         0
2020-11-07        14372      328843     447657     5522              246696              45   9882   80889     41105
  • Related