Extracting several Tables from same spreadsheet and merging them to get a single continuous Data fra-CodePudding

I have a Spreadsheet of Time series Data

It has separate tables for each year with a single row gap in between. I want to have the Year from table header as part of the date column. So that I can plot charts and do simple comparisons of the data (YoY) etc

import pandas as pd

RigCountWorld_df = pd.read_excel(open('Worldwide Rig Count Nov 2022.xlsx', 'rb'),
              sheet_name='Worldwide_Rigcount',index_col=None, header=6)

RigCountWorld_df

The code I have is no where near helpful. Even the names of pandas operations I need to use will be helpful for me.

I need a continuous table with data from all years. It would make sense to have latest data at the very end.

Even transposing the tables separately and adding them as new columns would make sense (with the column headers containing Year-Month names.

CodePudding user response：

You can use the pandas melt() function to reshape the data from wide to long format, so that each row contains a single observation. From there, you can use the pandas concat() function to join the melted dataframes from each sheet together into a single dataframe.

To add the year from the sheet header to the date column, you can use the pandas assign() function to create a new column with the year from the sheet name. Then, use the pandas datetime.strptime() function to convert the existing date column into datetime objects with the year from the new column.

Here's an example of how to do it:

import pandas as pd

# create a list of sheet names
sheets = ['Worldwide_Rigcount_2020', 'Worldwide_Rigcount_2021', 'Worldwide_Rigcount_2022']

# create an empty list to store the dataframes
dfs = []

# loop through each sheet
for sheet in sheets:
    # read the sheet into a dataframe
    df = pd.read_excel('Worldwide Rig Count Nov 2022.xlsx', sheet_name=sheet, index_col=None, header=6)
    
    # extract the year from the sheet name
    year = int(sheet.split('_')[-1])
    
    # melt the dataframe from wide to long format
    df_melted = df.melt(id_vars=['Date'])
    
    # add the year from the sheet name to the dataframe
    df_melted = df_melted.assign(year=year)
    
    # convert the date column to datetime objects
    df_melted['Date'] = pd.to_datetime(df_melted['Date'], format='%b %Y', errors='coerce').dt.strftime('%Y-%m-%d')   '-'   df_melted['year'].astype(str)
    
    # append the dataframe to the list
    dfs.append(df_melted)
    
# concat the list of dataframes into a single dataframe
df_final = pd.concat(dfs)

# print the final dataframe
print(df_final)

CodePudding user response：

Here is a proposition with some of pandas built-in functions.

import pandas as pd

df = pd.read_excel("Worldwide Rig Count Nov 2022.xlsx",
                   sheet_name="Worldwide_Rigcount", header=None, usecols="B:K", skiprows=6)
    
df.dropna(how="all", inplace=True)
df.insert(0, "Year", np.where(df[10].eq("Total World"), df[1], None))
df["Year"].ffill(inplace=True)
df.drop_duplicates(subset= df.columns[2:], inplace=True)
df.columns = ["Year", "Month"]   df.loc[0, 2:].tolist()
df = df.loc[1:, :].reset_index(drop=True)

# Output :

print(df.sample(5).to_string())

     Year Month Latin America Europe Africa Middle East Asia Pacific Total Intl. Canada  U.S. Total World
613  1975   Mar           310    113    122         173          208         926    192  1651        2769
588  1977   Apr           324    135    165         185          167         976    129  1907        3012
596  1977   Dec           353    142    172         195          182        1044    259  2141        3444
221  2005   Jan           307     57     50         242          204         860    550  1255        2665
566  1979   Aug           440    149    199         144          219        1151    376  2222        3749

# Check :

48 years with 13 rows (12 months Average) for each year.

print(df.groupby("Year").size().value_counts())

13    48
dtype: int64

CodePudding user response：

Another solution validating year and month content (it is assumed, that the column names are in the first row of RigCountWorld_df):

df = RigCountWorld_df.copy()
first_col = 2  # First column with data
column_names = df.iloc[0, first_col:].to_list()
df["Year"] = df.iloc[:,[1]].where(df.iloc[:,1].astype(str).str.match(r"^20\d\d$"), None).ffill()
df["Month"] = df.iloc[:,[1]].where(df.iloc[:,1].astype(str).isin(("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")), None)
df = df[df['Month'].notna()]
df = df.iloc[:, first_col:].set_index(["Year", "Month"])
df.columns = column_names
df