I'm scraping data from a private API, I've managed to gather all the responses I need and then transform the json object into a dataframe, let's call it raw_df. What I'd like to do now is to convert this dataframe into another dataframe, called df, with a specific layout so it can be used in another area that uses excel only. Raw_df consists of 32 columns of which only 12 are useful to me, the desired output in df should be those 12 columns with changed headers, as well as changed datatypes. The main problem arises when raw_df is missing one of the columns that I need, which gives an error since I can't assign it to df (raw_df could be missing one or more columns, it depends on the api call), in this case it would mean that such column in df should be empty.
I've came up with a rather unelegant solution, I initialized an empty dataframe with 12 columns with their respective datatypes, then assign each column in raw_df to its counterpart in df. Then I added a try except to ignore the error in the cases in which there may not be a column in raw_df.
df = pd.DataFrame({"ID de respuesta": pd.Series(dtype='int'),
"TiendaID": pd.Series(dtype='int'),
"Dirección IP": pd.Series(dtype='str'),
"Marca de tiempo (dd/mm/yyyy)": pd.Series(dtype='float'),
"Tiempo necesario para completar (segundos)": pd.Series(dtype='int'),
"Latitud": pd.Series(dtype='float'),
"Longitud": pd.Series(dtype='float'),
"¿El vendedor te recomendó algún producto adicional?": pd.Series(dtype='int'),
"¿Cuán probable es que recomiendes las tiendas Sx a un familiar o amigo?": pd.Series(dtype='int'),
"¿Por qué nos felicitas?": pd.Series(dtype='str'),
"¿En qué debemos mejorar?": pd.Series(dtype='str'),
"¿En qué fallamos?": pd.Series(dtype='str')})
df["ID de respuesta"] = raw_df["responseID"]
df["TiendaID"] = raw_df["surveyID"]
df["Dirección IP"] = raw_df["ipAddress"]
df["Marca de tiempo (dd/mm/yyyy)"] = raw_df["timestamp"]
df["Tiempo necesario para completar (segundos)"] = raw_df["timeTaken"]
df["Latitud"] = raw_df["location.latitude"]
df["Longitud"] = raw_df["location.longitude"]
df["¿El vendedor te recomendó algún producto adicional?"] = raw_df["¿El vendedor te recomendó algún producto adicional? "]
df["¿Cuán probable es que recomiendes las tiendas Sx a un familiar o amigo?"] = raw_df["¿Cuán probable es que recomiendes las tiendas Sx a un familiar o amigo?"]
try:
df["¿Por qué nos felicitas?"] = raw_df["¿Por qué nos felicitas?"]
df["¿En qué debemos mejorar?"] = raw_df["¿En qué debemos mejorar?"]
df["¿En qué fallamos?"] = raw_df["¿En qué fallamos?"]
except Exception:
pass
I would like help to find a more pythonic way to do this, since I think this is not the best way. Thank you for your help.
CodePudding user response:
You could e.g. specify the mapping from your column names in df
to the respective column names in raw_df
in a dictionary d
and then iterate over the key
, val
pairs of it and first check whether the column is present in raw_df
with an if statement, and only assign it in that case, as follows:
d = {
"ID de respuesta": "responseID",
"TiendaID" = "surveyID",
...
}
for key, val in d.items():
if val in raw_df.columns:
df[key] = raw_df[val]