I have a pandas DataFrame where some rows of the message column contain JSON-like data:
Date | message |
---|---|
2020-01-01 | { 'var1': 'foo'} |
2020-01-01 | different content |
2020-01-01 | { 'var2': 'foo'} |
2020-01-02 | non json content |
2020-01-02 | { 'var2': 'bar'} |
I would like to pivot the table so that I have a time series of the different variables, like:
Date | var1 | var2 |
---|---|---|
2020-01-01 | foo | foo |
2020-01-02 | nan/empty | bar |
CodePudding user response:
Convert the values to Series, parsing the string dicts with ast.literal_eval, then take the first non-missing value per date with GroupBy.first:
import ast
import pandas as pd

def f(x):
    try:
        # parse the Python-style dict string into a real dict
        return pd.Series(ast.literal_eval(x))
    except (SyntaxError, ValueError):
        # rows like 'different content' are not parseable -> empty Series
        return pd.Series(dtype=object)

df = df.set_index('Date')['message'].apply(f).groupby(level=0).first()
print(df)
var1 var2
Date
2020-01-01 foo foo
2020-01-02 None bar
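Note that ast.literal_eval is used here instead of json.loads because the messages use single quotes, which are valid Python literals but not valid JSON. A small sketch of the difference:

```python
import ast
import json

msg = "{ 'var1': 'foo'}"

# Python literal syntax: single-quoted strings are fine
parsed = ast.literal_eval(msg)
print(parsed)  # {'var1': 'foo'}

# JSON requires double-quoted strings, so json.loads rejects it
try:
    json.loads(msg)
except json.JSONDecodeError:
    print("not valid JSON")
```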
CodePudding user response:
Not sure what your data exactly looks like, but you could do the following:

- Use pd.json_normalize() to parse some of the nested json content
- Then use groupby() to get, for example, the first result
Here is an example:
import pandas as pd

# Input data
data = {
    "Date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "message": [{'var1': 'foo'}, {'stupidkey': 'stupidval'}, {'var2': 'foo'}, "bla", {'var2': 'bar'}],
}
df = pd.DataFrame(data)

# Generate result
varsToKeep = ["var1", "var2"]
# json_normalize expects dicts, so map non-dict rows to an empty dict first
records = [m if isinstance(m, dict) else {} for m in df["message"]]
df_vars = pd.json_normalize(records)[varsToKeep]
df = df.join(df_vars)
df_result = df.groupby("Date").first()[varsToKeep]
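For completeness, here is a self-contained rerun of the snippet above that checks the result; groupby("Date").first() skips NaN values, so var1 simply stays missing on 2020-01-02:

```python
import pandas as pd

data = {
    "Date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "message": [{'var1': 'foo'}, {'stupidkey': 'stupidval'}, {'var2': 'foo'}, "bla", {'var2': 'bar'}],
}
df = pd.DataFrame(data)

varsToKeep = ["var1", "var2"]
# non-dict rows (e.g. "bla") become an empty dict, i.e. an all-NaN row
records = [m if isinstance(m, dict) else {} for m in df["message"]]
df_vars = pd.json_normalize(records)[varsToKeep]
df_result = df.join(df_vars).groupby("Date").first()[varsToKeep]

print(df_result)
```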