Home > database >  Split - apply - save csv for pandas using pure pandas / apply
Split - apply - save csv for pandas using pure pandas / apply

Time:06-10

I have a boolean dataframe indexed by timestamps

df
>>>
timestamp     x0        x1        x2
2020-01-01    True      False     True
2020-01-02    True      False     True
2020-01-03    False     True      True

I want to save a csv of the column names for each row where the column is True, with the current timestamp as the csv filename. So in the example above, the desired output would be 3 csv's:

20200101.csv:

x0,
x2,

20200102.csv:

x0,
x2,

20200103.csv:

x1,
x2,

I have managed to do this using a for loop and some pandas methods, but it seems clunky. (This would be almost a one-liner in R, like using split and lapply.)

import numpy as np

for idx, row in df.iterrows():
    tmp = row.replace({False: np.nan}).dropna()
    tmp = pd.DataFrame({"my_col": tmp.index.tolist()})
    file_name = ''.join(str(idx.date()).split('-'))
    tmp.to_csv(f"{file_name}.csv", index=False)

Is there a clean way to do this using pure pandas / map reduce / pandas apply and avoiding for loops?

CodePudding user response:

I honestly don't know if it's appropriate to use .apply for side effects, but this works. row.index[row] uses the row as a boolean mask to filter the column labels, the rest should be self-explanatory.

df.apply(
    lambda row: row.index[row]
    .to_frame()
    .to_csv(f"{row.name.replace('-', '')}.csv", 
            index=False, 
            header=False),
    axis=1 # apply row-wise
)

CodePudding user response:

Had to stick with a loop to write out the CSVs.

df_out = df.melt(id_vars='timestamp').loc[lambda x: x['value']].sort_values('timestamp')

print(df_out)

    timestamp variable  value
0  2020-01-01       x0   True
6  2020-01-01       x2   True
1  2020-01-02       x0   True
7  2020-01-02       x2   True
5  2020-01-03       x1   True
8  2020-01-03       x2   True

Resorted to the much-maligned loop for output to CSV:

for t in df_out['timestamp'].unique():
    df_out.loc[df_out['timestamp']==t,'variable'] \
        .to_csv(re.sub('-','',fr'd:\jchtempnew\SO\{t}.csv'), 
                index=None, header=None, line_terminator=',\r\n')

20200101.csv:
x0,
x2,

20200102.csv:
x0,
x2,

20200103.csv: 
x1,
x2,
  • Related