Cannot append items to end of list and concatenate data frames

Time:10-29

I am looping through data frame columns to obtain specific pieces of data - so far I have been successful except for the last three pieces of data I need. When I attempt to append these pieces of data to the list, they are appended to the beginning of the list and not at the end (I need them to be at the end).

Therefore, when I convert this list into a data frame and attempt to concatenate it to another data frame I have prepared, the values are all in the wrong places.

This is my code:

descs = ["a", "b", "c", "d", "e", "f", "g"]
data =[]
stats = []
for desc in descs:
    data.append({
        "Description":desc
    })
for column in df:
    if df[column].name == "column1":
        counts = df[column].value_counts()
        stats.extend([
            counts.sum(), 
            counts[True], 
            counts[False] 
        ])
    elif df[column].name == "date_column":
        stats.append(
            df[column].min().date()
        )
    #Everything is fine up until this `elif` block
    #I THINK THIS IS WHERE THE PROBLEM IS I DONT KNOW HOW TO FIX IT
    elif df[column].name == "column2":
        stats.extend([
            df[column].max() ,
            round(df[column].loc[df["column1"] == True].agg({column:"mean"}),2),
            round(df[column].loc[df["column1"] == False].agg({column:"mean"}),2)
        ])

If I run this code without the second elif block (the column2 branch) and concatenate data and stats as data frames with pd.concat([pd.DataFrame(data), pd.DataFrame({"Statistic":stats})], axis = 1), I get the following output, which is the output I want:

Description Statistic
"a" 38495
"b" 3459
"c" 234
"d" 1984-06-2
"e" NaN
"f" NaN
"g" NaN

When I run the above code chunk including the second elif block, the output is messed up:

Description Statistic
"a" [78, [454],[45]]
"b" 38495
"c" 3459
"d" 234
"e" 1984-06-2
"f" NaN
"g" NaN

The values in the first row of the data frame, [78, 454, 45], should instead fill the three rows (in that order) where the NaNs appear in the first table.

What am I doing wrong?

CodePudding user response:

You're really close to making this work the way you want!

A couple things to make your life simpler:

  1. df[column].name isn't needed because you can just use column
  2. Looping through columns and having multiple if statements on their names to calculate summary statistics works, but you'll make your life easier if you look into .groupby() with .agg()
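To illustrate the second point, here is a minimal sketch of the .groupby() with .agg() approach. The sample data is made up for illustration; only the column names column1 and column2 come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "column1": [True, True, False, False, True],
    "column2": [10.0, 20.0, 30.0, 50.0, 60.0],
})

# Mean of column2 within each column1 group, rounded to 2 decimals.
group_means = df.groupby("column1")["column2"].agg("mean").round(2)

# Convert to a plain dict so each group mean is a single number.
means = group_means.to_dict()
print(means[True], means[False])
```

One groupby call replaces the two separate boolean filters, and every value pulled out of the dict is a scalar that can go straight into a list.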

And that brings me to your issue: calling .agg() with a dict returns a pandas Series, and you just want a single number, so a whole Series ends up in your list. Try

round(df[column].loc[df["column1"] == False].mean(),2)

instead (and the same for the == True line). :)
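A small sketch of the difference, using made-up data (only the column names are taken from the question): the dict form of .agg() gives back a one-element Series, while .mean() gives back a plain number.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "column1": [True, False, True, False],
    "column2": [1.0, 2.0, 4.0, 5.0],
})
column = "column2"
subset = df[column].loc[df["column1"] == False]

as_series = subset.agg({column: "mean"})  # dict argument: result is a Series
as_scalar = subset.mean()                 # result is a single float

print(type(as_series).__name__)
print(type(as_scalar).__name__)
print(round(as_scalar, 2))
```

Appending as_series to a list stores the whole Series object in one slot, which is why the resulting data frame column shows nested values instead of numbers.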

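Update: Now it looks like you are hitting the second elif with the first column. for column in df visits the columns in the DataFrame's own order, so if column2 comes before column1 its values are appended to stats first. Loop over the column names in the order you want the statistics in: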

cols = ["column1", "date_column", "column2"]
for column in cols:
    if column == "column1":
        ...  # same branch bodies as in the question, comparing column directly
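Putting both fixes together, a fuller sketch might look like the following. The sample DataFrame is an assumption made up for illustration, with its columns deliberately out of order; the branch logic follows the question, with .mean() swapped in for the dict form of .agg().

```python
import pandas as pd

# Hypothetical data; column2 deliberately comes first in the DataFrame.
df = pd.DataFrame({
    "column2": [78, 40, 50, 30],
    "column1": [True, False, True, False],
    "date_column": pd.to_datetime(
        ["1990-01-01", "1984-06-02", "2001-03-04", "1999-12-31"]
    ),
})
descs = ["a", "b", "c", "d", "e", "f", "g"]

stats = []
# Iterate in a fixed order instead of the DataFrame's column order.
for column in ["column1", "date_column", "column2"]:
    if column == "column1":
        counts = df[column].value_counts()
        stats.extend([counts.sum(), counts[True], counts[False]])
    elif column == "date_column":
        stats.append(df[column].min().date())
    elif column == "column2":
        stats.extend([
            df[column].max(),
            round(df.loc[df["column1"], column].mean(), 2),   # scalar
            round(df.loc[~df["column1"], column].mean(), 2),  # scalar
        ])

result = pd.concat(
    [pd.DataFrame({"Description": descs}), pd.DataFrame({"Statistic": stats})],
    axis=1,
)
print(result)
```

Because the loop order is fixed and every entry in stats is a scalar, the seven statistics line up one-to-one with the seven descriptions.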