What is the most Pythonic way to dynamically create a DataFrame containing each person's age per month?

I have a list of people with their first name, last name and date of birth in a DataFrame.

import pandas as pd

data = [
    ["John",   "Wayne",   "13.12.2018"],
    ["Max",    "Muster",  "02.06.2016"],
    ["Steve",  "Black",   "11.04.2017"],
    ["Amy",    "Smith",   "10.10.2017"],
    ["July",   "House",   "08.05.2018"],
    ["Anna",   "Whine",   "20.08.2016"],
    ["Charly", "Johnson", "16.07.2016"],
]

people = pd.DataFrame(
    data,
    columns=["first", "last", "birthdate"],
)

people["birthdate"] = pd.to_datetime(people["birthdate"], format="%d.%m.%Y")

    first     last  birthdate
0    John    Wayne 2018-12-13
1     Max   Muster 2016-06-02
2   Steve    Black 2017-04-11
3     Amy    Smith 2017-10-10
4    July    House 2018-05-08
5    Anna    Whine 2016-08-20
6  Charly  Johnson 2016-07-16

I would like to create another DataFrame with the same rows but with the months of a year as columns. Each value should be the person's age at the end of that month.

Here is what I'm currently doing:

# generate series for all months
months = pd.date_range("2022-01-01", "2022-12-01", freq="MS")

# calculate age for every person
age = pd.DataFrame(data={"first": people["first"], "last": people["last"]})
for value in months:
    last_day_of_month = value + pd.offsets.MonthEnd()
    age[value.strftime("%b")] = (last_day_of_month - people["birthdate"]).astype(
        "timedelta64[Y]"
    )

    first     last  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
0    John    Wayne  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0
1     Max   Muster  5.0  5.0  5.0  5.0  5.0  6.0  6.0  6.0  6.0  6.0  6.0  6.0
2   Steve    Black  4.0  4.0  4.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0
3     Amy    Smith  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  5.0  5.0  5.0
4    July    House  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0
5    Anna    Whine  5.0  5.0  5.0  5.0  5.0  5.0  5.0  6.0  6.0  6.0  6.0  6.0
6  Charly  Johnson  5.0  5.0  5.0  5.0  5.0  5.0  6.0  6.0  6.0  6.0  6.0  6.0

That works fine, but I was wondering whether there is a more Pythonic way to solve my problem. A for loop is certainly what I would use in other programming languages, but I thought, "Maybe there is a smarter way to solve this ...".

Also another general question:

Would you rather put the months in the columns or in the rows? I'm new to Python and pandas and was wondering whether there are best practices for modelling time series data.

Thank you very much!

CodePudding user response：

You can try to vectorize all your operations using NumPy broadcasting:

# first day of each month in 2022
months = pd.date_range("2022-01-01", "2022-12-01", freq="MS")

idx = pd.MultiIndex.from_frame(people[['first', 'last']])

out = (
    pd.DataFrame(
        # (12,) months minus (n, 1) birthdates broadcasts to an (n, 12) array
        months.to_numpy() - people[["birthdate"]].to_numpy(),
        index=idx,
        columns=months.strftime("%b"),
    )
    .astype("timedelta64[Y]")
    .reset_index()
)

print(out)

Output:

    first     last  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
0    John    Wayne  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0
1     Max   Muster  5.0  5.0  5.0  5.0  5.0  5.0  6.0  6.0  6.0  6.0  6.0  6.0
2   Steve    Black  4.0  4.0  4.0  4.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0
3     Amy    Smith  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  5.0  5.0
4    July    House  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0
5    Anna    Whine  5.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0  6.0  6.0  6.0  6.0
6  Charly  Johnson  5.0  5.0  5.0  5.0  5.0  5.0  5.0  6.0  6.0  6.0  6.0  6.0
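
One caveat about the output above: it shows ages on the first day of each month (John Wayne is still 3 in December), while the question asked for the age at the end of the month. Also, as far as I know, pandas 2.x no longer accepts `astype("timedelta64[Y]")`, and that conversion only approximates a year by its average length anyway. Below is a minimal sketch, not part of the original answer, that keeps the broadcasting idea but compares calendar fields instead. It uses only the first three people, and the names `month_ends`, `ages` and `long` are my own. The last line uses `melt` in case you prefer the months in rows, with one row per person and month.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    [
        ["John", "Wayne", "13.12.2018"],
        ["Max", "Muster", "02.06.2016"],
        ["Steve", "Black", "11.04.2017"],
    ],
    columns=["first", "last", "birthdate"],
)
people["birthdate"] = pd.to_datetime(people["birthdate"], format="%d.%m.%Y")

# Month starts rolled forward to the last day of each month (2022-01-31 ... 2022-12-31).
month_ends = pd.date_range("2022-01-01", "2022-12-01", freq="MS") + pd.offsets.MonthEnd()

birth = people["birthdate"]
# Column vectors of shape (n, 1) broadcast against row vectors of shape (12,).
b_year = birth.dt.year.to_numpy()[:, None]
b_md = (birth.dt.month * 100 + birth.dt.day).to_numpy()[:, None]
e_year = month_ends.year.to_numpy()
e_md = (month_ends.month * 100 + month_ends.day).to_numpy()

# Whole years: year difference, minus one if the birthday has not been reached
# by that month end.
ages = (e_year - b_year) - (e_md < b_md).astype(int)

out = pd.concat(
    [
        people[["first", "last"]],
        pd.DataFrame(ages, index=people.index, columns=month_ends.strftime("%b")),
    ],
    axis=1,
)

# Long format: one row per (person, month) instead of one column per month.
long = out.melt(id_vars=["first", "last"], var_name="month", value_name="age")

print(out)
print(long.head())
```

For these three people the wide result agrees with the loop in the question, as integers rather than floats. Whether wide or long is better depends on what you do next: the wide table reads well as a report, while the long one is usually easier to filter, group and plot.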