How do you select just the oldest date in a dataframe using Python?

I am getting started with datetime analysis in Python. I have two columns in a dataframe:

import pandas as pd


a= [('2021-06-11 00:00:00'),
 ('2021-06-09 00:00:00'),
 ('2021-06-10 00:00:00'),
 ('2021-06-11 00:00:00'),
 ('2021-06-12 00:00:00'),
 ('2021-06-13 00:00:00'),
 ('2021-06-14 00:00:00'),
 ('2021-06-15 00:00:00'),
 ('2021-06-16 00:00:00'),
 ('2021-06-17 00:00:00'),
 ('2021-06-18 00:00:00'),
 ('2021-06-19 00:00:00'),
 ('2021-06-20 00:00:00')]
b = ["a", "a", "a", "a", "a", "a", "a", "b", "b","b" ,"b", "b"]

fecha = pd.DataFrame(list(zip(a,b)))
fecha[0] = pd.to_datetime(fecha[0])

I would like to select just the rows that contain the oldest date for "a" and for "b". I tried some methods but could not get it to work.

Thank you!

CodePudding user response：

A simpler method is to sort the dataframe by the date column and then take the first rows for the group you want, here "a". Sorting the strings directly works because they are zero-padded timestamps.

import pandas as pd

a= [('2021-06-11 00:00:00'),
 ('2021-06-09 00:00:00'),
 ('2021-06-10 00:00:00'),
 ('2021-06-11 00:00:00'),
 ('2021-06-12 00:00:00'),
 ('2021-06-13 00:00:00'),
 ('2021-06-14 00:00:00'),
 ('2021-06-15 00:00:00'),
 ('2021-06-16 00:00:00'),
 ('2021-06-17 00:00:00'),
 ('2021-06-18 00:00:00'),
 ('2021-06-19 00:00:00'),
 ('2021-06-20 00:00:00')]
b = ["a", "a", "a", "a", "a", "a", "a", "b", "b","b" ,"b", "b","b"]

fecha = pd.DataFrame({"a":a,"b":b})

fecha = fecha.sort_values(by=['a'])

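# Mask out every row that is not in group "a", drop the masked rows,
# then take the three oldest of what is left
fecha.where(fecha["b"] == "a").dropna().head(3)

and the output is

                     a  b
1  2021-06-09 00:00:00  a
2  2021-06-10 00:00:00  a
0  2021-06-11 00:00:00  a

Methods used: sort_values, where, dropna, head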
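
To get the oldest row for every group at once, rather than one letter at a time, the same sort can be combined with drop_duplicates. This is only a minimal sketch, assuming the column names "a" (date) and "b" (group) from this answer and a shortened version of the data:

```python
import pandas as pd

# Shortened version of the data above; column names follow this answer
fecha = pd.DataFrame({
    "a": ["2021-06-11 00:00:00", "2021-06-09 00:00:00", "2021-06-10 00:00:00",
          "2021-06-15 00:00:00", "2021-06-17 00:00:00", "2021-06-16 00:00:00"],
    "b": ["a", "a", "a", "b", "b", "b"],
})
fecha["a"] = pd.to_datetime(fecha["a"])

# Sort oldest first, then keep only the first row seen for each group
oldest = fecha.sort_values(by="a").drop_duplicates(subset="b", keep="first")
print(oldest)
```

Because the frame is sorted before duplicates are dropped, the row kept for each letter is the one with the earliest date.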

CodePudding user response：

I converted a to datetime objects and did it that way. Also, I had to add an extra item to your list b because a and b had different lengths.

import datetime

a= [('2021-06-11 00:00:00'),
    ('2021-06-09 00:00:00'),
    ('2021-06-10 00:00:00'),
    ('2021-06-11 00:00:00'),
    ('2021-06-12 00:00:00'),
    ('2021-06-13 00:00:00'),
    ('2021-06-14 00:00:00'),
    ('2021-06-15 00:00:00'),
    ('2021-06-16 00:00:00'),
    ('2021-06-17 00:00:00'),
    ('2021-06-18 00:00:00'),
    ('2021-06-19 00:00:00'),
    ('2021-06-20 00:00:00')]
b = ["a", "a", "a", "a", "a", "a", "a", "b", "b", "b" ,"b", "b", "b"]

a0 = [datetime.datetime.strptime(i, "%Y-%m-%d %H:%M:%S") for i in a]

print("min a:", min([a0[i] for i in range(len(a0)) if b[i] == "a"]))
print("min b:", min([a0[i] for i in range(len(a0)) if b[i] == "b"]))

#min a: 2021-06-09 00:00:00
#min b: 2021-06-15 00:00:00

To do this in a dataframe, as you asked, this is one option:

import pandas as pd 

df = pd.DataFrame(list(zip(a0,b)), columns = ["date", "letter"])
# Select the "date" values for one letter, then take the earliest
print(df.loc[df["letter"].eq("a"), "date"].min())
print(df.loc[df["letter"].eq("b"), "date"].min())

#2021-06-09 00:00:00
#2021-06-15 00:00:00

All that said, you don't have to use datetime, although I would recommend it. The same code works on the original strings:

import pandas as pd

# Iterate over the string list a itself (a0 is not needed here)
print("min a:", min([a[i] for i in range(len(a)) if b[i] == "a"]))
print("min b:", min([a[i] for i in range(len(a)) if b[i] == "b"]))
#min a: 2021-06-09 00:00:00
#min b: 2021-06-15 00:00:00


df = pd.DataFrame(list(zip(a,b)), columns = ["date", "letter"])
print(min(df.loc[df["letter"].eq("a")]["date"]))
print(min(df.loc[df["letter"].eq("b")]["date"]))
#2021-06-09 00:00:00
#2021-06-15 00:00:00
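
The string version works only because the timestamps are zero-padded and ordered year, month, day, so alphabetical order matches chronological order. Here is a small sketch of what can go wrong otherwise, using made-up date strings that are not from the question:

```python
import datetime

# Hypothetical date strings, one without zero padding (not from the question)
raw = ["2021-6-9", "2021-06-10"]

# Compared as text, "2021-06-10" comes first because "0" sorts before "6"
oldest_as_text = min(raw)

# Parsed into datetime objects, the comparison is chronological
parsed = [datetime.datetime.strptime(s, "%Y-%m-%d") for s in raw]
oldest_as_date = min(parsed)

print(oldest_as_text)   # 2021-06-10
print(oldest_as_date)   # 2021-06-09 00:00:00
```

This is the main reason to prefer converting to datetime first when the input format is not guaranteed.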

CodePudding user response：

Another method is to group by the second column (labelled 1) and take the minimum of the first:

fecha.groupby([1]).agg('min')

The output is:

           0
1           
a 2021-06-09
b 2021-06-15
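
Note that agg('min') returns only the minimum dates. If the complete rows are needed (for example when the dataframe has more columns), one option is idxmin, which gives the index label of the oldest date in each group. This is a minimal sketch, assuming the question's default column labels 0 and 1 and a shortened version of the data:

```python
import pandas as pd

# Shortened version of the question's data, with default column labels 0 and 1
fecha = pd.DataFrame(list(zip(
    ["2021-06-11", "2021-06-09", "2021-06-10", "2021-06-15", "2021-06-16"],
    ["a", "a", "a", "b", "b"],
)))
fecha[0] = pd.to_datetime(fecha[0])

# Index label of the oldest date within each group
oldest_idx = fecha.groupby(1)[0].idxmin()

# Use those labels to pull the complete rows
oldest_rows = fecha.loc[oldest_idx]
print(oldest_rows)
```

If several rows in a group share the oldest date, idxmin returns only the first of them.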