I am trying to handle the following dataframe
import pandas as pd
import io
csv_data = '''
ID,age,Year
100,75,2020
100,76,2021
200,64,2020
200,65,2021
200,66,2022
300,69,2020
300,70,2021
300,71,2022
300,72,2023
'''
df = pd.read_csv(io.StringIO(csv_data))
df = df.set_index(['ID', 'age'])
df
Year
ID age
100 75 2020
76 2021
200 64 2020
65 2021
66 2022
300 69 2020
70 2021
71 2022
72 2023
The ID in this data frame represents the same person, and Year means the year of the visit.
This will be multi-indexed based on ID and age. At this point, is it possible to display from the ID with more age elements?
The ideal display is shown below.
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
CodePudding user response:
Unless it's a requirement you immediately set_index
after your .read_csv
, then the following should be reasonably okay...
Get a count of how many times an ID occurs - we can use .value_counts
for this which handily automatically sorts by descending order...
id_freq = df['ID'].value_counts()
Then index your DF and .reindex
using the index of id_freq
, eg:
df = df.set_index(['ID', 'age']).reindex(id_freq.index, level=0)
This'll give you:
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
Might also have a side effect of being useful as you could also run id_freq.value_counts()
to get a distribution of how many times each patient in general appears.
If you do have to index from the start, then you might as well provide it to .read_csv
, such as:
df = pd.read_csv(io.StringIO(csv_data), index_col=['ID', 'age'])
Then similar to above, reindex but on the level values of the first level of the index, eg:
df = df.reindex(df.index.get_level_values(0).value_counts().index, level=0)
CodePudding user response:
You can first compute the length of each ID
group using groupby
and size
, and then sort the ID
values based on the length:
s = df.groupby('ID').size()
df.sort_values('ID', key=lambda i:s[i], ascending=False)
Output:
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
CodePudding user response:
You can sort based on two columns and set which column ascending and which not ascending like below:
>>> print(df.sort_values(['ID', 'age'], ascending=[False, True]))
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
CodePudding user response:
You could sort these values like this:
df = df.sort_values(['ID', 'age'],
ascending = [False, True])
CodePudding user response:
pandas sort_index()
is useful for sorting dataframe by index values.
df.sort_index(ascending=[False, True])