I have a dataframe with two columns (more but only two of interest) such as below:
Column 1 | Column 2 |
---|---|
Doesn't | A1 |
Matter | A2 |
What | A1 |
Is | A2 |
In | A3 |
This | A2 |
Column | A4 |
But | A2 |
I | A1 |
Do | A2 |
Need | A3 |
It | A2 |
In | A4 |
Output | A5 |
And I simply want to get a list of all unique values in column 2 - however - I want 3 instances/examples of each value (it doesn't have to be first 3 but first 3 will work just as fine).
I know how to get unique elements of a dataframe obviously with .unique() method but not 3 repeats of each unique value (up to 3 so if only 1 or 2 exists - I want to get just get those).
And I want to get
Column 1 | Column 2 |
---|---|
Doesn't | A1 |
What | A1 |
I | A1 |
Matter | A2 |
Is | A2 |
This | A2 |
In | A3 |
Need | A3 |
Column | A4 |
In | A4 |
Output | A5 |
The order of column 2 does not matter. I do want to get the unique terms in adjacent rows (grouped together) but there is no sequential order in between groups that matters. So the Set of A3 could come before the set of A1 rows.
CodePudding user response:
You can use grouping (groupby manual) and then take only n first elements of each group (groupby.head manual).
Let's create the data:
data = {'Column 1': ['Doesn\'t', 'Matter', 'What', 'Is', 'In', 'This', 'Column', 'But', 'I', 'Do', 'Need', 'It', 'In', 'Output'], 'Column 2': ['A1', 'A2', 'A1', 'A2', 'A3', 'A2', 'A4', 'A2', 'A1', 'A2', 'A3', 'A2', 'A4', 'A5']}
df = pd.DataFrame(data)
Group it, take first 3 elements of each group, sort by Column 2 values:
df.groupby(by='Column 2').head(3).sort_values(by='Column 2')
CodePudding user response:
You can use the rank
function to rank rows by partitioning on "Column 2" values. Filtering on rank values less than 4 for each partition may get what you need:
df.loc[df.groupby(['Column 2'])['Column 2'].rank(method='first')<=3]
Input sample used for testing:
df = pd.DataFrame([
["Doesn't", "A1"],
["Matter" , "A2"],
["What" , "A1"],
["Is" , "A2"],
["In" , "A3"],
["This" , "A2"],
["Column" , "A4"],
["But" , "A2"],
["I" , "A1"],
["Do" , "A2"],
["Need" , "A3"],
["It" , "A2"],
["In" , "A4"],
["Output" , "A5"],
], columns = ["Column 1", "Column 2"])