As a follow-up to my previous question: I have a table in a CSV file:
A | B |
---|---|
35480007 | 0695388 |
35480007 | 0695388 |
35407109 | 3324741 |
35407109 | 3324741 |
35250208 | 0695388 |
35250208 | 6104556 |
86730903 | 3360935 |
86730903 | 3360935 |
By applying this aggregation code:
df.groupby("B")["A"].unique()
I get the result:
695388 [35480007, 35250208]
3324741 [35407109]
3360935 [86730903]
6104556 [35250208]
Could you please tell me how I can apply some kind of filter so that only those values that have a value greater than two are displayed, like this:
695388 [35480007, 35250208]
Also, how can I save the result to a file, for example a txt file?
I apologize in advance if my question is unclear. I am quite new to the pandas library.
Thank you very much!
CodePudding user response:
It took me a second to realize that what you mean is not a value greater than two, but rather a length greater than one (or greater than or equal to two).
With that said, you can use the apply function on your Series to see which rows satisfy this property:
grouped = df.groupby("B")["A"].unique()
has_multiple_elements = grouped.apply(lambda x: len(x)>1)
This applies a function to each entry in your grouped series and returns the following:
695388 True
3324741 False
3360935 False
6104556 False
Now all that's left is to use these True/False boolean values to filter your series. Luckily, this is very simple:
result = grouped[has_multiple_elements]
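Putting the steps together, here is a minimal self-contained sketch. It assumes the data from the question is built inline instead of being read from the CSV, and that column B was parsed as integers (which is why the leading zero of 0695388 is gone in the output above):

```python
import pandas as pd

# Sample data copied from the question. B is stored as integers here,
# an assumption that matches the output shown in the question.
df = pd.DataFrame({
    "A": [35480007, 35480007, 35407109, 35407109,
          35250208, 35250208, 86730903, 86730903],
    "B": [695388, 695388, 3324741, 3324741,
          695388, 6104556, 3360935, 3360935],
})

# One array of distinct A values per B value.
grouped = df.groupby("B")["A"].unique()

# Boolean mask: True where a group holds more than one distinct A value.
has_multiple_elements = grouped.apply(lambda x: len(x) > 1)

# Keep only the groups where the mask is True.
result = grouped[has_multiple_elements]
print(result)
```

With this sample data, only the entry for 695388 should remain, since it is the only B value paired with two different A values.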
As for the second part of your question, writing this to a file can be done using the to_csv function:
# I usually use tab separated files in case any commas appear in your data itself
result.to_csv('output.tsv', sep='\t')
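Note that each entry in the series is an array, so it is written using its string form, brackets included. If you would rather have plain values in the file, one option is to join each group into a single string first. The sketch below is only an illustration: the stand-in series, the temporary directory, and the output.txt file name are assumptions, not part of your data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the filtered `result` series from above.
result = pd.Series(
    [[35480007, 35250208]],
    index=pd.Index([695388], name="B"),
    name="A",
)

# Join each group's values into one comma-separated string,
# so no brackets end up in the file.
flat = result.apply(lambda values: ",".join(str(v) for v in values))

# Write to a temporary location (hypothetical path) as tab-separated text.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
flat.to_csv(path, sep="\t")

# Read the file back to check what was written.
with open(path) as f:
    content = f.read()
print(content)
```

Because the separator is a tab, the commas inside each joined value do not clash with the column delimiter.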