Correct display of data after aggregation

Time:02-18

In continuation of my question

I have a table in CSV format:

A B
35480007 0695388
35480007 0695388
35407109 3324741
35407109 3324741
35250208 0695388
35250208 6104556
86730903 3360935
86730903 3360935

By applying the code for aggregation:

df.groupby("B")["A"].unique()

I get the result:

695388     [35480007, 35250208]
3324741              [35407109]
3360935              [86730903]
6104556              [35250208]

Could you please tell me how I can apply some kind of filter so that only those entries that have a value greater than two are displayed, like this:

695388     [35480007, 35250208]

Also, how can I save the result to a file, for example a .txt file?

I apologize in advance if my question is unclear. I am very new to the pandas library.

Thank you very much!

CodePudding user response:

It took me a second to realize that what you mean is not a value greater than two, but rather a length greater than one (or greater than or equal to two).

With that said, you can use the apply function on your Series to see which rows satisfy this property:

grouped = df.groupby("B")["A"].unique()
has_multiple_elements = grouped.apply(lambda x: len(x) > 1)

This applies the function to each entry in your grouped Series and returns the following:

695388      True
3324741    False
3360935    False
6104556    False

Now all that's left is to use these True/False values as a boolean mask to filter your Series. Luckily, this is very simple:

result = grouped[has_multiple_elements]
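
For reference, the steps so far can be put together in a self-contained sketch. The DataFrame is built inline from the sample data in the question rather than read from the CSV file, and both columns are assumed to be parsed as integers (which would explain why 0695388 shows up as 695388 in the output).

```python
import pandas as pd

# Sample data from the question, built inline instead of read from the CSV file.
# Assumption: both columns are parsed as integers, so 0695388 becomes 695388.
df = pd.DataFrame({
    "A": [35480007, 35480007, 35407109, 35407109,
          35250208, 35250208, 86730903, 86730903],
    "B": [695388, 695388, 3324741, 3324741,
          695388, 6104556, 3360935, 3360935],
})

# Unique values of A for each value of B
grouped = df.groupby("B")["A"].unique()

# Boolean mask: True where a group holds more than one unique value
has_multiple_elements = grouped.apply(lambda x: len(x) > 1)

# Keep only the groups flagged True
result = grouped[has_multiple_elements]
print(result)
```

With this sample data, only the group for B = 695388 should remain, since it is the only one with two distinct values of A.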

As for the second part of your question, writing this to a file can be done using the to_csv function:

# I usually use tab-separated files in case any commas appear in the data itself
result.to_csv('output.tsv', sep='\t')
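
One thing to keep in mind: each entry of result is an array, so to_csv writes it in its string form, brackets included. If plain values are preferred in the file, one possible approach (a sketch, not part of the answer above) is to join each group into a single string before writing. The inline Series below is a hypothetical stand-in for result, and the file name, temporary directory, and space separator are all assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the filtered `result` Series from the answer
result = pd.Series(
    [[35480007, 35250208]],
    index=pd.Index([695388], name="B"),
    name="A",
)

# Join the values of each group into one space-separated string
as_text = result.apply(lambda values: " ".join(str(v) for v in values))

# Written to a temporary directory here; any writable path would do
path = os.path.join(tempfile.mkdtemp(), "output.txt")
as_text.to_csv(path, sep="\t")

# Read the file back to check what was written
with open(path) as f:
    content = f.read()
print(content)
```

The file then holds one tab-separated line per group, with the group key in the first column and its values in the second.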