Pandas Scipy mannwhitneyu in this type of data table


I have a data table similar to this one (but huge), with many types and more "Spots" values for each "Color":

Type    Color   Spots
A   Blue    792
A   Blue    56
A   Blue    2726
A   Blue    780
A   Blue    591
A   Blue    2867
A   Blue    193
A   Green   134
A   Green   631
A   Green   1010
A   Green   53
A   Green   5826
A   Green   6409
A   Green   3278
B   Blue    670
B   Blue    42
B   Blue    1165
B   Blue    3203
B   Blue    2164
B   Blue    5876
B   Blue    525
B   Green   26
B   Green   143
B   Green   399
B   Green   68
B   Green   939
B   Green   1528
B   Green   401
B   Green   1842
C   Blue    265
C   Blue    19
C   Blue    1381
C   Blue    4483
C   Blue    1103
C   Blue    1906
C   Blue    691
C   Green   38
C   Green   149
C   Green   87
C   Green   33
C   Green   1427
C   Green   1009
C   Green   342
C   Green   190

I want to do a SciPy mannwhitneyu analysis comparing Blue vs. Green spots for each type. For instance, for type A it would be this comparison, and the same for all the other types automatically:

Blue  Green
792   134
56    631
2726  1010
780   53
591   5826
2867  6409
193   3278

I thought that defining those kinds of groups in Pandas and then passing them to SciPy should be the strategy, but my skills are not at that level yet. The idea is to do it automatically for all of the types, so I get the p-value for A, B, C, etc. Could somebody give me a hint? Thanks

CodePudding user response:

Your question probably leaves implied a lot that is obvious to you but not to people less familiar with the sort of statistical analysis you are interested in. That can make it difficult to help you along, but by trying to cover all my bases, I think I can help you regardless.

The first implied assumption appears to concern whether, when you select by type, the number of rows (which I do not assume to be consecutive in my code) has to match for each color. In your example, type A has seven rows per color, while B and C have seven blue rows and eight green ones, and you say that there will be more in your final data. That is fine: the Mann-Whitney implementation in SciPy does not require x and y to be of the same length, so you can pass the selections as they are. If you ever do need to check the lengths, len() will do.

What you would like to do, it seems, is to select by group from your Pandas dataframe, which is a common operation. For example, to get the set of distinct types from your dataframe, you could write:

types = set(df['Type'])

Then, you can iterate over those types:

for t in types:
   # ...

Similarly for your colors.

Wanting to do a Mann-Whitney test probably also implies that it is always a pair of lists you want to compare, one green and one blue. I assume those are the only colors, but if you have more, you can generate all pairwise combinations like this:

import itertools

p = itertools.combinations(['Blue', 'Red', 'Green', 'Yellow'], 2)
print(*p)

This will print:

('Blue', 'Red') ('Blue', 'Green') ('Blue', 'Yellow') ('Red', 'Green') ('Red', 'Yellow') ('Green', 'Yellow')

I do not know if you consider this over-engineering, but usually hardcoding fewer assumptions and allowing for more flexibility is best, unless it also admits cases that should be caught as errors in your data. Hardcode "Blue" and "Green" if you prefer that. Other readers of this question might be interested in the more general approach. You could do the same with "Spots" and avoid hardcoding that column name as well, in case you have more than one metric; a small sketch of that idea follows below.
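For illustration, a minimal sketch of that generalization, assuming that 'Type' and 'Color' are the grouping columns, every remaining column is a metric to test (only 'Spots' in your sample), and df is the dataframe built in the full example further down:

metric_columns = [col for col in df.columns if col not in ('Type', 'Color')]
for metric in metric_columns:
    # Select the green and blue measurements of this metric for type 'A'.
    green = df[(df['Type'] == 'A') & (df['Color'] == 'Green')][metric].astype('int64')
    blue = df[(df['Type'] == 'A') & (df['Color'] == 'Blue')][metric].astype('int64')
    # ... run the Mann-Whitney test on green and blue for this metric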

The next hint that might help you is to select, in such a loop, by two criteria, which will be your grouping criteria:

 df[(df['Type'] == t) & (df['Color'] == 'Green')]['Spots']

The expression inside df[...] evaluates to a boolean Series, which is then used to select all rows where it is True. This is one of the main techniques by which more complex queries are built with Pandas dataframes and might be the key insight you have been looking for.
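To make that concrete, again assuming df is the dataframe built in the full example further down:

mask = (df['Type'] == 'A') & (df['Color'] == 'Green')
print(mask.head())        # a boolean Series: True where both conditions hold
print(df[mask]['Spots'])  # only the 'Spots' values of the matching rows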

Do the same with the blue spots. Then, if you want to pair up two sequences element by element, there is a built-in function called zip:

zip(green, blue)

It will give you an iterator (a zip object), so convert it to a list if that is undesirable. The sequences passed do not need to be of equal length; the length of the result will be that of the shortest sequence passed to zip. Even though your example data has almost the same number of measurements for each type and color, equal sample sizes are not a requirement for the Mann-Whitney test, so this pairing is only needed if you want to look at the values side by side.
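A quick illustration of that truncation behaviour with made-up values:

pairs = list(zip([1, 2, 3], [10, 20]))
print(pairs)  # [(1, 10), (2, 20)] -- truncated to the shorter input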

Depending on how you read your data, it might also be useful to know how to convert types in tables. You might not need this step if your data types are already numeric, but by mentioning it I cover all my bases and might spare you another roadblock later down the line:

green = df[(df['Type'] == t) & (df['Color'] == 'Green')]['Spots'].astype('int64')

Putting all of this together, you might want to write something like this:

import pandas as pd
import scipy.stats
import itertools

records = [
    ('A', 'Blue', '792'),
    ('A', 'Blue', '56'),
    ('A', 'Blue', '2726'),
    ('A', 'Blue', '780'),
    ('A', 'Blue', '591'),
    ('A', 'Blue', '2867'),
    ('A', 'Blue', '193'),
    ('A', 'Green', '134'),
    ('A', 'Green', '631'),
    ('A', 'Green', '1010'),
    ('A', 'Green', '53'),
    ('A', 'Green', '5826'),
    ('A', 'Green', '6409'),
    ('A', 'Green', '3278'),
    ('B', 'Blue', '670'),
    ('B', 'Blue', '42'),
    ('B', 'Blue', '1165'),
    ('B', 'Blue', '3203'),
    ('B', 'Blue', '2164'),
    ('B', 'Blue', '5876'),
    ('B', 'Blue', '525'),
    ('B', 'Green', '26'),
    ('B', 'Green', '143'),
    ('B', 'Green', '399'),
    ('B', 'Green', '68'),
    ('B', 'Green', '939'),
    ('B', 'Green', '1528'),
    ('B', 'Green', '401'),
    ('B', 'Green', '1842'),
    ('C', 'Blue', '265'),
    ('C', 'Blue', '19'),
    ('C', 'Blue', '1381'),
    ('C', 'Blue', '4483'),
    ('C', 'Blue', '1103'),
    ('C', 'Blue', '1906'),
    ('C', 'Blue', '691'),
    ('C', 'Green', '38'),
    ('C', 'Green', '149'),
    ('C', 'Green', '87'),
    ('C', 'Green', '33'),
    ('C', 'Green', '1427'),
    ('C', 'Green', '1009'),
    ('C', 'Green', '342'),
    ('C', 'Green', '190'),
]

df = pd.DataFrame(records, columns=['Type', 'Color', 'Spots'])

types = set(df['Type'])
print(f'Types:\n{types}\n')

colors = set(df['Color'])
print(f'Colors:\n{colors}\n')

groups = list(itertools.combinations(colors, 2))

for t in types:
    for c in groups:
        spots1 = df[(df['Type'] == t) & (df['Color'] == c[0])]['Spots'].astype('int64')
        spots2 = df[(df['Type'] == t) & (df['Color'] == c[1])]['Spots'].astype('int64')

        print(f'{c[0]}:\n{spots1}\n')
        print(f'{c[1]}:\n{spots2}\n')

        # Do you want to align these sequences before the test? 
        # Then, zip might be useful.
        for cp in zip(spots1, spots2):
            print(cp)
        print()

        mwu = scipy.stats.mannwhitneyu(spots1, spots2)
        print(f'Type {t}: {c[0]} vs. {c[1]} statistic = {mwu.statistic}, p-value = {mwu.pvalue}')

This will result in the following statistics:

Type B: Green vs. Blue statistic = 15.0, p-value = 0.151981351981352
Type C: Green vs. Blue statistic = 15.0, p-value = 0.151981351981352
Type A: Green vs. Blue statistic = 30.0, p-value = 0.534965034965035
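If you would rather collect these results into a small table than print them, here is a hedged sketch reusing df, types and groups from the script above:

results = []
for t in types:
    for c in groups:
        x = df[(df['Type'] == t) & (df['Color'] == c[0])]['Spots'].astype('int64')
        y = df[(df['Type'] == t) & (df['Color'] == c[1])]['Spots'].astype('int64')
        mwu = scipy.stats.mannwhitneyu(x, y)
        results.append((t, c[0], c[1], mwu.statistic, mwu.pvalue))

results_df = pd.DataFrame(results, columns=['Type', 'Color1', 'Color2', 'statistic', 'pvalue'])
print(results_df.sort_values('Type'))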

I am not sure if that is what you are looking for, but I hope this will at least help you towards getting your data transformed in such a way that you can make progress on your issue. For your validation, here is how my code would line up your data:

for p in zip(green, blue):
    print(p)

The pairs this will generate:

(134, 792)
(631, 56)
(1010, 2726)
(53, 780)
(5826, 591)
(6409, 2867)
(3278, 193)

(26, 670)
(143, 42)
(399, 1165)
(68, 3203)
(939, 2164)
(1528, 5876)
(401, 525)

(38, 265)
(149, 19)
(87, 1381)
(33, 4483)
(1427, 1103)
(1009, 1906)
(342, 691)

I am aware that there is some redundant code in here, but I wanted to make sure to touch on everything relevant. If there are more colors than green and blue, the itertools.combinations call above already generates all possible pairs for you. Depending on how you read your data and how it is formatted, you might also need to adjust the step that transforms the records into a Pandas dataframe, or remove those lines entirely if you already have that step covered. I assume everything to be strings, as might be the case when data is read from a CSV file without type information.

If you need to read the data, formatted the way you present it, from a CSV file, you could use the following. A separator of r'\s+' will match any run of whitespace. So, a compact version stripped of all the explanatory code and the data records could look like this:

import pandas
import scipy.stats
import itertools

data = pandas.read_csv('data.csv', sep=r'\s+')
types = set(data[data.columns[0]])
colors = set(data[data.columns[1]])
groups = list(itertools.combinations(colors, 2))

for t in types:
    for c in groups:
        sel_x = (data[data.columns[0]] == t) & (data[data.columns[1]] == c[0])
        x = data[sel_x][data.columns[2]].astype('int64')
        sel_y = (data[data.columns[0]] == t) & (data[data.columns[1]] == c[1])
        y = data[sel_y][data.columns[2]].astype('int64')
        print(f'{t} ({c[0]} vs. {c[1]}): {scipy.stats.mannwhitneyu(x, y)}')

Output:

B (Green vs. Blue): MannwhitneyuResult(statistic=15.0, pvalue=0.151981351981352)
A (Green vs. Blue): MannwhitneyuResult(statistic=30.0, pvalue=0.534965034965035)
C (Green vs. Blue): MannwhitneyuResult(statistic=15.0, pvalue=0.151981351981352)

As you can see, they might be out of order, which is due to using a set. I assume that is not an issue, but if it is, sort them first:

types = sorted(set(df['Type']))

You can also use Pandas' groupby and unique. I do not think there is a straightforward way to select all pairs of groups, which it appears you need, but Pandas does let you group by more than one column; a sketch of that follows a little further below.

import pandas
import scipy.stats
import itertools

df = pandas.read_csv('data.csv', sep=r'\s+')
cp = list(itertools.combinations(df[df.columns[1]].unique(), 2))
for key, group in df.groupby(df.columns[0]):
    for c in cp:
        x = group[group[df.columns[1]] == c[0]][df.columns[2]].astype('int64')
        y = group[group[df.columns[1]] == c[1]][df.columns[2]].astype('int64')
        print(f'Type {key} ({c[0]} vs {c[1]}): {scipy.stats.mannwhitneyu(x, y)}')

Documentation for groupby can be found under https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html.
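Since I mentioned grouping by more than one column, here is a hedged sketch of that variant, assuming Blue and Green are the only colors, as in your sample:

import pandas
import scipy.stats

df = pandas.read_csv('data.csv', sep=r'\s+')
# Collect the spot values for every (Type, Color) pair into a list.
spots = df.groupby(['Type', 'Color'])['Spots'].apply(list)
for t in spots.index.get_level_values('Type').unique():
    blue = spots.loc[(t, 'Blue')]
    green = spots.loc[(t, 'Green')]
    print(f'Type {t} (Blue vs. Green): {scipy.stats.mannwhitneyu(blue, green)}')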

I am rusty with the statistics part, so I might not be able to help you with that right away. However, you seem to be in the know on that part, so if I can help you out with the Python portion and you fill me in where I might get the statistical analysis wrong, I am confident that we can solve this.

For other readers, the documentation for the scipy implementation can be found under https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html:

The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. It is often used as a test of difference in location between distributions.

More explanation for the Mann-Whitney test can be found under https://en.wikipedia.org/wiki/Mann–Whitney_U_test. I hope that including these references will make both question and answer more useful to other readers and easier to follow along. Roughly speaking, what you are probably interested in are the statistical differences in occurrence of green and blue spots between different types of objects being observed. Discussing the applicability of this statistic, given the nature and distribution of the data, I understand to be outside the scope of this question.
