I have a data table similar to this one (but huge), with many more types and more "Spots" cells for each "Color":
Type Color Spots
A Blue 792
A Blue 56
A Blue 2726
A Blue 780
A Blue 591
A Blue 2867
A Blue 193
A Green 134
A Green 631
A Green 1010
A Green 53
A Green 5826
A Green 6409
A Green 3278
B Blue 670
B Blue 42
B Blue 1165
B Blue 3203
B Blue 2164
B Blue 5876
B Blue 525
B Green 26
B Green 143
B Green 399
B Green 68
B Green 939
B Green 1528
B Green 401
B Green 1842
C Blue 265
C Blue 19
C Blue 1381
C Blue 4483
C Blue 1103
C Blue 1906
C Blue 691
C Green 38
C Green 149
C Green 87
C Green 33
C Green 1427
C Green 1009
C Green 342
C Green 190
I want to do a SciPy mannwhitneyu analysis comparing Blue vs Green spots for each type. For instance, for type A it would be this comparison, and the same for all the types automatically:
Blue Green
792 134
56 631
2726 1010
780 53
591 5826
2867 6409
193 3278
I thought that defining those kinds of groups in Pandas and then passing them to SciPy should be the strategy, but my skills are not at that level yet. The idea is to do it automatically for all of the types, so I get the p-value for A, B, C, etc. Could somebody give me a hint? Thanks
CodePudding user response:
Your question might leave implicit a lot that is obvious to you but not to people who are less familiar with the sort of statistical analysis you are interested in. That can make it difficult to help you along, but by trying to cover all my bases, I think I can help you regardless.
The first implied assumption appears to be that, when you select by type, the rows you have for each color (which I do not assume to be consecutive in my code) match in number. In your example there are seven rows for each type and color, but you also say that there will be more in your final data. The approach below applies either way; if you want to pair values up row by row, you can insert a length check with len() (a small sketch of that follows a few lines below), but note that scipy's Mann-Whitney function itself does not require x and y to have the same length. What you would like to do, it seems, is to select groups from your Pandas dataframe, which is a common operation. For example, to get all types from your dataframe, you could write:
set(df['Type'])
Then, you can iterate over those types:
for t in types:
    # ...
Similar for your colors.
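If you do want the values paired up row by row and the group sizes might differ, a simple length check catches that early. This is only a sketch; green and blue stand for the two per-type selections shown further down in this answer:
# Optional guard, only needed if you insist on pairing values row by row.
if len(green) != len(blue):
    raise ValueError(f'Unequal group sizes: {len(green)} green vs. {len(blue)} blue')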
Wanting to do a Mann-Whitney test probably also implies that you always want to compare a pair of lists, one green and one blue. I assume those are the only colors, but if you have more, you can generate all combinations like so:
p = itertools.combinations(['Blue', 'Red', 'Green', 'Yellow'], 2)
print(*p)
This will print:
('Blue', 'Red') ('Blue', 'Green') ('Blue', 'Yellow') ('Red', 'Green') ('Red', 'Yellow') ('Green', 'Yellow')
I do not know if you consider this over-engineering, but hardcoding fewer assumptions and allowing for more flexibility is usually best, unless it also admits cases that should be caught as errors in your data. Hardcode "Blue" and "Green" if you prefer that; other readers of this question might be interested in the more general approach. You could do the same with "Spots" and avoid hardcoding that column name too, in case you have more than just this one metric; a small sketch of that follows, and it should be clear from there how to extend the approach.
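For instance, here is a minimal sketch of that more general approach, assuming that every column other than 'Type' and 'Color' is a metric you would want to test (the column names are just the ones from your example):
# All columns that are not grouping columns are treated as metrics.
metric_columns = [col for col in df.columns if col not in ('Type', 'Color')]
for metric in metric_columns:
    # 'metric' would be 'Spots' in your example, but this also covers any
    # additional measurement columns you might add later.
    print(metric)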
The next hint that might help you is to select, in such a loop, by two criteria, which will be your grouping criteria:
df[(df['Type'] == t) & (df['Color'] == 'Green')]['Spots']
The expression within df[...] is a boolean array. It is in turn used to select all rows corresponding to True values within that array. This is one of the main techniques by which more complex queries are typically built with Pandas dataframes, and it might be the key insight you have been looking for.
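As a side note, the same selection can also be written with df.loc or df.query, which some people find easier to read; both of these are equivalent to the boolean-mask version above:
# Equivalent ways to select the 'Spots' column for one type and one color.
green = df.loc[(df['Type'] == t) & (df['Color'] == 'Green'), 'Spots']
green = df.query("Type == @t and Color == 'Green'")['Spots']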
Do the same with the blue spots. Then, if you want to line the two sequences up as pairs, there is a builtin function called zip:
zip(green, blue)
It will give you an iterator, so convert it to a list if that is undesirable. The sequences passed do not need to be of equal length; the result will be as long as the shortest of them. This pairing is only for inspection anyway: even though your data appears to have the same number of measurements for each type and color, equal sample sizes are not a requirement of the Mann-Whitney test.
Depending on how you read your data, it might also be useful to know how to convert column types. You might not need this step if your data types are already numeric, but by mentioning it I cover all my bases and might spare you another roadblock later down the line:
green = df[(df['Type'] == t) & (df['Color'] == 'Green')]['Spots'].astype('int64')
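If some of the entries might not be clean integers, pandas.to_numeric is a slightly more forgiving alternative to astype; this is just an option, not something your example data needs:
# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
green = pd.to_numeric(df[(df['Type'] == t) & (df['Color'] == 'Green')]['Spots'], errors='coerce').dropna()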
Putting all of this together, you might want to write something like this:
import pandas as pd
import scipy.stats
import itertools
records = [
    ('A', 'Blue', '792'),
    ('A', 'Blue', '56'),
    ('A', 'Blue', '2726'),
    ('A', 'Blue', '780'),
    ('A', 'Blue', '591'),
    ('A', 'Blue', '2867'),
    ('A', 'Blue', '193'),
    ('A', 'Green', '134'),
    ('A', 'Green', '631'),
    ('A', 'Green', '1010'),
    ('A', 'Green', '53'),
    ('A', 'Green', '5826'),
    ('A', 'Green', '6409'),
    ('A', 'Green', '3278'),
    ('B', 'Blue', '670'),
    ('B', 'Blue', '42'),
    ('B', 'Blue', '1165'),
    ('B', 'Blue', '3203'),
    ('B', 'Blue', '2164'),
    ('B', 'Blue', '5876'),
    ('B', 'Blue', '525'),
    ('B', 'Green', '26'),
    ('B', 'Green', '143'),
    ('B', 'Green', '399'),
    ('B', 'Green', '68'),
    ('B', 'Green', '939'),
    ('B', 'Green', '1528'),
    ('B', 'Green', '401'),
    ('B', 'Green', '1842'),
    ('C', 'Blue', '265'),
    ('C', 'Blue', '19'),
    ('C', 'Blue', '1381'),
    ('C', 'Blue', '4483'),
    ('C', 'Blue', '1103'),
    ('C', 'Blue', '1906'),
    ('C', 'Blue', '691'),
    ('C', 'Green', '38'),
    ('C', 'Green', '149'),
    ('C', 'Green', '87'),
    ('C', 'Green', '33'),
    ('C', 'Green', '1427'),
    ('C', 'Green', '1009'),
    ('C', 'Green', '342'),
    ('C', 'Green', '190'),
]
df = pd.DataFrame(records, columns=['Type', 'Color', 'Spots'])
types = set(df['Type'])
print(f'Types:\n{types}\n')
colors = set(df['Color'])
print(f'Colors:\n{colors}\n')
groups = list(itertools.combinations(colors, 2))
for t in types:
    for c in groups:
        spots1 = df[(df['Type'] == t) & (df['Color'] == c[0])]['Spots'].astype('int64')
        spots2 = df[(df['Type'] == t) & (df['Color'] == c[1])]['Spots'].astype('int64')
        print(f'{c[0]}:\n{spots1}\n')
        print(f'{c[1]}:\n{spots2}\n')
        # Do you want to align these sequences before the test?
        # Then, zip might be useful.
        for cp in zip(spots1, spots2):
            print(cp)
        print()
        mwu = scipy.stats.mannwhitneyu(spots1, spots2)
        print(f'Type {t}: {c[0]} vs. {c[1]} statistic = {mwu.statistic}, p-value = {mwu.pvalue}')
This will result in the following statistics:
Type B: Green vs. Blue statistic = 15.0, p-value = 0.151981351981352
Type C: Green vs. Blue statistic = 15.0, p-value = 0.151981351981352
Type A: Green vs. Blue statistic = 30.0, p-value = 0.534965034965035
I am not sure if that is exactly what you are looking for, but I hope it at least helps you get your data transformed in such a way that you can make progress. For your validation, here is how my code would line up your data:
for p in zip(green, blue):
    print(p)
The pairs this will generate:
(134, 792)
(631, 56)
(1010, 2726)
(53, 780)
(5826, 591)
(6409, 2867)
(3278, 193)
(26, 670)
(143, 42)
(399, 1165)
(68, 3203)
(939, 2164)
(1528, 5876)
(401, 525)
(38, 265)
(149, 19)
(87, 1381)
(33, 4483)
(1427, 1103)
(1009, 1906)
(342, 691)
I am aware that there is some redundant code in here, but I wanted to make sure to touch on everything relevant. If there are more colors than green and blue then, as mentioned, you want to generate all possible pairs, which is exactly what the itertools.combinations call above does. Depending on how you read your data and how it is formatted, you might also need to adjust the step that turns the records into a Pandas dataframe, or remove those lines entirely if you already have that step covered. I assume everything to be strings, as it might be when data is read from a CSV file lacking type information.
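If you do build the dataframe from string records as above, you can also convert the 'Spots' column once, right after constructing the dataframe, instead of calling astype inside the loop; a small sketch:
df = pd.DataFrame(records, columns=['Type', 'Color', 'Spots'])
# Convert once, so every later selection already yields integers.
df['Spots'] = df['Spots'].astype('int64')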
If you need to read the data, formatted the way you present it, from a CSV file, you could use the following. A separator of r'\s+' will match runs of whitespace. So, a compact version, stripped of all the explanatory code and the inline data records, could look like this:
import pandas
import scipy.stats
import itertools
data = pandas.read_csv('data.csv', sep=r'\s+')
types = set(data[data.columns[0]])
colors = set(data[data.columns[1]])
groups = list(itertools.combinations(colors, 2))
for t in types:
    for c in groups:
        sel_x = (data[data.columns[0]] == t) & (data[data.columns[1]] == c[0])
        x = data[sel_x][data.columns[2]].astype('int64')
        sel_y = (data[data.columns[0]] == t) & (data[data.columns[1]] == c[1])
        y = data[sel_y][data.columns[2]].astype('int64')
        print(f'{t} ({c[0]} vs. {c[1]}): {scipy.stats.mannwhitneyu(x, y)}')
Output:
B (Green vs. Blue): MannwhitneyuResult(statistic=15.0, pvalue=0.151981351981352)
A (Green vs. Blue): MannwhitneyuResult(statistic=30.0, pvalue=0.534965034965035)
C (Green vs. Blue): MannwhitneyuResult(statistic=15.0, pvalue=0.151981351981352)
As you can see, the types might come out of order, which is due to us using a set. I assume that is not an issue, but if it is, sort them first:
types = sorted(set(df['Type']))
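If you would rather collect the p-values per type in one place instead of printing them, you can append the results to a list and build a small results dataframe at the end. This is a sketch based on the compact loop above, reusing its data and groups variables:
results = []
for t in sorted(set(data[data.columns[0]])):
    for c in groups:
        sel_x = (data[data.columns[0]] == t) & (data[data.columns[1]] == c[0])
        sel_y = (data[data.columns[0]] == t) & (data[data.columns[1]] == c[1])
        x = data[sel_x][data.columns[2]].astype('int64')
        y = data[sel_y][data.columns[2]].astype('int64')
        mwu = scipy.stats.mannwhitneyu(x, y)
        results.append((t, c[0], c[1], mwu.statistic, mwu.pvalue))
results_df = pandas.DataFrame(results, columns=['Type', 'Color_x', 'Color_y', 'statistic', 'p-value'])
print(results_df)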
You can also use Pandas' groupby and unique. I do not think there is a straightforward way to select all pairs of groups, which it appears you need, but Pandas does have the ability to group by more than one column at once (see the sketch after the documentation link below).
import pandas
import scipy.stats
import itertools
df = pandas.read_csv('data.csv', sep=r'\s+')
cp = list(itertools.combinations(df[df.columns[1]].unique(), 2))
for key, group in df.groupby(df.columns[0]):
    for c in cp:
        x = group[group[df.columns[1]] == c[0]][df.columns[2]].astype('int64')
        y = group[group[df.columns[1]] == c[1]][df.columns[2]].astype('int64')
        print(f'Type {key} ({c[0]} vs {c[1]}): {scipy.stats.mannwhitneyu(x, y)}')
Documentation for groupby can be found under https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html.
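And, as mentioned, here is a minimal sketch of grouping by both columns at once; it gives you each (type, color) series directly, but you still have to pair the colors up yourself before calling the test. It reuses df and cp from the snippet above:
# Each key is a (type, color) tuple; each value is the corresponding 'Spots' series.
per_group = {key: group[df.columns[2]].astype('int64')
             for key, group in df.groupby([df.columns[0], df.columns[1]])}
for t in sorted(df[df.columns[0]].unique()):
    for c in cp:
        result = scipy.stats.mannwhitneyu(per_group[(t, c[0])], per_group[(t, c[1])])
        print(f'Type {t} ({c[0]} vs {c[1]}): {result}')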
I am rusty with the statistics part, so I might not be able to help you with that right away. However, you seem to be in the know on that part, so if I can help you out with the Python portion and you fill me in where I might get the statistical analysis wrong, I am confident that we can solve this.
For other readers, the documentation for the scipy implementation can be found under https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html:
The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. It is often used as a test of difference in location between distributions.
More explanation for the Mann-Whitney test can be found under https://en.wikipedia.org/wiki/Mann–Whitney_U_test. I hope that including these references will make both question and answer more useful to other readers and easier to follow along. Roughly speaking, what you are probably interested in are the statistical differences in occurrence of green and blue spots between different types of objects being observed. Discussing the applicability of this statistic, given the nature and distribution of the data, I understand to be outside the scope of this question.