How to produce a Chi-square distribution table With p-values for three vectors (two categorical and

Time:10-06

Issue

I have a data frame with three columns, two categorical (Whistle_Type and Country) and one numeric (n, the counts of whistle types A-F), which I produced with the count() function from dplyr (see below). I want to run a Chi-square test to determine whether the whistle types differ significantly between the countries Germany and France.

I want to use a function to create a table showing the p-values from the Chi-square test. I would like to produce something like this:

Desired Distribution table with p-values

                          A  B  C  D  E  F
                   France p  p  p  p  p  p
                  Germany p  p  p  p  p  p

                   *p stands for p-values

I can't quite figure out how to manipulate the function to produce the outcome that I would like. I don't understand this error message, as I am passing both a data frame and a list into the function:

Error in model.frame.default(formula = as.formula(paste(x, " ~ Country")),  : 
  'data' must be a data.frame, environment, or list
Called from: model.frame.default(formula = as.formula(paste(x, " ~ Country")), 
    data = Count.Whistle.type_ChiSq$n)

If anyone is able to help (see the reproducible data frame below), I would be deeply appreciative.

R code

Produce a list showing counts of whistle types per country using the function count()

Count.Whistle.type_ChiSq <- Whistle_Parameters %>% dplyr::count(Whistle_Type, Country)
Count.Whistle.type_ChiSq

List of counts of whistle types per country

       Whistle_Type Country   n
1             A      France  90
2             A      Germany 70
3             B      France  34
4             B      Germany 10
5             C      France  24
6             C      Germany  9
7             D      France  44
8             D      Germany 25
9             E      France  21
10            E      Germany 39
11            F      France  25
12            F      Germany 32
    

Chi-Square function

#List of whistle types to use in the Chi-square test
Outcomes_Whistle_Types<-c("A", "B","C", "D", "E", "F")

#Eliminate the duplicate rows present in the vector country
Country <- unique(Parameters$Country)

#Produce a distribution table with p-values for the Chi-square test
Chi_Whistle<-sapply(Outcomes_Whistle_Types, \(x) chisq.test(xtabs(as.formula(paste(x, ' ~ Country')), Count.Whistle.type_ChiSq$n))$p.value)

#Set the names for the columns and rows in the distribution table 
chi_Country <- setNames(Chi_Whistle, Country)

#Chi-Square test
chi_Square_results<-lapply(chi_Country, chisq.test)
chi_Square_results

Many thanks in advance

Reproducible Dataframe

#Dummy data
#Create a two-level grouping factor with dummy data (335 values)
f1 <- gl(n = 2, k=167.5); f1

#Produce a data frame for the dummy level data
f2<-as.data.frame(f1)

#Rename the column f2
colnames(f2)<-"Country"

#How many rows
nrow(f2)

#Rename the levels of the dependent variable 'Country' as classifiers
#prefer the inputs to be factors
levels(f2$Country) <- c("France", "Germany")

#Add a vector of randomly sampled whistle types
Whistle_Types<-sample(c('A', 'B', 'C', 'D',
                     'E', 'F'), 335, replace=TRUE)

#Create random numbers
Start.Freq<-runif(335, min=1.195110e+02, max=23306.000000)
End.Freq<-runif(335, min=3.750000e+02, max=65310.000000)
Delta.Time<-runif(335, min=2.192504e-02, max=3.155762)
Low.Freq<-runif(335, min=6.592500e+02, max=20491.803000)
High.Freq<-runif(335, min=2.051000e+03, max=36388.450000)
Peak.Freq<-runif(335, min=7.324220e+02, max=35595.703000)
Center.Freq<-runif(335, min=2.190000e-02, max=3.155800)
Delta.Freq<-runif(335, min=1.171875e+03, max=30761.719000)
Delta.Time<-runif(335, min=2.192504e-02, max=3.155762)

#Bind the columns together
Bind<-cbind(f2, Start.Freq, End.Freq,  Low.Freq, High.Freq, Peak.Freq,  Center.Freq, Delta.Freq, Delta.Time, Whistle_Types)

#Rename the columns 
colnames(Bind)<-c('Country', 'Start.Freq', 'End.Freq', 'Low.Freq', 'High.Freq', 'Peak.Freq', 'Center.Freq', 
                  'Delta.Freq', 'Delta.Time',"Whistle_Type")

#Produce a dataframe
Whistle_Parameters<-as.data.frame(Bind)

Answer

To be honest, I'm not sure about your desired output. What p-values do you want to show for each combination of country x whistle type?

We can easily calculate one p-value that tests whether the distribution of whistle types differs by country.

This is similar to the first example in the docs of ?chisq.test(), which is taken from Agresti, A. (2007), page 38.

For this we just need the Whistle_Parameters data: table() creates a contingency table, which we can then use as input for chisq.test().

freq_tbl <- table(Whistle_Parameters$Country, Whistle_Parameters$Whistle_Type) 
freq_tbl
#>          
#>            A  B  C  D  E  F
#>   France  28 24 29 25 24 38
#>   Germany 35 32 21 19 40 20

chisq.test(freq_tbl)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  freq_tbl
#> X-squared = 13.602, df = 5, p-value = 0.01834
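
If only the counted data frame from the question is available, xtabs() can rebuild the contingency table from it. The error in the question arises because Count.Whistle.type_ChiSq$n is a plain numeric vector, while the data argument of xtabs() expects a data frame, environment, or list; the formula A ~ Country also refers to a column A that does not exist in the counted data. Below is a minimal sketch, assuming the counts are exactly those printed in the question. The data frame is retyped by hand so the example is self-contained, and count_tbl and res are names introduced here for illustration.

```r
# Counts as printed in the question, retyped by hand (assumption: these
# are the real values of Count.Whistle.type_ChiSq)
Count.Whistle.type_ChiSq <- data.frame(
  Whistle_Type = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Country      = rep(c("France", "Germany"), times = 6),
  n            = c(90, 70, 34, 10, 24, 9, 44, 25, 21, 39, 25, 32)
)

# The count column goes on the left-hand side of the formula, and the
# whole data frame (not the $n column) is passed as data
count_tbl <- xtabs(n ~ Country + Whistle_Type, data = Count.Whistle.type_ChiSq)
count_tbl

# One test of whether the whistle type distribution differs by country
res <- chisq.test(count_tbl)
res$p.value
```

As with freq_tbl above, this gives a single p-value for the whole 2 x 6 table rather than one per cell.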

The random data with set.seed()

set.seed(123)
#Dummy data
#Create a two-level grouping factor with dummy data (335 values)
f1 <- gl(n = 2, k=167.5); f1
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [334] 2 1
#> Levels: 1 2

#Produce a data frame for the dummy level data
f2<-as.data.frame(f1)

#Rename the column f2
colnames(f2)<-"Country"

#How many rows
nrow(f2)
#> [1] 335

#Rename the levels of the dependent variable 'Country' as classifiers
#prefer the inputs to be factors
levels(f2$Country) <- c("France", "Germany")

#Add a vector of randomly sampled whistle types
Whistle_Types<-sample(c('A', 'B', 'C', 'D',
                        'E', 'F'), 335, replace=TRUE)

#Create random numbers
Start.Freq<-runif(335, min=1.195110e+02, max=23306.000000)
End.Freq<-runif(335, min=3.750000e+02, max=65310.000000)
Delta.Time<-runif(335, min=2.192504e-02, max=3.155762)
Low.Freq<-runif(335, min=6.592500e+02, max=20491.803000)
High.Freq<-runif(335, min=2.051000e+03, max=36388.450000)
Peak.Freq<-runif(335, min=7.324220e+02, max=35595.703000)
Center.Freq<-runif(335, min=2.190000e-02, max=3.155800)
Delta.Freq<-runif(335, min=1.171875e+03, max=30761.719000)
Delta.Time<-runif(335, min=2.192504e-02, max=3.155762)

#Bind the columns together
Bind<-cbind(f2, Start.Freq, End.Freq,  Low.Freq, High.Freq, Peak.Freq,  Center.Freq, Delta.Freq, Delta.Time, Whistle_Types)

#Rename the columns 
colnames(Bind)<-c('Country', 'Start.Freq', 'End.Freq', 'Low.Freq', 'High.Freq', 'Peak.Freq', 'Center.Freq', 
                  'Delta.Freq', 'Delta.Time',"Whistle_Type")

#Produce a dataframe
Whistle_Parameters<-as.data.frame(Bind)

Created on 2022-10-06 by the reprex package (v2.0.1)
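
On the desired Country x Whistle_Type table of p-values: chisq.test() yields a single p-value for the whole table, not one per cell. One possible post-hoc reading, offered as an assumption about what is wanted rather than a definitive answer, is to take the standardized residuals that chisq.test() returns in $stdres, treat them as approximately standard normal, and convert them to two-sided p-values adjusted for multiple comparisons. The counts below are retyped from the question; count_tbl, cell_p and cell_p_adj are names introduced here for illustration.

```r
# Counts as printed in the question (rows: Country, columns: Whistle_Type)
count_tbl <- matrix(
  c(90, 34, 24, 44, 21, 25,
    70, 10,  9, 25, 39, 32),
  nrow = 2, byrow = TRUE,
  dimnames = list(Country = c("France", "Germany"),
                  Whistle_Type = c("A", "B", "C", "D", "E", "F"))
)

res <- chisq.test(count_tbl)

# Standardized residuals are approximately N(0, 1) under independence,
# so each one can be turned into a two-sided p-value
cell_p <- 2 * pnorm(-abs(res$stdres))

# Adjust for the 12 cell-wise comparisons (Holm's method is one option);
# assigning into [] keeps the table shape and dimnames
cell_p_adj <- cell_p
cell_p_adj[] <- p.adjust(cell_p, method = "holm")
round(cell_p_adj, 4)
```

With only two countries, the two standardized residuals in each column are equal in magnitude with opposite signs, so both rows carry the same p-value; in effect this gives one p-value per whistle type.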
