One of the most desired goals of the modern Animal/Plant Breeding or Human Genetics Research is to map genes that affect important quantitative traits. Traits such as: mastitis resistance (animal breeding), starch composition (plant breeding) or obesity (human research) are important traits and their comprehension may have a huge effect in the society. However, such traits have a complex nature since they are normally regulated by multiple genes (not Mendelian).
With the emergence of high-throughput genotyping technologies and the increase in number and density of genotyped markers (SNPs) it has been possible to execute genome-wide association studies (GWAS) in different organisms. In this post I have been using a simulated phenotypic data and human genotypes from open.snp.orgs (plublic available data). The intention with this post is to show a very simple pipeline for running GWAS analysis using PLINK. More complex models (that takes into account population structure or familial relatedness may come in the next posts). I will consider that you have already the files ready to input. It consists of 3 files:
[file].bed: contains the genotype information in binary format
[file].bim: contains the chromosome, position and alleles of all the SNPs in the data set.
[file].fam: contains father ID, mother ID, gender and phenotype for each sample
This data is compost of 1101 people (0 males, 1101 females), where the total genotyping rate is 0.999987 and have a total of 191656 variants before any quality control. Among the phenotypes, 531 are cases and 570 are controls.
One of the more straightforward analysis that one could do with the case control arranged data is to execute a Contingency Table test of independence. Both Chi square and Fisher statistics could be used. But if the sample size is small, one would use the Fisher’s exact test, because it gives the exact probability of any arrangement in the table.
Therefore, to test for association between SNPs and disease status using an allelic Fisher’s
exact test, type the following command at the shell prompt:
plink –bfile project1 –fisher –out raw
sort -gk8 raw.assoc.fisher | head -15
Many false positives may be produced because we have set many statistical inferences simultaneously, therefore we need to correct for multiple tests .
Bonferroni Correction or FDR:
Bonferroni correction is a very conservative method, because it gives a very hard threshold for correction. It basically normalizes the alpha significance level by the number of tests that are executed. So if you have 1000 snps and you are doing single marker analysis, each marker is gonna be tested each time, therefore the alpha significance after correction is: 0.01/1000 tests = 0.00001.
The plink command to do this test is:
plink –bfile project1 –fisher –adjust –out adj
For more details of the output you can access link: multiple test
As you can see from the image above, the unique significant snp is rs7655371 (p-value = 0.00153), which is localized in chromosome 4.
Make QQ plot and Manhattan plot
Two standard ways of visualizing the results from GWAS is by producing the QQ-plot and Manhattam plot. You could do it in R, python or any language that you feel comfortable with. The qqman R package is my suggestion to produce them. And a very nice tutorial about how it is found here.
The figure is produced by plotting the negative logarithms of the P-values from the models fitted in GWAS against their expected value under the null hypothesis of no association with the trait (red line).
It’s expected that most of the SNPs are not associated with the trait and deviations from the red line suggest presence of spurious associations due to population structure and/or familial relatedness. It seems that there is an inflation (false positives). Therefore, we may need to do some Quality Control to reduce the noisy.
The Manhattan plot is a way to visualize the position across genome of the significant markers based on their p-values.
The P values are represented in genomic order by chromosome and position on the chromosome (x-axis). From this plot, if we consider the threshold of 8 (10^-8) is evident that the unique chromosome with QTL effects is the chromosome 4. However, if we relax our threshold a bit we may find more significant SNPs.
This is not the end of the analysis, we may still need to clean up our data for rare alleles (MAF > 0.05), exclude those SNPs that are not in HWE and may exclude snp or individuals with too many missing genotypes. Next post I will tell you how to do it using PLINK.