GWAS using PLINK-1


, , , ,

One of the most desired goals of the modern Animal/Plant Breeding or Human Genetics Research is to map genes that affect important quantitative traits. Traits such as: mastitis resistance (animal breeding), starch composition (plant breeding) or obesity (human research) are important traits and their comprehension may have a huge effect in the society. However, such traits have a complex nature since they are normally regulated by multiple genes (not Mendelian).

With the emergence of high-throughput genotyping technologies and the increase in number and density of genotyped markers (SNPs) it has been possible to execute genome-wide association studies (GWAS) in different organisms. In this post I have been using a simulated phenotypic data and human genotypes from open.snp.orgs (plublic available data).  The intention with this post is to show a very simple pipeline for running GWAS analysis using PLINK. More complex models (that takes into account population structure or familial relatedness may come in the next posts).  I will consider that you have already the files ready to input. It consists of 3 files:

[file].bed: contains the genotype information in binary format
[file].bim: contains the chromosome, position and alleles of all the SNPs in the data set.
[file].fam: contains father ID, mother ID, gender and phenotype for each sample

This data is compost of 1101 people (0 males, 1101 females), where the total genotyping rate is 0.999987 and have a total of 191656 variants before any quality control. Among the phenotypes, 531 are cases and 570 are controls.

One of the more straightforward analysis that one could do with the case control arranged data is to execute a Contingency Table test of independence. Both Chi square and Fisher statistics could be used. But if the sample size is small, one would use the Fisher’s exact test, because  it gives the exact probability of any arrangement in the table.


Therefore, to test for association between SNPs and disease status using an allelic Fisher’s
exact test, type the following command at the shell prompt:

plink –bfile project1 –fisher –out raw

sort -gk8 raw.assoc.fisher | head -15


Many false positives may be produced because we have set many statistical inferences simultaneously, therefore we need to correct for multiple tests .

Bonferroni Correction or FDR:

Bonferroni correction is a very conservative method, because it gives a very hard threshold for correction. It basically normalizes the alpha significance level by the number  of tests that are executed. So if you have 1000 snps and you are doing single marker analysis, each marker is gonna be tested each time, therefore the alpha significance after correction is: 0.01/1000 tests = 0.00001.

The plink command to do this test is:

plink –bfile project1 –fisher –adjust –out adj


For more details of the output you can access link: multiple test

As you can see from the image above, the unique significant snp is rs7655371  (p-value = 0.00153), which is localized in chromosome 4.

Make QQ plot and Manhattan plot

Two standard ways of visualizing the results from GWAS is by producing the QQ-plot and Manhattam plot. You could do it in R, python or any language that you feel comfortable with. The qqman R package is my suggestion to produce them. And a very nice tutorial about how it is found here.


The figure is produced by plotting the negative logarithms of the P-values from the models fitted in GWAS against their expected value under the null hypothesis of no association with the trait (red line).


It’s expected that most of the SNPs are not associated with the trait and deviations from the red line suggest presence of spurious associations due to population structure and/or familial relatedness. It seems that there is an inflation (false positives). Therefore, we may need to do some Quality Control to reduce the noisy.


The Manhattan plot is a way to visualize the position across genome of the significant markers based on their p-values. Figure3

The P values are represented in genomic order by chromosome and position on the chromosome (x-axis). From this plot, if we consider the threshold of 8 (10^-8) is evident that the unique chromosome with QTL effects is the chromosome 4. However, if we relax our threshold a bit we may find more significant SNPs.

This is not the end of the analysis, we may still need to clean up our data for rare alleles (MAF > 0.05), exclude those SNPs that are not in HWE and may exclude snp or individuals with too many missing genotypes. Next post I will tell you how to do it using PLINK.


Spot detection in proteomics


, , ,

Two dimensional polyacrylamide gel electrophoresis (2-D PAGE) is a powerful technique in proteomics aiming at protein separation and identification. The image processing of these gels has been playing a big rule in the success of protein identification, mainly because noise, dust particles, spot overlapping, fingerprints, and cracks can reduce the capacity of detection of true spots.

Approaches as thresholding and watershed algorithms have been used in spot detection, both have their pros and cons and it’s important to know which one should be used depending on the quality of the image. Two programs of those algorithm were implemented in python language. The most used libraries were skicit-image (for image processing in python), numpy to deal with arrays, scipy to easily located the centroids of each spot and some other modules. To be able to run the program it is necessary that both image and python script be placed in the same folder. The command used to run the thresholding algorithm is:

python inimage.png outimage.png
Where the first argument is the name of the script, the second argument is the name of the 2DPAGE image to be processed and the last argument is the name of the output image. The type of image (gif, png, jpg…) it is up to de user and can be modified. Besides the image output, the program prints the number of spots/segments detected in the input image and the correspondent intensity of that spot in the original image. To run the watershed program is similar:

python inimage.png outimage.png
The implementation of the thresholding algorithm was not hard, especially with the help of the available libraries in python for image processing. The algorithm is dependent of defining the ‘right’ threshold, therefore it is necessary to be very careful with this decision because relaxing the threshold may increase the number of false positive. Another problem of threshold is that fail in the presence of artifacts and noise, since we are just looking at the intensity of the colors it gets harder to detect what is or not a segment. The implementation of watershed algorithm was bit a more complicated, at the beginning was hard to understand the method but until the end I was able to see that watershed was more sophisticated. To reduce the noise of the matrix M was easier than I expected with the help of skimage.morphology. Depending on the way that we pick the markers, watersheds may lead to the detection of false positives. Furthermore, smoothing the image didn’t help me to increase the quality of the detection.

These images are representing the three steps used in thresholding: gel image input, the detection of the spots by intensity and the identification of the centroids (colored with red)

These images are representing the three steps used in thresholding: gel image input, the detection of the spots by intensity and the identification of the centroids (colored with red)

Both algorithms are found in the github repository:



RPKM and FPKM normalization units of expression


, , ,

As we learned before the analysis of count data from RNA-seq technology give us the number of sequence fragments that have been assigned to each feature. The feature could be a gene, isoform or exon.

From our sorghum data we quantified the gene expression in two different environmental conditions. Independent of the treatment, we know that there are genes that have different lengths (gene A has 600 bases and gene B has 1500 bases) with different abundance of reads. In order to compare those different genes A and B we should normalize the counts by the length of the gene.

There are two ways to normalize our count data using two different common units of expression: the Reads per kilobase of exon per million reads mapped (RPKM) or the more generic FPKM (Fragments  per kilobase of exon per million reads mapped). The RPKM has the following equation,

RPKM =  10^6 * C * 10^3 / L*M    where,

C = number of  mappable reads per each feature (in our case per gene)

L = number of the length of the feature (lenght of the gene) in kb (10^3)

M = Total number of mappable reads per sample in Millions Screenshot from 2015-03-08 22:45:51

Looking at the figure with normalized data we can conclude that from the sample 1 the gene A with a smaller length (600 B, RPKM = 3.33) has a higher expression compared to the gene C ( RPKM = 1.31). It is intuitive to think that the longer gene would probably have the more abundant expression space, but that is not a rule for expression data.  If we compare gene A and gene B from the first sample, even thought gene B is almost twice the size of gene A, both have a quite similar RPKM score(3.33, 3.64).

There are many biological process that were not taken into account when it is used the RPKM units of expression. This quantification was proposed for single-end sequencing method. However, some events of post-transcriptional modification (alternative splicing) could provide changes in relative abundance of transcripts that could not be detect when it’s used RPKM. It happens because in our count data we could have multi-reads, which are reads that may map to multiple transcripts (originated by multi-isoform genes or duplicated segments resulted by gene families). Therefore, in order to improve those weakness, the paired end sequencing technology was introduced, and the FPKM was created with a similar concept of RPKM.

Before introduce the other method FPKM, I think that we should review the difference between paired-end and single-end sequencing. The paired-end sequencing enables both ends of the DNA fragment to be sequenced (because the distance between each paired read is known). However, Single-read sequencing involves sequencing DNA from only one end. Let`s keep it in mind.

So the FPKM is calculated in the same way as RPKM, the unique difference is that instead of a read it will be counted a FRAGMENT. Normally the value of expression FPKM is used when the data was submitted to a paired-end sequencing method. But why? Because longer reads and pairs of reads from both ends of each RNA fragment can reduce uncertainty in assigning reads to alternative splice variants.

From the article that originated the method Cufflinks `Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms` we have the statement:

FPKM is conceptually analogous to the reads per kilobase per million reads sequenced (RPKM) measure, but it explicitly accommodates sequencing data with one, two, or – if needed for future sequencing platforms – higher numbers of reads from single source molecules. A fragment corresponds to a single cDNA molecule, which can be represented by a pair of reads from each end.

Therefore, FPKM is NOT 2 * RPKM if you have paired-end reads. FPKM == RPKM if you have single-end reads.

The most famous methods that were created to calculate RPKM and FPKM units of expression are rescue and Cufflinks method respectively.

It was defended in a few years by Marioni(2008) and Bullard (2010) the idea that the Poisson distribution could provide a good fit to the distribution of gene-level counts across replicate lanes, after normalization by total lane counts. DEgseq is using this concept. Maybe the explanation about why (or why not) and how could be (or could be not) a great topic for the next post!

Differential Analysis of count data (RNA-Seq)


, ,

Today we restarted the meetings at the Department of Genetics and Plant Breeding. We were discussing previously about alignment and all the theory of different ways to do it. Now we are going to learn more about the bioinformatics behind RNA.

The topic of this meeting was about differential analysis of count data from High-throughput sequencing assays, mainly using the RNA-Seq. Those technologies provide quantitative readouts in the form of count data. And these read count has been found to be (to good approximation) linearly related to the abundance of the target transcript. So why this would be important for a biological purpose? Because we could compare the differential abundance of counted reads from a specific especie submitted under different biological/climate conditions. Therefore, we could detect the candidate genes that would be differentially expressed.

Let’s say that we have plants that are exposed by 2 different conditions. Firstly by a normal environment with the best conditions of water, temperature, soil and low probability to pathogens. Secondly we would say that there is another group of plants that were submitted by an arid environment with a dry soil. We know that the phenotypic behaviour of those groups will be different, but how could we measure the genetic modification behind that? Maybe using the DNA would be a smart solution but when we use mRNA we are analysing the level and abundance of the expressed genes in that specific condition. So that is what RNA-Seq does.

Captura de tela de 2015-03-02 19:23:19

As we can see at the matrix of the figure, the row (i) represent the genes of the especie (Sorghum) (Sb0010s002010, Sb0010s003120, Sb0010s003420 and so on) and the columns (j) are the samples of the different treatments. In this case we just have  2: PEG(j) and H2O1(j). The value in the i-th row and the j-th column of the matrix tells how many reads have been mapped to gene i in sample j. In the first row of the matrix we can see that there isn’t any fragment of sequences (transcripts) associated by that gene Sb0010s002010. Therefore we could conclude that this gene has not being necessary on those 2 conditions since it was not expressed.

At the second row we can see another typical behaviour. All samples are showing a very similar number of counts (number of transcripts that were associated with that gene: Sb0010s003120). What could we conclude? Despite the gene has being expressed in those conditions, it has been EQUALLY worked in those 2 diferent conditions. We should check if they are statistically equal.

From the last row of the matrix we can see that the number of counted fragment from the first treatment in those 3 samples are almost the double of the 3 other samples of the second treatment. So maybe this gene, Sb0010s007570, could be a good candidate for differential expressed gene.

In the next posts I will write more about the package that we are going to use to run our analysis, it is called DESeq2 and can be found at the biocondutor website:

In the next meetings we are going to study a little bit more about the statistics behind the package. Our hypothesis is to know if the treatment variance is bigger than the residual variance between samples of the same condition.

The RNA-seq analysis has been used a lot in the last years specially because they are a powerful tool. Those technologies can provide comparison of transcription profiles from normal tissues versus cancer tissues, cells in high versus low nutrient environments, unstressed versus stressed cells or from distinct developmental stages of an organism. So it covers biological, medical and microbiological fields.


Continuing the Blog

Hey everyone, I am back to Brazil. And I decided that is very important to continue to share with the community my knowledge and the things that I’ve been working with. I just realized that this blog was viewed by people from the ENTIRE world and this makes me feel very grateful. I hope that somehow I am helping people to construct their science.
In Brazil I’ve been working with some projects related to sugarcane: some analysis using packages from R. I would like to share with you some tools that provides a better quality of work, one of these tools that I really appreciate is emacs. This text editor is very powerful and has a good interface. With that I can access R easily and use latex as well.
I will try to give some tips that will give you some base for you guys start to use these amazing tools.
I would like to receive a feedback from you  when it is possible.

Captura de tela de 2014-10-22 12:12:49

Thanks, obrigada 🙂

Maria Izabel

The project will continue!

Hej, I will continue to write down about the project. I would like to learn more about proteomics and the bioinformatics tools and this was one of the reasons that I decided to come to Sweden, because in Brazil I couldn’t find the courses that I took here in KTH (more in the Biotechnology Program). Furthermore other 2 Phd students (from Stockholm University) were interested in continue the project and learn more about proteomics as well. So our project will continue with additional topics. First I would like to summarize the previous project: we would like to know if the phosphosites that were unique for one of the tissues appear more frequently in alternatively spliced genes than expected by chance.

Now we would like to investigate two others phenomenons: lysine acetylation sites and lysine ubiquitination sites.

So our task is in a probabilistic model:

Pr( ? | “The gene is alternatively spliced”) > Pr( ? | “The gene is not alternatively spliced”)

We would like to investigate the statement for the cases:

Pr = (gene have phosphosite that are unique for one of the tissues ∩ alternatively spliced genes / alternatively spliced genes)  > Pr ( unique phosphosites ∩ not altern. spliced (NAS) / not alternatively spliced genes)

Pr = (gene have phosphosites (non unique) ∩ alternatively spliced genes / alternatively spliced genes) > Pr (gene have phosphosites ∩ NAS / NAS)

Pr =(gene have lysine acetylation site  ∩ alternatively spliced genes / alternatively spliced genes) > Pr (gene lysine acetylation site ∩ NAS / NAS)


Now, I will find the data necessary for the other post translational modifications. Professor Lukas gave some tips:

Acetylations are best done by Albert Hecks’s lab, or Chunaram Choudhary’s lab.

Ubiquitination sets can be downloaded from Chunaram Choudhary’s lab.

So I will try to find these data and attached into the repository acessed Actually it is a private repository.

Results and Conclusions

 In the graph “Genes expressed” we can see that from the total genes that were expressed in any tissue (5321), just 9,20% had a tissue especific splicing, and the rest of the genes expressed didn’t have tissue specific splicing.

In the second graph, called “Average number of protein isoform”, we can see that the abundance of protein isoform is not affected by the three different tissues, because they have very similar averages. Maybe if we had investigated others tissues, we could find difference. But since they have different functions in mouse metabolism, maybe the introduction of other tissues would not show much more information for the abudance of isoforms proteins. The isoforms that were found in each tissue can be found in other tissue. But we didn’t calculate for this graph the intersections. But we know that for any tissue the average of protein isoforms is: 2.872, we can conclude that they have very similar amount of protein isoforms.


for this second part, we investigated  the average of the phosphorylated sites. The presence of phosphorylated genes is much more frequent when there are tissue specific splicing forms (5.327) than without tissue specific splicing forms. Suggesting that there is a strong relation between phosphorylated genes and tissue specific splicing forms. This behaviour appeared when we compared genes with/without tissue specific phosphorylation sites.

The identification of genes with tissue specific phosphorylation sites were based on proteins expressed in at least two tissues where all have more than three peptide identifications each. This selection was done to minimize the effects of proteins being incorrectly identified as having tissue specific phospho-sites due to the small data set used.

Comparing the average of genes that were phosphorylated (2.872) with the average of the genes that was not phosphorylated(2.072) we can see that there is a highly significant difference (0.8 ich) suggesting that the presence of phosphorylated genes is more expected than the absence of the phosphorylation.

Therefore, we can conclude that genes are more often alternative spliced in different tissues.


We compared the relation of number of isoforms and the fraction of phosphorylated genes. As we can see below,  until 0,4 fraction of number of phosphorylated genes there is an increase in the number of isoforms per gene. After 6 isoforms the behaviour is no the same. Maybe because as a biological fact, is rare to appear more than 5 iso forms at the same gene at the same time.Image

This was our project. I learned a lot, and we learned it with our mistakes, with problems… But was very nice to share the knowledge here and will be good to remind the steps that we did if we want to reproduce this in another time. I think that i will continue to share the things that i will learn in Algorithmic Bioinformatics (DD2450) as well… Will be nice 🙂

Thanks, Tack, obrigada!

Finilazing the project


, , , ,

So, to do the parsimony analysis Yrin created himself code in python to acess the data and he compared the average of different parameters and then we had some conclusions.

Here below, there is the scheme that i created to summarize the steps that we did in our project. As I told you before, the MS- spectra data we acessed from (Hutlin et al. 2010). We created a decoy database just reversing our database using Proteomatic software. We do not expect to get any true matches from the “decoy” database. So, the number of matches that are found is an excellent estimate of the number of false positives that are present in the results from the real or “target” database. For run the decoy database against our database we used the program MSGF+. But we had problems with our target database, because we converted our MS-spectra database for .mzml but it was not working so we just converted again for .ms2, and then worked using the MSconvert program.
We used Percolator concomitantly with msgf2pin for improve the rate of peptide identifications from our collection of tandem mass spectra. Different of others softwares Percolator used q-value to describe the confidence of a specific identification ( identification of the peptide spectrum match (PSM) into correct and incorrect matches) and this software use semi-supervised machine learning  to discriminate between correct and decoy spectrum identifications.

And then after we ran the percolator we started to do our parsimony analyses with a q-value cutoff of 0.01. And Yrin calculated the average number of splice forms for genes that have proteins (splice forms) that are expressed in at least two tissues and have been identified as having tissue specific phospho sites. And calculated other avarages.

So the results  i will write in another post and will be attached our poster that we showed in the presentations as well ( i made the poster on LATEX (that is VEEERYYYYYYY good program when you want to use mathematical formules, quality of image, and a very beautiful words  but we had some problems in configurate the data inside the poster). But when you have time to do your project (specially thesis) it is a good way to create a document. To download latex you can click here:



Since sunday we are meeting trying to finish our project. Yrin he is spending his time trying to solve the problems of percolator and he started to do the Parsimony analysis. I’ve started to do the poster and some figures to lay out our work.

So what is Parsimony?

The parsimony is the smallest set of proteins (or protein groups) accounting for a set of confidently identified peptides. Since a given peptide sequence can map to more than one protein (as we saw on alternative splicing), a method of PARSIMONY is desired to reduce the complexity of the resulting set of proteins. The following parsimony rules and definitions are designed to reduce the number of proteins and yet still account for all of the observed spectra and isoforms.

So the parsimony classifies the peptides as:

Indeterminate For a given scan, only the top scoring hit is used. If there is more than one match that ties for the top score, then the peptide is indeterminate.
Distinct A peptide that is assigned to exactly one protein.
Shared A peptide that is assigned to more than one protein.


• “Plurality ought never be posed without necessity.”
– William of Occam

Simplify the things is better but sometimes is dangerous. So we should watch out and choose the good tools for analisy our data.

Statical concepts of FDR and Q-value for mass-spectrometry


, , ,

FDR, q-value and Posterior Error Probability (PEP) are both used to assess and score the accuracy of mass spectrometer data. Since we are dealing with real experiments, liable to suffer errors, these scoring are important tools that can complement each other on the statistical analysis and their interpretation.

False Discovery Rate (FDR) for a set of genes/proteins/peptides is the expected percent of false prediction in the set of predictions. For example: our mass spectrometry experiment is to match every observed spectrum to a peptide in the given database, and then separate the list of peptide spectrum match (PSM) into correct and incorrect matches. If we set an FDR threshold of 1%, this means that we are expected to accept that in a list of PSMs in which 99% of the matches are correct and 1% are not.

To describe the confidence of a specific identification, the q value is defined as the minimal FDR required to accept the identification, after having considered all possible thresholds. However, the q-value is an individual measurement of the false discovery rate (minimum false discovery rate (FDR)). Thus, although the q-value is associated with a single observation, it is fundamentally a rate and hence is a property of the collection of scores s1,…, si. For an example: a q -value of 0.01 for peptide ESKHTRG matching spectrum s means that, if we try all possible FDR thresholds, then 1% is the minimal FDR threshold at which the PSM of the peptide to s will appear in the output list (Käll et al. 2008).

The FDR has become a widely used measure for reliability of PSMs (MASCOT, SEQUEST, X!Tandem) since FDRs are not a function of the underlying score, the q-value metric, in the field of genomics, was applied to mass spectrometry by Käll et al (2008).

And now we are trying to finalize our analysis using the parsimony tools for investigate our data.

    • Käll, L., et al., Posterior error probabilities and false discovery rates: Two sides of the same coin, Journal of Proteome Research 7 40-44 (2008).
    • Käll L., Storey L.D., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry. WS Noble Bioinformatics 24 (16), i42-i48
    • Markus Brosch, Lu Yu, Tim Hubbard, Jyoti Choudhary (2009) Accurate and sensitive peptide identification with Mascot Percolator., 3176-3181. In Journal of proteome research.
    • Reiter, L.; Claassen, M.; Schrimpf, SP.; Jovanovic, M.; Schmidt, A.; Buhmann, JM.; Hengartner, MO.; Aebersold, R. (Nov 2009). Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol Cell Proteomics.
Some articles and important links to read: