Further Analysis

GEO2R is just one tool for analysis. There are many other awesome software tools that are also used in bioinformatics analysis.

We will demonstrate how to use the Search Tool for Retrieval of Interacting Proteins (STRING) database: https://string-db.org/

It's a free online tool that lets you enter a list of genes. It will process the list and output a network chart that shows the interactions between these genes, specifically their protein products.

(Remember that many, though not all, genes work by being translated into proteins, which then do things. Proteins are the workhorses of life.)

Example STRING network

We'll use the data from the COVID-19 dataset in a STRING analysis.

At this point, we need a list of the genes that are DIFFERENTIALLY EXPRESSED in COVID-19 patients compared to controls. Our goal with STRING is to figure out how these genes are related to each other. Maybe they are involved in similar biological pathways?

If you haven't already, you'll now need to download the full table of results from GEO2R.

It'll download as a TSV (tab-separated values) file. The next step is to import it into Google Sheets.

Here is a good tutorial: https://support.google.com/docs/answer/40608?co=GENIE.Platform%3DDesktop&hl=en

Create a new spreadsheet (empty).

Click File > Import.

Go to the Upload tab.

Simply drag and drop in the TSV file from GEO2R!

Click import with the default settings.

The data is now imported.

Column G (the ORF column) has the name of the gene.

Column H has the DNA sequence.

Column C has a P value.

Column F has a logFC (log fold change) value, a measure of how differently the gene is expressed between COVID-19 patients and healthy controls.
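If you'd rather read the TSV file with a script than a spreadsheet, here is a minimal Python sketch. It assumes the header names are ID, adj.P.Val, P.Value, t, B, logFC, ORF and SEQUENCE (check the first line of your own file, since they may differ), and it assumes logFC is a base-2 logarithm, as is usual for GEO2R output. The sample rows are made up for illustration and are not real results.

```python
import csv
import io

# Made-up rows in the ASSUMED GEO2R layout (columns A-H); not real results.
SAMPLE_TSV = (
    "ID\tadj.P.Val\tP.Value\tt\tB\tlogFC\tORF\tSEQUENCE\n"
    "1\t0.001\t0.00001\t8.1\t4.2\t2.0\tGENE_A\tACGT\n"
    "2\t0.020\t0.00100\t-5.3\t1.1\t-1.0\tGENE_B\tTTGA\n"
)


def read_geo2r_table(handle):
    """Return one dict per row with the gene name, p value, and logFC."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "gene": record["ORF"],                 # assumed header for column G
            "p_value": float(record["P.Value"]),   # assumed header for column C
            "log_fc": float(record["logFC"]),      # assumed header for column F
        })
    return rows


def fold_change(log_fc):
    """Convert a log2 fold change back to a plain ratio."""
    return 2.0 ** log_fc


# For a real file, replace io.StringIO(...) with open("your_file.tsv", newline="").
rows = read_geo2r_table(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["gene"], row["p_value"], fold_change(row["log_fc"]))
```

Under the base-2 assumption, a logFC of 2 means the gene is expressed 4 times as much in one group as in the other, and a logFC of -1 means half as much. Which group counts as "up" depends on how you defined the groups in GEO2R.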

In case you weren't able to import the data or don't have storage on your computer, here is the spreadsheet: https://docs.google.com/spreadsheets/d/1Bnh1j5NbsCujGvmyvqEgyajwHHvwtMUM_DAtX1Nf0TI/edit?usp=sharing

Now, go to this link: https://string-db.org/cgi/input?sessionId=b2nZUNKDUxE2&input_page_show_search=on

In the Google spreadsheet, select and copy the first 500 genes in the ORF column.

The rows are sorted by p value, smallest first.

Usually, scientists set a p value cutoff. If the p value is greater than a certain amount, the corresponding gene will not be considered.
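A p value cutoff amounts to a simple filter. Here is a small Python sketch of the idea. The 0.01 threshold, the function name and the rows are illustrative assumptions, not values taken from the COVID-19 dataset.

```python
# Made-up rows for illustration; real rows would come from the GEO2R table.
rows = [
    {"gene": "GENE_A", "p_value": 0.00001},
    {"gene": "GENE_B", "p_value": 0.00900},
    {"gene": "GENE_C", "p_value": 0.25000},
]


def passes_cutoff(rows, cutoff=0.01):
    """Keep only the rows whose p value is below the cutoff."""
    return [row for row in rows if row["p_value"] < cutoff]


kept = passes_cutoff(rows)
print([row["gene"] for row in kept])
```

GENE_C is dropped because its p value is above the cutoff. A stricter cutoff keeps fewer genes.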

In this example, there are over 10,000 genes that have a p value < 0.01, which is too many for STRING to handle.

For the purpose of keeping this tutorial simple, let's just select the top 500 genes by lowest p value. These genes are the most likely to be truly different in COVID-19 patients. A very small p value means a difference this large would be unlikely to appear by chance alone if the gene were expressed the same way in both groups.
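Picking the top genes by p value can also be done with a few lines of Python. This is only a sketch. The function name is hypothetical, the rows are made up, and skipping blank or repeated gene names is our own assumption about what you'd want before pasting a list into STRING.

```python
# Made-up rows for illustration; real rows would come from the GEO2R table.
rows = [
    {"gene": "GENE_B", "p_value": 0.0200},
    {"gene": "GENE_A", "p_value": 0.0001},
    {"gene": "", "p_value": 0.0005},        # a row with no gene name
    {"gene": "GENE_A", "p_value": 0.0300},  # the same gene listed twice
    {"gene": "GENE_C", "p_value": 0.0010},
]


def top_genes(rows, n=500):
    """Return up to n unique gene names, ordered by smallest p value first."""
    ordered = sorted(rows, key=lambda row: row["p_value"])
    seen = set()
    genes = []
    for row in ordered:
        gene = row["gene"].strip()
        if not gene or gene in seen:
            continue  # skip blank names and repeats
        seen.add(gene)
        genes.append(gene)
        if len(genes) == n:
            break
    return genes


selected = top_genes(rows, n=2)
# One gene per line, ready to paste into a text box.
print("\n".join(selected))
```

With n=500 and the full table, this gives the same kind of list you would get by copying the first 500 cells of the gene column, minus any blanks and duplicates.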

In STRING, paste in the list of gene names.

Make sure STRING is set to Organism: Homo sapiens (humans).

Click SEARCH and then CONTINUE on the next page.

Wait for a few minutes.
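The same kind of query can also be prepared from a script instead of the web form. The sketch below only builds a request URL and does not contact STRING. It is based on our understanding of STRING's public REST API (a "network" method that takes an "identifiers" list and a "species" taxonomy ID), so please check the STRING API documentation before relying on it. The function name is hypothetical, and the two gene names are just example inputs.

```python
from urllib.parse import urlencode

# ASSUMPTION: base address and method name as we understand STRING's REST API.
STRING_API_BASE = "https://string-db.org/api"
HUMAN_TAXON_ID = 9606  # NCBI taxonomy ID for Homo sapiens


def build_network_url(genes, species=HUMAN_TAXON_ID, output_format="tsv"):
    """Build (but do not send) a URL asking STRING for a network of the genes."""
    params = urlencode({
        # Multiple identifiers are separated by a carriage return.
        "identifiers": "\r".join(genes),
        "species": species,
    })
    return f"{STRING_API_BASE}/{output_format}/network?{params}"


url = build_network_url(["TNF", "IL6"])
print(url)
```

Opening the printed URL in a browser, or fetching it with a tool of your choice, should return the interactions as tab-separated text if the assumptions above hold.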

You will get this cool network graph. Each bubble is a gene and the protein it makes. A line between two bubbles indicates that the two proteins interact with each other.

You can click and drag the bubbles.

If you click on a bubble/gene, a card will pop up that gives you more information about the gene.

Let's look at the gene TNF. There are many lines connecting it to other genes, which suggests it may be important in COVID-19 infection.
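"Many lines" can be turned into a number by counting how many connections each gene has. Here is a small Python sketch. The gene names and connections are made up for illustration and are not taken from the real STRING network.

```python
from collections import Counter

# Made-up connections; each pair stands for one line in the network drawing.
edges = [
    ("GENE_X", "GENE_A"),
    ("GENE_X", "GENE_B"),
    ("GENE_X", "GENE_C"),
    ("GENE_A", "GENE_B"),
]


def count_connections(edges):
    """Count how many lines touch each gene."""
    degree = Counter()
    for first, second in edges:
        degree[first] += 1
        degree[second] += 1
    return degree


degree = count_connections(edges)
# List the genes from most connected to least connected.
for gene, connections in degree.most_common():
    print(gene, connections)
```

The gene at the top of the list is the most connected one in this toy network. A high count is a hint that a gene is worth a closer look, not proof that it matters.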

From the card, we learn that TNF is a cytokine that can cause the death of some cancer cells. We can also see its protein structure.

Let's see whether TNF has a role in COVID-19.

Simply Google it!

Wow, there are many research studies that associate TNF with COVID-19! Apparently, increased TNF expression in the human body is associated with WORSE COVID-19 outcomes!

As you can see, our bioinformatics analysis let us identify an important gene in COVID-19 infection!