COVID-19 Bioinformatics: Introduction

A free course by Andrew Gao

Updates

2,500 users and counting.
Certificate now available on the Next Steps Page
Spanish and Mandarin translations coming

Ever wondered how exactly scientists "do science"?

Interested in COVID-19 and its effects on your body?

If you said yes to any of the above, this course is for you!

In this course, you'll learn how to do bioinformatics analysis and do research on COVID-19 transcriptomics (gene expression) data. We'll be studying differences in gene expression between COVID-19 patients and healthy people to learn how COVID-19 affects the body.

This course will prepare you to do your own original research.

About the instructor

Hi! I'm Andrew, a high school student passionate about science education. I saw the misinformation crisis stemming from COVID-19 and wanted to use my bioinformatics know-how to educate people for free. I created this course to accomplish that mission.

Email me anytime: andrew@helyxscience.org

What will I learn?

You will learn how to find public gene expression datasets online and analyze them using the GEO2R software and numerous web-based bioinformatics tools and databases.

Finding gene expression datasets
Performing differential expression analysis
Analyzing and interpreting results

How long will this course take?

Depending on your skill with computers, it could take anywhere from one to three hours.

Do I need to know how to code?

Nope! This course is designed so that zero programming is needed.

What technical requirements are there?

You need a laptop or computer with access to the Internet. Additionally, you need Excel, Google Sheets, or another spreadsheet program. You can get Google Sheets for free with a Google Account.

What if I don't know any biology?

That's totally fine! The next page has a list of resources that you can use to learn everything you need to know for this course.

Is there a certificate available?

Yes. On the Next Steps page, you can complete an end-of-course survey, after which you will receive a certificate of completion.

Other

You can download these lessons in PDF form if you have limited Internet access.

Background Knowledge

Important information about biology and biotechnology to know.

To understand this course, you should be familiar with:

What a microarray is and what it does
What gene expression is

Essentially, microarrays are cool devices that let us measure how much each gene is "expressed" in a given sample, such as a patient's cells.

For example, let's say there's a gene called ABCD. Hypothetically, ABCD is "expressed" as a protein that leads to cancer somehow. We'd expect ABCD to be more "expressed" in people with cancer.

We can use a microarray to confirm our hypothesis by measuring how much ABCD expression there is in cancer patients vs. non-cancer people.

What exactly is gene expression? Gene expression is a measure of how active a gene is. Genes are expressed when they are transcribed into RNA, which is then translated into protein. Microarrays can measure the amount of RNA for each gene, and from that information you can infer how much the gene is expressed.

Don't understand this explanation? Check out the free recommended resources below.

Please review the following resources if you are unfamiliar with the central dogma and biotechnology. Credit to the original creators!

Overview

What to expect

How to access and find bioinformatics data for free online
How to analyze the data using GEO2R
How to interpret results
How to extend analysis through PANTHER, Gene Ontology, Gene Cards, and other bioinformatics databases
Next steps (making your own project!)

Have questions? Email me: andrew@helyxscience.org

Gene Expression Omnibus

Getting Data

How to access public gene expression data

If you prefer, you can watch the video for this lesson instead of reading the text.

The first thing you will need is data!

Scientists can upload data from their experiments to a website called the Gene Expression Omnibus. Anyone (including you) can visit the website and download the data.

In this course, we will do an example project on COVID-19. So, the first step will be to get gene expression data on COVID-19 infections.

The link above should take you to this page.

In the search bar, type in COVID-19 and click Search.

It will search the GEO database for datasets matching your search term (feel free to switch "COVID-19" with a different disease that you would like to research).
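You don't need any code for this course, but if you're curious, the same kind of search can also be written as a web address. Here is a minimal sketch that builds a search link using NCBI's E-utilities service, where "gds" is the name of the GEO DataSets database. The function name `geo_search_url` and the `max_results` setting are my own choices for illustration. The sketch only builds the link; it does not download anything.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def geo_search_url(term, max_results=20):
    """Build a search URL for the GEO DataSets database ("gds")."""
    params = {"db": "gds", "term": term, "retmax": max_results}
    return ESEARCH + "?" + urlencode(params)

# Swap "COVID-19" for any disease you would like to research
print(geo_search_url("COVID-19"))
```

Opening the printed link in a browser should return a list of matching record IDs rather than the friendly search page, so the website is still the easier route for beginners.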

On this page, there is a lot of data. You can filter data by organism (human, mice, etc.), type of study, age, and many other factors.

Feel free to browse through the datasets and read the titles.

I chose this dataset for our research:

Please visit the page.

You will see the following page. This page is the "Accession Display" and is basically a summary of the dataset and what it is about. You can see details such as the title, the day it was submitted, which lab it came from, and more.

For example, this particular dataset is fairly new. It was submitted on January 14, 2021 by scientists from Zhejiang University in China.

The important details to look at on any Accession Display page are: 1. Title 2. Summary 3. Overall Design

You should always read these carefully to make sure the dataset is what you are looking for.

In our case, this dataset contains microarray data on the transcriptome (RNA) of peripheral blood mononuclear cells from COVID-19 patients and healthy people.

This means that the scientists who did this experiment measured how much each RNA was expressed in people's blood. Their purpose was to determine the differences in gene expression between people with COVID-19 and people without COVID-19.

This research has implications for treatment, diagnosis, and for helping doctors understand the effects of COVID-19 on our bodies. For example, with diagnosis, if we wanted to design a COVID-19 test, we could use this data to identify RNA molecules that are more common in the blood of people with COVID-19 than in healthy people. Then, COVID-19 testers could use a microarray to determine whether people have high levels of those RNA molecules, and thus whether they potentially have COVID-19.

If you don't know what RNA or microarrays are, please read the "Background Knowledge" page to get up to speed.

In the next lesson, we will go over how to actually analyze the data.

Analyzing Data

The MOST IMPORTANT thing you must look out for on the Accession Display page is the blue "Analyze with GEO2R" button at the bottom of the page. You are only able to perform the data analysis described in this lesson if the blue button is there. Not every dataset has the blue button.

Now that we have our dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164805), the next step is to analyze the data.

Usually, the raw data is a huge file with potentially thousands of rows and columns. It's not easy to work with as a beginner and without using code. Luckily, the awesome "Analyze with GEO2R" button can do the analysis for us!

From the accession page, click on the blue Analyze with GEO2R button.

You should see the following page.

This page will allow you to split the samples into an experimental group and a control group (you have to tell the GEO2R program what to compare with what).

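ALWAYS define the EXPERIMENTAL/DISEASE/MUTANT group FIRST!!! This is so the system can properly define which direction counts as a positive change in expression (higher in the disease group) and which counts as a negative change (lower in the disease group).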
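To see why the order matters, here is a small optional sketch. It assumes the change in expression is calculated as the first group's average minus the second group's average, using log2 expression values. The numbers are made up for illustration and are not from the dataset.

```python
from statistics import mean

def log_fc(first_group, second_group):
    """Average log2 expression of the first group minus that of the second."""
    return mean(first_group) - mean(second_group)

# Hypothetical log2 expression values for one gene (not real data)
covid = [9.0, 9.5, 10.0, 9.5]
control = [7.0, 7.5, 8.0, 7.5]

print(log_fc(covid, control))   # 2.0: positive, higher in COVID-19
print(log_fc(control, covid))   # -2.0: same gene, groups defined in the wrong order
```

The gene is the same in both cases. Only the sign flips, which is why defining the groups in the wrong order would make every "up" gene look "down".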

First, click on the Title column bar to automatically alphabetically sort your samples. It makes it easier to group the samples by type.

This video shows the steps to define sample groups. Sorry for the low resolution.

In order to select more than one sample at once, simply click on one sample, hold the shift button, and click on another sample. All the samples in between will get selected as well.

Your screen should look like this. In this example, we are trying to study the differences in gene expression between COVID-19 patients and normal controls.

After splitting your data into 2 groups (10 COVID-19 and 5 controls), click the blue Analyze button (you may need to scroll down the screen).

It may take a few minutes but eventually GEO2R will display the results from your analysis.

Go to the next lesson to learn how to interpret these results.

Interpreting Results

GEO2R provides many graphs and a complicated-looking table. In this lesson, you'll learn what they mean.

At this point, your screen should look like this.

Let's take a look at the results.

This table shows the top 250 differentially expressed genes by p-value. This means that it ranks the genes by how confident we are that the gene's expression is different between people with COVID-19 and people without.

The P Value column is for statistical significance. You want to only consider genes that have a P value of less than 0.05. Roughly speaking, this means that if the gene were not really differentially expressed, a difference this large would show up by chance less than 5% of the time.
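One way to get a feel for what a P value measures is a "shuffle test": mix up the group labels in every possible way and count how often the shuffled groups differ at least as much as the real ones. This is NOT the method GEO2R uses. It is only a simplified sketch for intuition, and the expression values are made up.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Fraction of all relabelings whose mean difference is at least as large as observed."""
    values = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    total = 0
    # Try every way of picking which samples get the first group's label
    for idx in combinations(range(len(values)), len(group_a)):
        chosen = set(idx)
        a = [values[i] for i in idx]
        b = [v for i, v in enumerate(values) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical expression values for one gene (not real data)
covid = [9.1, 9.4, 9.8, 10.2, 9.6]
control = [7.2, 7.9, 8.1, 7.5, 7.7]
print(permutation_p_value(covid, control))  # 2 of the 252 relabelings, about 0.008
```

Here every COVID-19 value is higher than every control value, so only the real grouping and its mirror image produce a difference this large. That is why the result is so small.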

The logFC column is also important. FC stands for fold change and is a ratio of the expression of the gene in COVID-19 patients vs. controls. A greater fold change means there is a bigger difference. Note that it is LOG FC, meaning that the value in the table is the base-2 logarithm of the actual fold change. A negative logFC means the gene is less expressed in COVID-19 patients than in controls.

For example, if a given gene is 16 times more expressed in COVID-19 than healthy people, the logFC would be 4 since 2^4 = 16.
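If you'd like to check the arithmetic yourself, here is an optional sketch that converts between fold change and logFC. The function names are just my own labels.

```python
import math

def log2_fold_change(fold_change):
    """Convert a fold change (a ratio) into the logFC value shown in the table."""
    return math.log2(fold_change)

def fold_change_from_log(log_fc):
    """Convert a logFC value from the table back into the actual fold change."""
    return 2 ** log_fc

print(log2_fold_change(16))      # 4.0
print(fold_change_from_log(4))   # 16
print(fold_change_from_log(-1))  # 0.5, meaning half as much expression
```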

The SEQUENCE column gives you the sequence of the gene, and the ORF column gives you the gene name.

If you click on a row of the resulting table, a graph will appear. The graph shows the expression of that gene in each patient sample. As you can see in this example, all of the COVID-19 patients have higher expression of gene TEX101 than controls.

Click on Download full table if you want to download the whole data table for further analysis (not covered in this course). This full data table will contain the results for all the thousands of measured genes, not just the top 250 most differentially expressed ones.

Further Analysis

GEO2R is just one tool for analysis. There are many other awesome software tools that are also used in bioinformatics analysis.

We will demonstrate how to use the Search Tool for Retrieval of Interacting Proteins (STRING) Database. https://string-db.org/

It's a free online tool that lets you enter a list of genes. It will process the list and output a network chart that shows the interactions between these genes, specifically their protein products.

(Remember that many, though not all, genes work through being translated into proteins, which then do things. Proteins are the workhorses of life.)

We'll use the data from the COVID-19 dataset in a STRING analysis.

At this point, we need a list of the genes that are DIFFERENTIALLY EXPRESSED in COVID-19 patients compared to controls. Our goal with STRING is to figure out how these genes are related to each other. Maybe they are involved in similar biological pathways?

If you haven't already, you'll now need to download the full table of results from GEO2R.

It'll download as a TSV (tab-separated values) file. The next step is to import it into Google Sheets.

Here is a good tutorial:

Create a new spreadsheet (empty).

Click File > Import.

Go to the Upload tab.

Simply drag and drop in the TSV file from GEO2R!

Click import with the default settings.

Now you'll have the data imported in.

Column G has the name of the gene.

Column H has the DNA sequence.

Column C has a P value.

Column F has a log FC value, the measure of how differently the gene is expressed between COVID-19 patients and healthy controls.
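Totally optional: if you ever want to read the same file with code instead of a spreadsheet, here is a minimal sketch. The header names (`P.Value`, `logFC`, `ORF`, and so on) are my assumption of what the downloaded file calls its columns, so check them against your own file. The two sample rows are made up.

```python
import csv
import io

# Tiny stand-in for the downloaded file; header names and values are assumptions
SAMPLE_TSV = (
    "ID\tadj.P.Val\tP.Value\tt\tB\tlogFC\tORF\tSEQUENCE\n"
    "probe_1\t0.001\t0.00001\t12.3\t4.5\t3.2\tGENE_A\tACGTACGT\n"
    "probe_2\t0.020\t0.00300\t-5.1\t1.2\t-1.8\tGENE_B\tTTGACCGA\n"
)

def read_results(tsv_file):
    """Return one dict per row with the gene name, P value, and logFC."""
    rows = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        rows.append({
            "gene": row["ORF"],                # column G in the spreadsheet
            "p_value": float(row["P.Value"]),  # column C
            "log_fc": float(row["logFC"]),     # column F
        })
    return rows

results = read_results(io.StringIO(SAMPLE_TSV))
for r in results:
    print(r["gene"], r["p_value"], r["log_fc"])
```

To read a real download, you would pass an opened file instead of the sample text, for example `open("my_results.tsv", newline="")`, where the file name is whatever you saved it as.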

In case you weren't able to import the data or don't have storage on your computer, here is the spreadsheet:

Now, go to this link:

On the Google Spreadsheet, select and copy the first 500 genes in the ORF column.

The column is in order of smallest p value.

Usually, scientists set a p value cutoff. If the p value is greater than a certain amount, the corresponding gene will not be considered.

In this example, there are over 10,000 genes that have a p value < 0.01, which is too many for STRING to handle.

To keep this tutorial simple, let's just select the top 500 genes by lowest p value. These genes are the most likely to be truly different in COVID-19 patients. It is very unlikely that their differences are due to chance alone.
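The selection described above can also be sketched in code, if you're curious. This is only an illustration with made-up gene names and p values. It also assumes that the same gene can show up in more than one row and that some rows have no gene name, so it skips repeats and blanks.

```python
def top_genes(rows, n=500, p_cutoff=None):
    """Return up to n unique gene names, smallest P value first."""
    if p_cutoff is not None:
        # Optional cutoff: drop genes whose P value is too large
        rows = [r for r in rows if r[1] < p_cutoff]
    names = []
    seen = set()
    for gene, _p in sorted(rows, key=lambda r: r[1]):
        if len(names) >= n:
            break
        if gene and gene not in seen:
            seen.add(gene)
            names.append(gene)
    return names

# Hypothetical (gene name, P value) pairs, not real results
rows = [("GENE_C", 0.04), ("GENE_A", 0.0001), ("GENE_B", 0.002),
        ("GENE_A", 0.03), ("", 0.001)]

# One gene per line, ready to paste into a gene list box
print("\n".join(top_genes(rows, n=2)))
```

Calling `top_genes(rows, p_cutoff=0.01)` instead would keep only the genes below that cutoff, which is the other approach mentioned above.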

In STRING, paste in the list of gene names.

Make sure STRING is set to Organism: Homo sapiens (humans).

Click SEARCH and then CONTINUE on the next page.

Wait for a few minutes.

You will get this cool network graph. Each bubble is a gene and the protein it makes. A line between two bubbles indicates that the two genes interact with each other.

You can click and drag the bubbles.

If you click on a bubble/gene, a card will pop up that tells you more information about the gene.

Let's look at gene TNF. There are many lines connecting it to other genes, which suggests it may be important in COVID-19 infection.
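"Many lines" is something you can count. Here is an optional sketch that tallies how many connections each gene has, given a list of gene pairs. The pairs below are made up for illustration and are not real STRING output.

```python
from collections import Counter

def connection_counts(edges):
    """Count how many lines touch each gene in a list of (gene, gene) pairs."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts

# Hypothetical interactions (not real STRING results)
edges = [("TNF", "GENE_A"), ("TNF", "GENE_B"),
         ("TNF", "GENE_C"), ("GENE_A", "GENE_B")]

# Most connected genes first
for gene, n in connection_counts(edges).most_common():
    print(gene, n)
```

A gene with many connections is a good candidate to look into further, but a high count alone does not prove the gene matters. It is a starting point for reading more.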

We now know that TNF is a cytokine that can cause cell death of some cancer cells. We can also see its protein structure.

Let's see whether TNF has a role in COVID-19.

Simply Google it!

Wow, there are many research studies that associate TNF with COVID-19! Apparently, increases in TNF expression in the human body are associated with WORSE COVID-19 outcomes!

As you can see, our bioinformatics analysis let us identify an important gene in COVID-19 infection!

Next Steps

Where to go from here

Instead of comparing COVID-19 to controls, what about comparing severe COVID-19 to mild COVID-19? You could find genes that are linked to worse outcomes. This could help doctors with assigning treatments. The dataset we used in this course has data for severe and mild COVID-19, if you remember.
Find your own dataset! Gene Expression Omnibus has thousands of public gene expression datasets. You can find your own on virtually any disease you want, from Alzheimer's to pancreatic cancer to diabetes.
Check out these papers I wrote for examples/inspiration. (bottom of page)
Certificate: To get your certificate, complete this feedback form:

Remember, teens can do research too!

Bonus: mRNA Vaccines

Curious about how the Pfizer and Moderna mRNA vaccines work? Here's a brief introduction.

Your immune system cells learn to attack unwanted viruses by recognizing the proteins on the surface of viruses.

However, it can be hard for your immune system to learn the viral proteins quickly.

An mRNA vaccine contains pieces of mRNA. These mRNA molecules have the instructions for making a protein found on the surface of the COVID-19-causing virus, SARS-CoV-2. It's important to note that the mRNA ONLY has instructions for making the specific protein, not the entire virus itself.

Your body's cells will read the mRNA and produce the viral surface protein. Then, your immune cells will be able to safely learn to recognize SARS-CoV-2 without actually being in danger.
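To picture how a cell "reads" mRNA, here is a toy sketch. Cells read mRNA three letters at a time, and each three-letter "codon" stands for one amino acid, the building block of proteins. The codon table below is only a small part of the real one, and the mRNA sequence is made up for illustration. It is not a real vaccine sequence.

```python
# Partial codon table: only the codons used in this toy example
CODON_TABLE = {
    "AUG": "M",   # methionine, also the "start" signal
    "UUU": "F",   # phenylalanine
    "GUG": "V",   # valine
    "CUU": "L",   # leucine
    "UAA": None,  # "stop" signal: the protein ends here
}

def translate(mrna):
    """Read the mRNA three letters at a time and build the protein."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # reached a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical mRNA sequence
print(translate("AUGUUUGUGCUUUAA"))  # MFVL
```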

Next time SARS-CoV-2 tries to enter your body, your immune cells will already be able to detect it, through its surface proteins.

I have provided an extreme simplification. If you'd like to learn more details (from more credible sources), check out these links!