Guide: Creating a Correlation Matrix on Gene Expression

Gene expression is a complex topic, but fortunately, we have Julius to help guide us! In today’s guide, we will use real-world data from NCBI GEO to highlight differences in gene expression between normal breast tissue and cancerous breast tissue. The original dataset can be found here. For tutorial purposes, I have modified the dataset to make it easier to work with as an introduction to gene expression. You can access the modified dataset here if you would like to follow along with the tutorial.

The genes we are examining in this dataset are associated with the development of breast tumors. This is not an exhaustive list, as new findings are documented every day as research progresses, but it should give you an idea of what these genes do in relation to breast cancer development:

1. ERBB2 (HER2): Alterations in this gene (most often amplification or overexpression) activate the PI3K/Akt signaling pathway, which promotes cell growth and survival.

2. BRCA1 and BRCA2: Mutations in these genes are linked with breast and ovarian cancer. Both genes are involved in DNA repair mechanisms.

3. TP53: Often referred to as the “guardian of the genome” as it is responsible for genomic stability. Mutations often lead to loss of function and increased tumor development and progression.

4. ESR1 (Estrogen Receptor 1): This gene encodes the estrogen receptor, important in the development and progression of breast cancer. This gene is the target for hormone therapy in individuals with breast cancer.

5. PIK3CA: Mutations in this gene are common in breast cancer and affect the PI3K/Akt signaling pathway.

6. PTEN: This tumor suppressor gene becomes inactivated in many cancers, including breast cancer.

7. AKT1: Plays a role in the PI3K/Akt signaling pathway and can be involved in breast cancer when mutated.

8. MUC1: Often overexpressed in many breast cancers, it encodes a protein involved in cell signaling and protection.

9. CDH1 (E-Cadherin): Involved in cell adhesion, often lost in more invasive forms of breast cancer.

10. MKI67 (Ki-67): Associated with cell proliferation; used as a marker to determine the growth rate of tumors.

Now that we have a generalized understanding of genes involved in breast cancer development, let’s take a look at how to create a correlation matrix with this dataset to visualize gene expression in normal versus cancerous breast tissue.

Step 1: Import the dataset

Our first step is to bring in the dataset.

Prompt: please load the dataset from Googlesheets

Julius has successfully loaded in our dataset.
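If you would like to reproduce this step outside of Julius, here is a minimal sketch in Python with pandas. It assumes the sheet is shared by link and exported as CSV; the sheet ID, the helper name sheet_csv_url, and the column names in the offline sample are placeholders of mine, not values from the tutorial dataset.

```python
import io

import pandas as pd


def sheet_csv_url(sheet_id: str, gid: int = 0) -> str:
    """Build the CSV export URL for a link-shared Google Sheet."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


# Hypothetical sheet ID: replace it with the ID from the shared dataset link.
url = sheet_csv_url("YOUR_SHEET_ID")
# df = pd.read_csv(url)  # needs network access and a link-shared sheet

# Offline stand-in with made-up values so the sketch runs anywhere.
sample_csv = io.StringIO(
    "sample,type,metastasis,ERBB2,TP53\n"
    "S1,normal breast tissue,no,5.1,6.2\n"
    "S2,breast tumor,yes,9.4,4.8\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)
```

Once the real sheet ID is filled in, the commented-out pd.read_csv(url) line would replace the offline sample.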

Step 2: Modifying Dataset

Our next step is to drop the metastasis column and split the dataset into two separate data frames.

Prompt: please drop the metastasis column and then split this dataset by ‘breast tumor’ and ‘normal breast tissue’.

Julius has successfully removed the metastasis column and split the dataset into two data frames. We can now move on to descriptive statistics.
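For readers following along in code, the drop-and-split step might look roughly like this in pandas. The column name "type" and all expression values are assumptions for illustration; check the actual column headers in your copy of the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "type": ["breast tumor", "normal breast tissue",
             "breast tumor", "normal breast tissue"],
    "metastasis": ["yes", "no", "no", "no"],
    "ERBB2": [9.4, 5.1, 8.7, 5.3],
    "TP53": [4.8, 6.2, 5.0, 6.0],
})

# Drop the column we do not need for the correlation analysis.
df = df.drop(columns=["metastasis"])

# Split by tissue label, keeping only the gene columns in each frame.
tumor = (df[df["type"] == "breast tumor"]
         .drop(columns=["type"])
         .reset_index(drop=True))
normal = (df[df["type"] == "normal breast tissue"]
          .drop(columns=["type"])
          .reset_index(drop=True))

print(len(tumor), len(normal))
```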

Step 3: Run Descriptive Statistics
Let’s prompt Julius to run descriptive statistics separately on each gene in each dataset.

Prompt: please perform descriptive statistics on each dataset on each gene.

Looking at the descriptive statistics, we can get a rough idea of how gene expression changes between normal and cancerous breast tissue by comparing the mean of each gene across the two datasets. This should give us some indication of the patterns we may see as we continue our analysis.
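The same comparison can be sketched in pandas. The two small data frames below hold invented numbers, not values from the GEO dataset; describe() gives the per-gene summary, and the last table lines up the means so the shift between tissues is easy to read.

```python
import pandas as pd

# Invented expression values for two genes in each tissue type.
normal = pd.DataFrame({"ERBB2": [5.0, 5.2, 5.4], "TP53": [6.0, 6.2, 6.4]})
tumor = pd.DataFrame({"ERBB2": [8.0, 9.0, 10.0], "TP53": [4.0, 5.0, 6.0]})

# Count, mean, standard deviation, min, quartiles, and max for each gene.
normal_stats = normal.describe()
tumor_stats = tumor.describe()

# Side-by-side means: a positive difference means higher expression in tumor.
mean_comparison = pd.DataFrame({
    "normal_mean": normal.mean(),
    "tumor_mean": tumor.mean(),
})
mean_comparison["difference"] = (
    mean_comparison["tumor_mean"] - mean_comparison["normal_mean"]
)
print(mean_comparison)
```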

Step 4: Checking normality
To determine which correlation analysis we are going to use for the correlation matrix, we can run a normality test on both datasets to see if they pass or fail. If they pass, we can use Pearson’s correlation; but if they fail, we will use Spearman’s rank correlation.

Prompt: please create separate histograms for each gene in each dataset. Then run a normality test on each gene distribution.



Based on the histograms and the normality tests, we can conclude that our datasets do not follow a normal distribution. Therefore, we will use Spearman’s rank correlation to create this correlation matrix.
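The guide does not say which normality test Julius ran, so the sketch below assumes the Shapiro-Wilk test from SciPy. The data are simulated (one clearly skewed gene, one roughly bell-shaped), and the helper name normality_table and the 0.05 cut-off are illustrative choices of mine.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data: ERBB2 is strongly right-skewed, TP53 is roughly normal.
rng = np.random.default_rng(0)
tumor = pd.DataFrame({
    "ERBB2": rng.lognormal(mean=2.0, sigma=1.0, size=200),
    "TP53": rng.normal(loc=6.0, scale=1.0, size=200),
})


def normality_table(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Run Shapiro-Wilk on every column and flag p-values above alpha."""
    rows = []
    for gene in df.columns:
        w_stat, p_value = stats.shapiro(df[gene].dropna())
        rows.append({
            "gene": gene,
            "W": w_stat,
            "p_value": p_value,
            "looks_normal": p_value > alpha,
        })
    return pd.DataFrame(rows).set_index("gene")


result = normality_table(tumor)

# Use Pearson only if every gene passes; otherwise fall back to Spearman.
method = "pearson" if result["looks_normal"].all() else "spearman"
print(result)
print(method)
```

Histograms for the same frame can be drawn with tumor.hist() if matplotlib is installed.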

The other assumptions that Spearman’s rank correlation requires are:

  1. Monotonic relationship: as one variable increases, the other tends to consistently increase or consistently decrease → dataset passes
  2. Large sample size: in total we have 572 entries between the two datasets (Normal N = 242; Cancer N = 330) → dataset passes
  3. No extreme outliers: some outliers but nothing too extreme (step not shown here) → dataset passes
  4. Independent observations: assume independence based on methodology → dataset passes
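The outlier check is not shown in this guide, so the following is only one possible way to do it: count values that fall more than 3 × IQR beyond the quartiles in each gene column. The function name count_extreme_outliers, the multiplier of 3, and the demo values are all assumptions of mine.

```python
import pandas as pd


def count_extreme_outliers(df: pd.DataFrame, k: float = 3.0) -> pd.Series:
    """Count values per column lying more than k * IQR outside the quartiles."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()


# Invented values: ERBB2 contains one obviously extreme reading.
demo = pd.DataFrame({
    "ERBB2": [5.0, 5.1, 5.2, 5.3, 5.4, 50.0],
    "TP53": [6.0, 6.1, 6.2, 6.3, 6.4, 6.5],
})
counts = count_extreme_outliers(demo)
print(counts)
```

Because Spearman's correlation works on ranks, a handful of outliers usually matters less than it would for Pearson's, but a count like this makes the "nothing too extreme" judgment easier to justify.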

Let’s continue to the final step!

Step 5: Creating Correlation Matrix
We can now ask Julius to create a correlation matrix using Spearman’s correlation.

Prompt: calculate Spearmans rank correlation for each gene and create a correlation matrix.

Julius has successfully created a correlation matrix with visually appealing colours. From a quick inspection, we can note that in normal breast tissue there are stronger positive correlations between genes (more red). A positive correlation indicates that as the expression of one gene increases, the expression of the other tends to increase as well. In cancerous breast tissue, by contrast, we see more negatively correlated gene pairs, or weaker positive relationships between pairs of genes (lots of cool blue colours).

The side-by-side comparison highlights how gene expression can change in cancerous tissue versus normal tissue.
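For reference, a Spearman correlation matrix can be computed in pandas with DataFrame.corr(method="spearman"). The sketch below uses two tiny invented data frames in which the ERBB2/TP53 relationship flips sign between tissues, purely to show what the matrices and their difference look like; it does not reproduce the results above.

```python
import pandas as pd

# Invented values: TP53 rises with ERBB2 in "normal" and falls in "tumor".
normal = pd.DataFrame({
    "ERBB2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TP53": [2.0, 4.0, 5.0, 9.0, 20.0],
})
tumor = pd.DataFrame({
    "ERBB2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TP53": [20.0, 9.0, 5.0, 4.0, 2.0],
})

# Gene-by-gene Spearman rank correlation for each tissue type.
normal_corr = normal.corr(method="spearman")
tumor_corr = tumor.corr(method="spearman")

# Negative entries mark pairs that are less positively correlated in tumor.
delta = tumor_corr - normal_corr

print(normal_corr)
print(tumor_corr)
print(delta)
```

Either matrix can then be passed to a plotting function such as seaborn's heatmap with a diverging colour map to get the red and blue picture described above.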

Keywords: GPT, AI Correlation Matrix, AI Statistical Analysis, AI Stats, Correlation matrix, Gene, Gene relationships, Co-expression, Spearman’s rank correlation.