Introduction

Artificial intelligence (AI) is transforming healthcare by enabling advanced analysis of complex biomedical data and supporting clinical decision-making. Traditional statistical methods often struggle with high-dimensional datasets, while machine learning can identify patterns directly from large amounts of data. According to Abtahi and Astaraki (2026), AI is particularly valuable in healthcare when analysing large datasets where traditional methods are limited. One important application of AI is cancer classification using gene expression data. These datasets measure the activity of thousands of genes simultaneously, helping researchers identify molecular patterns linked to specific cancer types. However, the high dimensionality of such data makes manual analysis difficult. This report conceptualises an AI system designed to classify cancer types using gene expression data. The dataset, obtained from Kaggle, contains 802 samples with gene expression values for over 20,000 genes, representing five tumour types: BRCA, KIRC, COAD, LUAD, and PRAD (Mohapatra, 2022). The report outlines the conceptual design of the AI system, including task identification, AI justification, data utilisation, methodology, data preparation, and evaluation methods.

Task Identification

The core task of this project is the multiclass classification of cancer types using gene expression profiles. The system predicts the tumour category of a biological sample based on thousands of gene expression measurements. Traditionally, cancer classification using genomic data relies on manual analysis, statistical modelling, and laboratory testing, which can be inefficient due to the high number of variables involved. An AI-based system can automate this process by learning patterns from historical datasets, enabling faster and more consistent classification of new samples. This approach can support biomedical research and potentially assist clinicians in identifying tumour subtypes.

AI Justification

Applying AI to genomic data analysis offers several advantages over traditional methods. Machine learning algorithms can identify complex relationships among thousands of gene expression variables simultaneously, which is difficult for conventional statistical approaches due to the high dimensionality of genomic datasets. AI also enables scalable analysis, allowing large genomic datasets to be processed efficiently. In addition, machine learning models can improve diagnostic accuracy by detecting molecular patterns that may not be visible through manual analysis. AI has already demonstrated strong performance in biomedical applications such as medical imaging, disease prediction, and genomic analysis (Esteva et al., 2019). Furthermore, AI systems can continuously improve as more data becomes available, supporting the development of personalised medicine where treatments are tailored to individual genetic profiles.

Data Utilisation

Dataset Description

The dataset used in this study was obtained from Kaggle (Mohapatra, 2022) and consists of two primary files: data.csv and labels.csv. The data.csv file contains gene expression measurements for each patient sample, with approximately 20,000 gene features per sample across 802 samples. These gene expression values represent the activity levels of individual genes and provide detailed molecular information about each tumour sample. The labels.csv file contains the corresponding tumour classification labels for each sample, identifying five cancer types: BRCA (Breast cancer), KIRC (Kidney renal clear cell carcinoma), COAD (Colon adenocarcinoma), LUAD (Lung adenocarcinoma), and PRAD (Prostate adenocarcinoma). By combining these two files, each patient sample can be associated with thousands of gene expression values along with its corresponding cancer type label, forming a dataset suitable for machine learning analysis and cancer type classification.

Data Preparation and Exploratory Analysis

The dataset was first examined for missing values and inconsistencies; no major missing values were identified, so it could be used directly for modelling. Because gene expression values vary considerably across genes, feature scaling was applied using the StandardScaler method, which standardises each feature to zero mean and unit variance. Given the large number of gene features, dimensionality reduction techniques were applied to improve computational efficiency and enable effective data visualisation. As part of the exploratory analysis, Principal Component Analysis (PCA) was used to project the data onto a small number of components that capture as much of the variance in the original gene expression data as possible.
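The missing-value check described above could look roughly like the following. This is a minimal sketch on a small made-up table, not the real dataset; the name `toy` and the `gene_i` column names are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged expression table (6 samples x 4 genes).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(6, 4)),
                   columns=[f"gene_{i}" for i in range(4)])
toy.iloc[2, 1] = np.nan  # inject one missing value so the check has something to find

missing_per_gene = toy.isna().sum()           # number of NaNs in each column
total_missing = int(missing_per_gene.sum())   # overall count across the table

# Columns holding a single repeated value carry no information for a classifier
constant_genes = toy.columns[toy.nunique() <= 1].tolist()

print(total_missing, constant_genes)
```

On the real data the same three lines would be applied to the gene columns of the merged table; a total of zero would support using the data without imputation.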

Figure 1. PCA visualisation of gene expression samples.

The PCA plot demonstrates that some cancer types form distinguishable clusters in reduced feature space, suggesting that gene expression patterns contain useful information for classification.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE was used to visualise high-dimensional gene expression data in two dimensions. Unlike PCA, t-SNE focuses on preserving local relationships between samples.

Figure 2. t-SNE visualisation of tumour clusters.

The t-SNE representation reveals clearer separation between tumour categories compared to PCA, highlighting potential biological differences between cancer types.

Gene Expression Heatmap

A clustered heatmap was generated to visualise gene expression patterns across samples. Hierarchical clustering was applied to group samples based on similarity in gene expression profiles. The heatmap in Figure 3 illustrates how samples with similar gene expression patterns cluster together, which may correspond to specific cancer types.

Figure 3. Hierarchical clustering heatmap of gene expression data.

K-Means Clustering

K-means clustering was applied to explore whether tumour samples naturally group into clusters based on gene expression data. The clustering analysis provides insight into potential relationships between samples and supports the hypothesis that tumour types can be distinguished using gene expression patterns.

Figure 4. K-means clustering results visualised in reduced feature space.

AI Methodology, Evaluation and Limitations

Due to the high dimensionality of genomic datasets, the proposed AI system combines dimensionality reduction and machine learning techniques to analyse gene expression data effectively. Methods such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-Distributed Stochastic Neighbor Embedding (t-SNE) are used for dimensionality reduction and visualisation, while K-means clustering and supervised classification models support pattern detection and tumour type prediction. Model performance can be evaluated using metrics such as accuracy, precision, recall, and the F1 score, with a confusion matrix providing detailed classification results. ANOVA testing can also assess whether gene expression differences between tumour groups are statistically significant. Model reliability is assessed using train–test splitting and cross-validation, along with visual analysis through clustering and scatter plots. However, the dataset contains many gene features but relatively few samples, which may increase the risk of overfitting, and gene expression data may include biological variability and noise. Therefore, AI systems should be used as decision-support tools rather than replacements for clinicians, ensuring that human expertise remains central to healthcare decision-making (Abtahi & Astaraki, 2026).
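The evaluation steps listed above (train-test split, cross-validation, per-class metrics and a confusion matrix) can be sketched as follows. This is an illustration under stated assumptions rather than the report's actual model: the data are synthetic (`make_classification`), and logistic regression with 20 principal components is an arbitrary choice of classifier and settings. Scaling and PCA sit inside the pipeline so that they are refitted on each training fold only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 samples, 500 features, 5 classes (not the real data).
X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# Hold out 20% of samples, keeping class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing inside the pipeline avoids leaking test-fold statistics
model = make_pipeline(StandardScaler(),
                      PCA(n_components=20),
                      LogisticRegression(max_iter=1000))

# 5-fold stratified cross-validation on the training part, scored by macro F1
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")

# Final fit and held-out evaluation
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)            # rows: true class, columns: predicted
report = classification_report(y_test, y_pred)   # precision, recall, F1 per class

print(cv_scores.mean())
print(cm)
print(report)
```

With far more features than samples, as in the gene expression data, a large gap between the cross-validated score and the training score would be one sign of the overfitting risk mentioned above.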

Conclusion

This report conceptualised an AI system designed to classify cancer types using gene expression data. The proposed system integrates dimensionality reduction techniques, clustering algorithms, and supervised machine learning models to analyse high-dimensional genomic datasets. The exploratory analysis demonstrates that gene expression patterns contain meaningful molecular signals capable of distinguishing tumour types. AI models therefore offer strong potential for supporting biomedical research and improving diagnostic workflows. As healthcare datasets continue to grow in complexity, AI-based systems will play an increasingly important role in enabling precision medicine and personalised treatment strategies.

References

Abtahi, F., & Astaraki, M. (2026). AI in healthcare: Foundations and technical methods (Vol. 1: Foundations). Fivation AB.

Esteva, A., Robicquet, A., Ramsundar, B., et al. (2019). A guide to deep learning in healthcare. Nature Medicine, 25(1), 24–29.

Mohapatra, S. (2022). Healthcare machine learning notebook. Kaggle. https://www.kaggle.com/code/shibumohapatra/healthcare

Appendix

Appendix A: Data Processing and Analysis Code

The exploratory data analysis, data preprocessing, dimensionality reduction, and clustering procedures were implemented in Python using libraries including NumPy, Pandas, SciPy, Matplotlib, Seaborn, and Scikit-learn. The dataset files (data.csv and labels.csv) were first loaded and merged into a single dataset to enable comprehensive analysis of gene expression values across tumour samples. The complete implementation of the analysis, including the Jupyter Notebook and dataset processing scripts, is publicly available on GitHub.

GitHub Repository: https://github.com/brunorsreis/cancer-gene-expression-classification

A1. Dataset Loading and Merging

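import numpy as np
import pandas as pd

# Expression matrix (one row per sample) and the matching class labels
data = pd.read_csv('data.csv')
label = pd.read_csv('labels.csv')

# Join on the shared sample-ID column ('Unnamed: 0', the only column the two files have in common)
df = pd.merge(label, data, on='Unnamed: 0')
df.head()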

The merged dataset contains gene expression values for approximately 20,000 genes across roughly 800 tumour samples (df.shape gives the exact count), along with their corresponding cancer type labels.

A2. Hierarchical Clustering Heatmap

A hierarchical clustered heatmap was generated to visualise patterns in gene expression across cancer types.

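import seaborn as sns
import matplotlib.pyplot as plt

# Mean expression of every gene within each cancer type (one row per class)
heatmap_data = pd.pivot_table(
    df,
    index='Class',
    values=df.select_dtypes(include='number').columns,
    aggfunc='mean'
)

# clustermap builds its own figure, so save through the returned grid object
grid = sns.clustermap(heatmap_data, figsize=(18, 12))
grid.savefig('clustered_heatmap.jpg', dpi=150)

Because the pivot table averages expression within each class, this heatmap clusters the five mean cancer-type profiles (and the genes) by similarity, highlighting potential molecular patterns associated with specific tumour types.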

A3. Standardisation of Gene Expression Data

Because gene expression features have different scales, standardisation was applied using StandardScaler.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(['Class','Unnamed: 0'], axis=1))

Standardisation ensures that each feature contributes equally to the analysis.
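A quick way to confirm the effect of standardisation is to inspect the column means and standard deviations afterwards. The sketch below uses three made-up features on very different scales; `X_toy` is a placeholder, not the gene expression matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three hypothetical features with very different centres and spreads
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[0.0, 5.0, 100.0], scale=[1.0, 10.0, 0.1], size=(50, 3))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

col_means = X_toy_scaled.mean(axis=0)  # each close to 0 after scaling
col_stds = X_toy_scaled.std(axis=0)    # each close to 1 after scaling

print(np.round(col_means, 6), np.round(col_stds, 6))
```

Without this step, a feature such as the third one (centred at 100) would dominate distance-based methods like PCA and K-means simply because of its magnitude.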

A4. Principal Component Analysis (PCA)

PCA was used to project the standardised data onto the two directions of greatest variance.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

The resulting PCA components allow visualisation of tumour samples in two-dimensional space.
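How much variance two components actually retain can be read from the fitted PCA object. The sketch below is run on random placeholder data (`X_toy`), so the numbers it prints say nothing about the gene expression dataset; it only shows which attributes to inspect. Passing a fraction as `n_components` asks scikit-learn to keep as many components as are needed to reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 50))  # hypothetical stand-in for the expression matrix
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Share of total variance captured by the first two components
pca_2d = PCA(n_components=2).fit(X_toy_scaled)
var_2d = pca_2d.explained_variance_ratio_.sum()

# Number of components needed to retain at least 90% of the variance
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_toy_scaled)
n_needed = pca_90.n_components_

print(f"2 components keep {var_2d:.1%}; {n_needed} components reach 90%")
```

If the two-component share turns out to be small on the real data, the 2D plot remains useful for visualisation, while a classifier may benefit from a larger number of components.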

A5. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE was used to visualise local relationships between samples in high-dimensional space.

from sklearn.manifold import TSNE

# Two-dimensional embedding (the TSNE default), fitted here on the raw, unscaled expression values
tsne = TSNE(learning_rate=50)
tsne_features = tsne.fit_transform(df.drop(['Class','Unnamed: 0'], axis=1))

This method helps identify clustering patterns between cancer types.

A6. Linear Discriminant Analysis (LDA)

LDA was used as a supervised dimensionality reduction technique.

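from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# X and y are assumed here to be the scaled features from A3 and the class labels
X = X_scaled
y = df['Class']

lda = LDA(n_components=2)
X_lda = lda.fit_transform(X, y)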

LDA maximises separation between tumour classes by projecting the data into a lower-dimensional feature space.

A7. K-Means Clustering

K-means clustering was used to identify natural groupings of tumour samples.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=5)
kmeans.fit(X_pca)
clusters = kmeans.labels_

This analysis explores whether samples naturally group according to their cancer type.
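Whether the clusters line up with the known cancer types can be checked with a contingency table and the adjusted Rand index. The sketch below uses synthetic blobs in place of the PCA-reduced samples; `X_toy` and `y_toy` are placeholders, and the choice of the adjusted Rand index as the agreement measure is an assumption, not something the report specifies.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic 2D points in 5 groups, standing in for PCA-reduced tumour samples
X_toy, y_toy = make_blobs(n_samples=250, centers=5, n_features=2,
                          cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_toy)

# Rows: known group, columns: assigned cluster (cluster numbers are arbitrary)
table = pd.crosstab(pd.Series(y_toy, name="true"),
                    pd.Series(km.labels_, name="cluster"))

# 1.0 means perfect agreement; values near 0 mean agreement no better than chance
ari = adjusted_rand_score(y_toy, km.labels_)

print(table)
print(round(ari, 3))
```

Because K-means never sees the labels, a high score on the real data would support the claim that tumour types are separable from expression patterns alone.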

A8. Statistical Testing (ANOVA / F-test)

ANOVA testing was used to evaluate whether gene expression differences between tumour types were statistically significant.

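import scipy.stats as stats

# Expression values of one gene split by tumour type (first numeric column, chosen for illustration)
gene = df.select_dtypes(include='number').columns[0]
groups = [g[gene].values for _, g in df.groupby('Class')]
F, p = stats.f_oneway(*groups)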

Genes with p-values < 0.05 were considered significantly different across tumour groups.
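To apply the same test to every gene at once, scikit-learn's `f_classif` computes a one-way ANOVA F-statistic and p-value per column. The sketch below uses made-up data in which only the first "gene" truly differs between groups; `X_toy` and `y_toy` are placeholders, and using `f_classif` here is a suggestion rather than the method used in the report.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = np.repeat(np.arange(5), 20)   # 5 hypothetical tumour groups, 20 samples each
X_toy = rng.normal(size=(100, 30))    # 30 hypothetical genes with no group effect
X_toy[:, 0] += y_toy                  # gene 0 is shifted by group, so it should differ

# One-way ANOVA F-statistic and p-value for every column in a single call
F_all, p_all = f_classif(X_toy, y_toy)
significant = np.flatnonzero(p_all < 0.05)

# Cross-check gene 0 against scipy's f_oneway on the same groups
F0, p0 = stats.f_oneway(*[X_toy[y_toy == k, 0] for k in range(5)])

print(significant, F_all[0], F0)
```

With around 20,000 genes, an uncorrected 0.05 threshold would be expected to flag on the order of 1,000 genes by chance alone, so a multiple-testing correction may be worth considering when interpreting the list.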

brunors

*The views expressed here are my own and do not represent those of my employer.*

Hello, I’m Bruno — a dual citizen of Brazil and Sweden. I bring a global perspective shaped by experiences in both South America and Europe, with a strong focus on collaboration and innovation across cultures. I am a Computer Scientist, PhD Candidate in Information and Communication Technologies, focusing on Data Science and Artificial Intelligence, and hold dual Master’s degrees in Data Science and Cybersecurity. With over fifteen years of international experience spanning Brazil, Hungary, and Sweden, I have collaborated with global organizations such as IBM, Playtech, and Oracle, as well as contributed remotely to projects across multiple regions. My professional interests include Databases, Cybersecurity, Cloud Computing, Data Science, Data Engineering, Big Data, Artificial Intelligence, Programming, and Software Engineering, all driven by a deep passion for transforming data into strategic business value.

Conceptualising an AI System for Cancer Type Classification Using Gene Expression Data
