A great deal of cancer research is under way and, thankfully, considerable progress has been made in understanding the aetiology of this complex disease. “Driver” genes are known to play a major role in causing cancer. Many of these genes have been identified, but many more remain undiscovered. So far, the driver genes that have been identified are largely those that harbour a high rate of mutations. But what about the numerous driver genes that haven’t been identified?
This study, conducted by Ms. Malvika Sudhakar and Dr. Karthik Raman from the Bhupat Jyoti Mehta School of Biosciences, IIT Madras, and Prof. Raghunathan Rengaswamy from the Department of Chemical Engineering, IIT Madras, introduces cTaG, a new model for predicting genes important for cancer progression.
The name cTaG reminds one of the four bases that are the building blocks of DNA, and is expanded as “classify TSGs and OGs”, where TSGs refer to Tumour Suppressor Genes, and OGs refer to Oncogenes. Tumour Suppressor Genes help to protect and defend the cell from cancer. When such a gene loses its function due to mutations, a selective growth advantage is conferred upon the cell. Proto-oncogenes, in contrast, undergo gain-of-function mutations to become oncogenes.
Although many TSGs and OGs have been discovered for different cancer types, most of them are highly potent and recur across different patients. A key aim of this study is to find rarer, low-frequency driver genes by classifying them into TSGs and OGs.
There are two classes of methods to identify driver genes based on mutational data:
1. The first class of methods relies on the rate of mutations in genes across a set of patients to identify driver genes. Here the background mutation rate is estimated, and genes with a significantly higher mutation rate are identified as driver genes. Although the genes identified this way are mostly true driver genes, the mutation rate alone is not sufficient to identify all driver genes.
2. The second class of methods uses a ratio-metric approach. Here, not only are the repeated occurrences of mutations taken into consideration but, more importantly, the functional impact of the mutations is considered as well.
While these methods capture some mutation patterns observed across samples, their low recall shows that our understanding of the characteristics that define TSGs and OGs is far from complete.
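The two classes of methods described above can be illustrated with a toy sketch in Python. It is not the code of any published tool: the first function uses a simple one-sided binomial test against a single, fixed background rate (real tools estimate the background far more carefully and correct for multiple testing), and the second simply turns counts of mutation types into fractions. The function names, the mutation-type labels, and all the numbers are hypothetical and chosen only for illustration.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exceeds_background(mutated: int, samples: int, background_rate: float,
                       alpha: float = 0.05) -> bool:
    """Class 1, toy version: is the gene mutated in more samples than a
    fixed background rate would explain? (Assumption: one shared background
    rate and no multiple-testing correction.)"""
    return binomial_tail(mutated, samples, background_rate) < alpha


def mutation_type_ratios(counts: dict[str, int]) -> dict[str, float]:
    """Class 2, toy version: the fraction of a gene's mutations that fall
    into each mutation type, a simple stand-in for ratio-metric features."""
    total = sum(counts.values())
    if total == 0:
        return {kind: 0.0 for kind in counts}
    return {kind: n / total for kind, n in counts.items()}


# Hypothetical numbers, not taken from the paper.
frequent = exceeds_background(mutated=12, samples=100, background_rate=0.02)
rare = exceeds_background(mutated=3, samples=100, background_rate=0.02)
ratios = mutation_type_ratios({"missense": 6, "nonsense": 3, "silent": 1})
print(frequent, rare, ratios)
```

In this sketch the gene mutated in only 3 of 100 samples is not flagged by the frequency test, which is the kind of low-frequency gene that a purely rate-based method would tend to miss.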
In this study, new features based on the entropy and frequency of different mutation types were used along with other ratio-metric features. The aim was to identify important features for TSGs and OGs that can help classify a given gene as a TSG or an OG. A method was outlined for estimating parameters for the given classification algorithm. cTaG was used to predict new pan-cancer driver genes by classifying a list of unlabelled genes. Further, the pan-cancer model was used to identify tissue-specific driver genes that had been missed in the pan-cancer analysis.
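To give a feel for what an entropy feature measures, here is a minimal sketch in Python that computes the Shannon entropy of where mutations fall within a gene. The exact feature definitions used in cTaG are not given in this article, so the function name, the choice of mutation positions as input, and the example data are all assumptions made for illustration. A commonly cited intuition is that mutations clustered at a few positions give low entropy, while mutations scattered along the gene give high entropy.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `labels`.

    `labels` can be any iterable of hashable items, for example mutation
    positions or mutation types observed for one gene.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Hypothetical mutation positions for two imaginary genes.
hotspot_positions = [600] * 9 + [601]            # almost all at one site
scattered_positions = [12, 85, 140, 233, 310, 402, 455, 518, 590, 677]

print(shannon_entropy(hotspot_positions))    # lower entropy: clustered
print(shannon_entropy(scattered_positions))  # higher entropy: spread out
```

A value like this can be placed alongside mutation-type frequencies as one column in a feature table that a classifier is then trained on.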
The authors validated their predictions by showing that they included known TSGs and OGs along with new driver genes. Functional analysis as well as the literature was used to further validate the new driver genes. The predictions were also compared with those of MutSigCV, a widely used tool for detecting driver genes. It was found that cTaG identified not only known driver genes, but also previously unknown driver genes with low mutation rates.
Although cTaG was successful overall, some challenges remain to be addressed. Only binary classification was done, labelling genes as either TSGs or OGs. However, not all genes containing mutations are driver genes; in fact, a majority of genes are neutral. Nevertheless, there was an improved recall of TSGs and OGs compared to previously proposed methods. Importantly, many potential TSGs and OGs were predicted, which would be useful for further experimental investigation. The authors plan to extend their work to address key challenges, such as the personalised prediction of driver mutations in a cancer-specific fashion.
Dr. Sabarinathan Radhakrishnan from the National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bengaluru, India, gave the following views on this paper: “The cancer genome data provides a gateway to explore and study the pattern of mutations underlying the cancer driver genes and their mode of action (oncogene or tumour suppressor). In this study, the authors have employed machine learning approaches to mine somatic mutations from multiple cancer genomes across different tissue types and determined distinct mutational features associated with the driver genes (TSG and OG). Further, they developed a pan-cancer model (after careful evaluation of the derived features and overcame problems with overfitting) to predict novel cancer driver genes, their mode of action and tissue specificity (if any). One of the main advantages of this approach is that it can help to discover rare driver genes (that occur in low frequency), which could be missed by the mutation frequency-based driver detection approaches.”
The authors are affiliate members of the Centre for Integrative Biology and Systems mEdicine (IBSE) and the Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI).
Article by Akshay Anantharaman
Here is the original link to the paper:
https://www.nature.com/articles/s41598-021-04015-y