Guennouni Assimi, Mme Salma (2023) PRE - Research Project, ENSTA.



Causal discovery, the search for hidden cause-and-effect relationships, is central to many scientific disciplines, including genetics. Directed Acyclic Graphs (DAGs) play a significant role in this process and are key mathematical objects in many machine learning tasks. One prominent application of DAGs is the representation of causal relationships among variables: variables are depicted as nodes, and directed edges indicate causal connections. Learning the DAG structure, i.e., the existence and direction of the edges, is therefore essential to causal discovery in fields such as biology, economics, and planning. The standard way to discover causal relationships is to use data from carefully controlled, randomized experiments. However, such experimental data can be difficult or ethically impractical to obtain in many scenarios. Consequently, a more common but also more challenging setting is DAG learning from observational data. Additionally, the number of possible DAGs grows super-exponentially with the number of variables, and DAG structure learning is an NP-hard problem. New methods have been developed to tackle these challenges. This report is mainly based on the work of Bertrand Charpentier, Simon Kibler, and Stephan Günnemann, "Differentiable DAG Sampling" (2022), and on their software implementation in Python and PyTorch. They propose a new differentiable probabilistic model over DAGs (DP-DAG) and a new DAG learning method that combines DP-DAG with Variational Inference. The authors report promising results, with gains in both quality and speed over existing methods. Nevertheless, their algorithm was originally proposed for Gaussian (normal) data.
Therefore, to overcome this limitation, we explored extending it to other known parametric distributions, in particular distributions for count data such as the Poisson, Negative Binomial, and zero-inflated models, which can be useful in genetics. Specifically, many models for single-cell RNA sequencing (scRNA-seq) data are built on the Poisson distribution, extended where necessary to the Negative Binomial (NB) or Zero-Inflated Negative Binomial (ZINB) to capture additional variation [16]. The aim of my research internship is precisely to extend the algorithm to accommodate such count data, and this report develops that extension.

Item Type: Thesis (PRE - Research Project)
Uncontrolled Keywords: Variational Autoencoder
Subjects: Information and Communication Sciences and Technologies
Mathematics and Applications
ID Code: 9596
Deposited On: 28 August 2023 09:41
Last Modified: 28 August 2023 09:41
