Ghattas

Badih Ghattas

Researchers

Aix-Marseille Université

Faculté d'économie et de gestion (FEG)

Econometrics, finance and mathematical methods

Status

Full Professor (Professeur des universités)

PhD thesis

2000

Université de la Méditerranée

Contact

badih.ghattas[at]univ-amu.fr

Address

Îlot Bernard du Bois

AMU - AMSE
5-9 Boulevard Maurice Bourdet, CS 50498
13205 Marseille Cedex 1

Website

http://ghattas.free.fr/

Publications

Clustering Approaches for Mixed-Type Data: A Comparative Study
Journal article
Badih Ghattas and Alvaro Sanchez San-Benito, Journal of Probability and Statistics, Volume 2025, Issue 1, pp. 2242100, 2025

Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study presents the state of the art of these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), the mixture of Bayesian networks (MBNs), and the latent class model (LCM). The aim is to provide insights into the behavior of the different methods across a wide range of scenarios by varying experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and the clusters' distribution. The degree of cluster overlap, the proportion of continuous variables in the dataset, and the sample size have a significant impact on the observed performance. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods demonstrated satisfactory performance. In our experiments, KAMILA, LCM, and k-prototypes exhibited the best performance with respect to the adjusted Rand index (ARI). All the methods are available in R.
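As an illustration of the evaluation criterion named in this abstract, the adjusted Rand index can be computed from the contingency table of two partitions. The sketch below is a minimal standard-library version of the usual Hubert-Arabie formula; the function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions, not code from the paper (which reports its methods in R).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must have the same length")
    n = len(labels_a)
    # Pair counts inside each contingency cell, each row and each column.
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: partitions trivially agree
    expected = sum_a * sum_b / total_pairs
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the index ignores how clusters are numbered.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, unrelated partitions score around 0 and can even score below 0, while identical partitions score 1 regardless of label names.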

Fully automated epicardial adipose tissue volume quantification with deep learning and relationship with CAC score and micro/macrovascular complications in people living with type 2 diabetes: the multicenter EPIDIAB study
Journal article
Bénédicte Gaborit, Jean Baptiste Julla, Joris Fournel, Patricia Ancel, Astrid Soghomonian, Camille Deprade, Adèle Lasbleiz, Marie Houssays, Badih Ghattas, Pierre Gascon, et al., Cardiovascular Diabetology, Volume 23, Issue 1, pp. 328, 2024

The aim of this study (EPIDIAB) was to assess the relationship between epicardial adipose tissue (EAT) and the micro and macrovascular complications (MVC) of type 2 diabetes (T2D).

Textual data for electricity load forecasting
Journal article
David Obst, Sandra Claudel, Jairo Cugliari, Badih Ghattas, Yannig Goude and Georges Oppenheim, Quality and Reliability Engineering International, Volume 40, Issue 8, pp. 4187-4208, 2024

Traditional mid-term electricity forecasting models rely on calendar and meteorological information such as temperature and wind speed to achieve high performance. However, depending on such variables has drawbacks, as they may not be informative enough during extreme weather. While ubiquitous, textual sources of information are hardly included in prediction algorithms for time series, despite the relevant information they may contain. In this work, we propose to leverage openly accessible weather reports for electricity demand and meteorological time series prediction problems. Our experiments on French and British load data show that the considered textual sources improve the overall accuracy of the reference model, particularly during extreme weather events such as storms or abnormal temperatures. Additionally, we apply our approach to the problem of imputing missing values in meteorological time series, and we show that our text-based approach beats standard methods. Furthermore, the influence of words on the time series predictions can be interpreted for the considered encoding schemes of the text, leading to greater confidence in our results.

Left Ventricular Trabeculations at Cardiac MRI: Reference Ranges and Association with Cardiovascular Risk Factors in UK Biobank
Journal article
Nay Aung, Axel Bartoli, Elisa Rauseo, Sébastien Cortaredona, Mihir M. Sanghvi, Joris Fournel, Badih Ghattas, Mohammed Y. Khanji, Steffen E. Petersen and Alexis Jacquier, Radiology, Volume 311, Issue 1, pp. e232455, 2024

Background: The extent of left ventricular (LV) trabeculation and its relationship with cardiovascular (CV) risk factors is unclear.
Purpose: To apply automated segmentation to UK Biobank cardiac MRI scans to (a) assess the association between individual characteristics and CV risk factors and trabeculated LV mass (LVM) and (b) establish normal reference ranges in a selected group of healthy UK Biobank participants.
Materials and Methods: In this cross-sectional secondary analysis, prospectively collected data from the UK Biobank (2006 to 2010) were retrospectively analyzed. Automated segmentation of trabeculations was performed using a deep learning algorithm. After excluding individuals with known CV diseases, White adults without CV risk factors (reference group) and those with preexisting CV risk factors (hypertension, hyperlipidemia, diabetes mellitus, or smoking) (exposed group) were compared. Multivariable regression models, adjusted for potential confounders (age, sex, and height), were fitted to evaluate the associations between individual characteristics and CV risk factors and trabeculated LVM.
Results: Of 43 038 participants (mean age, 64 years ± 8 [SD]; 22 360 women), 28 672 individuals (mean age, 66 years ± 7; 14 918 men) were included in the exposed group, and 7384 individuals (mean age, 60 years ± 7; 4729 women) were included in the reference group. Higher body mass index (BMI) (β = 0.66 [95% CI: 0.63, 0.68]; P < .001), hypertension (β = 0.42 [95% CI: 0.36, 0.48]; P < .001), and higher physical activity level (β = 0.15 [95% CI: 0.12, 0.17]; P < .001) were associated with higher trabeculated LVM. In the reference group, the median trabeculated LVM was 6.3 g (IQR, 4.7–8.5 g) for men and 4.6 g (IQR, 3.4–6.0 g) for women. Median trabeculated LVM decreased with age for men from 6.5 g (IQR, 4.8–8.7 g) at age 45–50 years to 5.9 g (IQR, 4.3–7.8 g) at age 71–80 years (P = .03).
Conclusion: Higher trabeculated LVM was observed with hypertension, higher BMI, and higher physical activity level. Age- and sex-specific reference ranges of trabeculated LVM in a healthy middle-aged White population were established.
© RSNA, 2024

Subsampling under distributional constraints
Journal article
Florian Combes, Ricardo Fraiman and Badih Ghattas, Statistical Analysis and Data Mining: The ASA Data Science Journal, Volume 17, Issue 1, pp. e11661, 2024

Complex models are frequently employed to describe physical and mechanical phenomena. In this setting, we have an input $X$ in a general space and an output $Y = f(X)$, where $f$ is a very complicated function whose evaluation for every new input has a very high computational cost and may also be very expensive. We are given two sets of observations of $X$, $S_1$ and $S_2$, of different sizes, such that only $f(S_1)$ is available. We tackle the problem of selecting a subset $S_3 \subset S_2$ of smaller size on which to run the complex model $f$, such that the empirical distribution of $f(S_3)$ is close to that of $f(S_1)$. We suggest three algorithms to solve this problem and show their efficiency using simulated datasets and the Airfoil self-noise data set.

Finding the best trade-off between performance and interpretability in predicting hospital length of stay using structured and unstructured data
Journal article
Franck Jaotombo, Luca Adorni, Badih Ghattas and Laurent Boyer, PLoS ONE, Volume 18, Issue 11, pp. e0289795, 2023

Objective: This study aims to develop high-performing Machine Learning and Deep Learning models in predicting hospital length of stay (LOS) while enhancing interpretability. We compare performance and interpretability of models trained only on structured tabular data with models trained only on unstructured clinical text data, and on mixed data.
Methods: The structured data was used to train fourteen classical Machine Learning models including advanced ensemble trees, neural networks and k-nearest neighbors. The unstructured data was used to fine-tune a pre-trained Bio Clinical BERT Transformer Deep Learning model. The structured and unstructured data were then merged into a tabular dataset after vectorization of the clinical text and a dimensional reduction through Latent Dirichlet Allocation. The study used the free and publicly available Medical Information Mart for Intensive Care (MIMIC) III database, on the open AutoML Library AutoGluon. Performance is evaluated with respect to two types of random classifiers, used as baselines.
Results: The best model from structured data demonstrates high performance (ROC AUC = 0.944, PRC AUC = 0.655) with limited interpretability, where the most important predictors of prolonged LOS are the level of blood urea nitrogen and of platelets. The Transformer model displays a good but lower performance (ROC AUC = 0.842, PRC AUC = 0.375) with a richer array of interpretability by providing more specific in-hospital factors including procedures, conditions, and medical history. The best model trained on mixed data satisfies both a high level of performance (ROC AUC = 0.963, PRC AUC = 0.746) and a much larger scope in interpretability including pathologies of the intestine, the colon, and the blood; infectious diseases, respiratory problems, procedures involving sedation and intubation, and vascular surgery.
Conclusions: Our results outperform most of the state-of-the-art models in LOS prediction both in terms of performance and of interpretability. Data fusion between structured and unstructured text data may significantly improve performance and interpretability.

Machine Learning Alternatives to Response Surface Models
Journal article
Badih Ghattas and Diane Manzon, Mathematics, Volume 11, Issue 15, pp. 3406, 2023

In the Design of Experiments, we seek to relate response variables to explanatory factors. Response Surface Methodology (RSM) approximates the relation between output variables and a polynomial transform of the explanatory variables using a linear model. Some researchers have tried to fit other types of models, mainly nonlinear and nonparametric. We present a large panel of Machine Learning approaches that may be good alternatives to the classical RSM approximation. The state of the art of such approaches is given, including classification and regression trees, ensemble methods, support vector machines, neural networks and also direct multi-output approaches. We survey the subject and illustrate the use of ten such approaches using simulations and a real use case. In our simulations, the underlying model is linear in the explanatory factors for one response and nonlinear for the others. We focus on the advantages and disadvantages of the different approaches and show how their hyperparameters may be tuned. Our simulations show that even when the underlying relation between the response and the explanatory variables is linear, the RSM approach is outperformed by the direct neural network multivariate model, for any sample size (<50) and much more for very small samples (15 or 20). When the underlying relation is nonlinear, the RSM approach is outperformed by most of the machine learning approaches for small samples (n ≤ 30).
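To make the classical RSM baseline described in this abstract concrete, the sketch below fits a second-order response surface by ordinary least squares on a polynomial expansion of the factors. It is only an illustration under assumptions: the helper names (`quadratic_features`, `fit_rsm`, `predict_rsm`), the two-factor design, and the noise-free synthetic response are hypothetical and do not come from the paper.

```python
import numpy as np


def quadratic_features(X):
    """Expand factors into intercept, linear, and second-order terms.

    Column order: 1, x_1..x_p, then x_j * x_k for all j <= k.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    cols = [np.ones(n)]
    cols += [X[:, j] for j in range(p)]
    cols += [X[:, j] * X[:, k] for j in range(p) for k in range(j, p)]
    return np.column_stack(cols)


def fit_rsm(X, y):
    """Least-squares fit of a linear model on the polynomial transform."""
    beta, *_ = np.linalg.lstsq(quadratic_features(X), y, rcond=None)
    return beta


def predict_rsm(beta, X):
    """Evaluate the fitted response surface at new factor settings."""
    return quadratic_features(X) @ beta


# Hypothetical two-factor experiment with a known, noise-free response.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
beta = fit_rsm(X, y)  # order: 1, x1, x2, x1^2, x1*x2, x2^2
```

The model stays linear in its coefficients even though the surface is quadratic in the factors, which is why an ordinary least-squares solver is sufficient here.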

Looking for a hyper polyhedron within the multidimensional space of Design Space from the results of Designs of Experiments
Journal article
Diane Manzon, Badih Ghattas, Magalie Claeys-Bruno, Sophie Declomesnil, Christophe Carité and Michelle Sergent, Chemometrics and Intelligent Laboratory Systems, Volume 232, pp. 104712, 2023

In pharmaceutical studies, the Quality by Design (QbD) approach is increasingly being implemented to improve product development. Product quality is tested at each step of the manufacturing process, allowing a better process understanding and a better risk management, thus avoiding manufacturing defects. A key element of QbD is the construction of a Design Space (DS), i.e., a region in which the specifications on the output parameters should be met. Among the various possible construction methods, Designs of Experiments (DoE), and more precisely Response Surface Methodology, represent a perfectly adapted tool. The DS obtained may have any geometrical shape; consequently, the acceptable variation range of an input may depend on the value of other inputs. However, the experimenters would like to directly know the variation range of each input so that their variation domains are independent. In this context, we developed a method to determine the “Proven Acceptable Independent Range” (PAIR). It consists of looking for all the hyper polyhedra included in the multidimensional DS and selecting a hyper polyhedron according to various strategies. We will illustrate the performance of our method on different DoE cases.

We introduce a new clustering procedure specialized for Big Data. It is inspired by the work of [1] and applies a MapReduce procedure to any base clustering algorithm: splitting the data set at hand, clustering the subsamples, and combining the intermediate results. We thus use a high-level parallelization that runs a base clustering approach on small samples. We analyse our approach in detail, exploring various alternatives and showing its efficiency by simulations.
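The split, cluster, and combine scheme outlined in this abstract can be sketched as follows. This is a hedged illustration and not the authors' procedure: it assumes k-means as the base algorithm, combines intermediate results by clustering the pooled local centres, and runs the "map" step sequentially for simplicity. The names `kmeans` and `split_cluster_combine` and the synthetic data are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-first initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance from every point to its closest already-chosen centre.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return centers


def split_cluster_combine(X, k, n_chunks):
    """Cluster each chunk separately, then cluster the pooled local centres."""
    chunks = np.array_split(X, n_chunks)                # split the data set
    local = np.vstack([kmeans(c, k) for c in chunks])   # "map": base clustering
    centers = kmeans(local, k)                          # "reduce": combine
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dist.argmin(axis=1)


# Synthetic data: two well-separated groups in shuffled order.
rng = np.random.default_rng(0)
truth = rng.permutation(np.repeat([0, 1], 200))
X = rng.normal(scale=0.3, size=(400, 2)) + 10.0 * truth[:, None]
centers, labels = split_cluster_combine(X, k=2, n_chunks=4)
```

Since each chunk is clustered independently, the list comprehension in the "map" step could be handed to a process pool or a distributed framework without changing the combine step.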

Modernizing quality of life assessment: development of a multidimensional computerized adaptive questionnaire for patients with schizophrenia
Journal article
Pierre Michel, Karine Baumstarck, Christophe Lançon, Badih Ghattas, Anderson Loundou, Pascal Auquier and Laurent Boyer, Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, Volume 27, Issue 4, pp. 1041-1054, 2018

OBJECTIVE: Quality of life (QoL) is still assessed using paper-based and fixed-length questionnaires, which is one reason why QoL measurements have not been routinely implemented in clinical practice. Providing new QoL measures that combine computer technology with modern measurement theory may enhance their clinical use. The aim of this study was to develop a QoL multidimensional computerized adaptive test (MCAT), the SQoL-MCAT, from the fixed-length SQoL questionnaire for patients with schizophrenia.
METHODS: In this multicentre cross-sectional study, we collected sociodemographic information, clinical characteristics (i.e., duration of illness, the PANSS, and the Calgary Depression Scale), and quality of life (i.e., SQoL). The development of the SQoL-MCAT was divided into three stages: (1) multidimensional item response theory (MIRT) analysis, (2) multidimensional computerized adaptive test (MCAT) simulations with analyses of accuracy and precision, and (3) external validity.
RESULTS: Five hundred and seventeen patients participated in this study. The MIRT analysis found that all items displayed good fit with the multidimensional graded response model, with satisfactory reliability for each dimension. The SQoL-MCAT was 39% shorter than the fixed-length SQoL questionnaire and had satisfactory accuracy (levels of correlation >0.9) and precision (standard error of measurement <0.55 and root mean square error <0.3). External validity was confirmed via correlations between the SQoL-MCAT dimension scores and symptomatology scores.
CONCLUSION: The SQoL-MCAT is the first computerized adaptive QoL questionnaire for patients with schizophrenia. Tailored for patient characteristics and significantly shorter than the paper-based version, the SQoL-MCAT may improve the feasibility of assessing QoL in clinical practice.