A filter approach for feature selection in classification: application to automatic atrial fibrillation detection in electrocardiogram recordings
Journal article
Pierre Michel, Nicolas Ngo, Jean-François Pons, Stéphane Delliaux and Roch Giorgi, BMC Medical Informatics and Decision Making, Volume 21, Issue Suppl 4, pp. 130, 2021

In high-dimensional data analysis, the complexity of predictive models can be reduced by selecting the most relevant features, which is crucial to reduce data noise and increase model accuracy and interpretability. Thus, in the field of clinical decision making, only the most relevant features from a set of medical descriptors should be considered when determining whether a patient is healthy or not. This statistical approach, known as feature selection, can be performed through regression or classification, in a supervised or unsupervised manner. Several feature selection approaches using different mathematical concepts have been described in the literature. In the field of classification, a new approach has recently been proposed that uses the γ-metric, an index measuring separability between different classes in heart rhythm characterization. The present study proposes a filter approach for feature selection in classification using this γ-metric, and evaluates its application to automatic atrial fibrillation detection.

The stability and prediction performance of the γ-metric feature selection approach were evaluated using the support vector machine model on two heart rhythm datasets, one extracted from the PhysioNet database and the other from the database of Marseille University Hospital Center, France (Timone Hospital). Both datasets contained electrocardiogram recordings grouped into two classes: normal sinus rhythm and atrial fibrillation. The performance of this feature selection approach was compared to that of three other approaches, with the first two based on the Random Forest technique and the other on receiver operating characteristic curve analysis.

The γ-metric approach showed satisfactory results, especially for models with a smaller number of features. For the training dataset, all prediction indicators were higher for our approach (accuracy greater than 99% for models with 5 to 17 features), as was stability (greater than 0.925 regardless of the number of features included in the model). For the validation dataset, the features selected with the γ-metric approach differed from those selected with the other approaches; sensitivity was higher for our approach, but other indicators were similar.

This filter approach for feature selection in classification opens up new methodological avenues for atrial fibrillation detection using short electrocardiogram recordings.
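
As a rough illustration of what a filter approach does (score each feature independently of any classifier, rank the features, keep the top ones), here is a minimal Python sketch. It does not implement the γ-metric, whose formula is not given in this abstract; a simple two-class Fisher score is swapped in as the separability index. The function names `fisher_score` and `select_features` and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Two-class Fisher score: squared gap between class means divided by
    the sum of the class variances. Stand-in index, NOT the gamma-metric."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    gap = (mean(class0) - mean(class1)) ** 2
    spread = pvariance(class0) + pvariance(class1)
    if spread > 0:
        return gap / spread
    # Zero within-class variance: perfectly separated if the means differ.
    return float("inf") if gap > 0 else 0.0

def select_features(rows, labels, k):
    """Score every feature on its own, rank by score, keep the k best indices."""
    n_features = len(rows[0])
    scores = [fisher_score([row[j] for row in rows], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy data (hypothetical): feature 0 separates the two classes, feature 1 does not.
rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 3.0], [2.1, 4.0], [2.2, 2.0], [2.3, 3.5]]
labels = [0, 0, 0, 1, 1, 1]
selected = select_features(rows, labels, k=1)
```

The selected indices would then be passed to a classifier such as a support vector machine; because the scoring step never consults that classifier, the selection is a filter rather than a wrapper.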

Application of Functional Data Analysis to Identify Patterns of Malaria Incidence, to Guide Targeted Control Strategies
Journal article
Sokhna Dieng, Pierre Michel, Abdoulaye Guindo, Kankoe Sallah, El-Hadj Ba, Badara Cissé, Maria Patrizia Carrieri, Cheikh Sokhna, Paul Milligan and Jean Gaudart, International Journal of Environmental Research and Public Health, Volume 17, Issue 11, pp. 4168, 2020

We introduce an approach based on functional data analysis to identify patterns of malaria incidence to guide effective targeting of malaria control in a seasonal transmission area. Using a functional data method, a smooth function (functional data or curve) was fitted to the time series of observed malaria incidence for each of 575 villages in west-central Senegal from 2008 to 2012. These 575 smooth functions were classified using hierarchical clustering (Ward’s method) with several different dissimilarity measures. Validity indices were used to determine the number of distinct temporal patterns of malaria incidence. Epidemiological indicators characterizing the resulting malaria incidence patterns were determined from the velocity and acceleration of their incidences over time. We identified three distinct patterns of malaria incidence: high-, intermediate-, and low-incidence patterns in 2% (12/575), 17% (97/575), and 81% (466/575) of villages, respectively. Epidemiological indicators characterizing the fluctuations in malaria incidence showed that seasonal outbreaks started later, and ended earlier, in the low-incidence pattern. Functional data analysis can be used to identify patterns of malaria incidence by considering their temporal dynamics. Epidemiological indicators derived from their velocities and accelerations may help target control measures according to these patterns.
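
The velocity and acceleration mentioned above are the first and second derivatives of the smoothed incidence curve. The Python sketch below shows the idea on a discretized curve, under loud assumptions: a moving average stands in for the functional smoothing used in the study (which is not detailed in this abstract), derivatives are approximated by central finite differences with a unit time step, and the function names are hypothetical.

```python
def moving_average(series, window=3):
    """Crude smoother (assumption: a stand-in for fitting a smooth function)."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def velocity(curve):
    """Central first difference at interior time points (unit time step)."""
    return [(curve[i + 1] - curve[i - 1]) / 2 for i in range(1, len(curve) - 1)]

def acceleration(curve):
    """Central second difference at interior time points (unit time step)."""
    return [curve[i + 1] - 2 * curve[i] + curve[i - 1]
            for i in range(1, len(curve) - 1)]

# Check on a curve with known derivatives: f(t) = t^2, so f' = 2t and f'' = 2.
quadratic = [t * t for t in range(6)]
v = velocity(quadratic)      # values at t = 1..4
a = acceleration(quadratic)  # values at t = 1..4
```

On an incidence curve, a sustained positive velocity marks a growing outbreak and a sign change from positive to negative marks its peak, which is the kind of information the epidemiological indicators summarize.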

The Patient-Reported Experience Measure for Improving qUality of care in Mental health (PREMIUM) project in France: study protocol for the development and implementation strategy
Journal article
Sara Fernandes, Guillaume Fond, Xavier Zendjidjian, Pierre Michel, Karine Baumstarck, Christophe Lançon, Fabrice Berna, Franck Schurhoff, Bruno Aouizerate, Chantal Henry, et al., Patient Preference and Adherence, Volume 13, pp. 165-177, 2019

Measuring the quality and performance of health care is a major challenge in improving the efficiency of a health system. Patient experience is one important measure of the quality of health care, and the use of patient-reported experience measures (PREMs) is recommended. The aims of this project are 1) to develop item banks of PREMs that assess the quality of health care for adult patients with psychiatric disorders (schizophrenia, bipolar disorder, and depression) and to validate computerized adaptive testing (CAT) to support the routine use of PREMs; and 2) to analyze the implementation and acceptability of the CAT among patients, professionals, and health authorities.

This multicenter and cross-sectional study is based on a mixed method approach, integrating qualitative and quantitative methodologies in two main phases: 1) item bank and CAT development based on a standardized procedure, including conceptual work and definition of the domain mapping, item selection, calibration of the item bank and CAT simulations to elaborate the administration algorithm, and CAT validation; and 2) a qualitative study exploring the implementation and acceptability of the CAT among patients, professionals, and health authorities.

The development of a set of PREMs on quality of care in mental health that overcomes the limitations of previous works (ie, allowing national comparisons regardless of the characteristics of patients and care and based on modern testing using item banks and CAT) could help health care professionals and health system policymakers to identify strategies to improve the quality and efficiency of mental health care.

Assessing variable importance in clustering: a new method based on unsupervised binary decision trees
Journal article
Badih Ghattas, Pierre Michel and Laurent Boyer, Computational Statistics, Volume 34, Issue 1, pp. 301-321, 2019

We consider different approaches for assessing variable importance in clustering. We focus on clustering using binary decision trees (CUBT), which is a non-parametric top-down hierarchical clustering method designed for both continuous and nominal data. We suggest a measure of variable importance for this method similar to the one used in Breiman’s classification and regression trees. This score is useful to rank the variables in a dataset, to determine which variables are the most important or to detect the irrelevant ones. We analyze both stability and efficiency of this score on different data simulation models in the presence of noise, and compare it to other classical variable importance measures. Our experiments show that variable importance based on CUBT is much more efficient than other approaches in a large variety of situations.

Analyse du discours médical sur Twitter®. Étude d’un corpus de tweets émis par des médecins généralistes entre juin 2012 et mars 2017 et contenant le hashtag #DocTocToc [Analysis of medical discourse on Twitter®: study of a corpus of tweets posted by general practitioners between June 2012 and March 2017 and containing the hashtag #DocTocToc]
Journal article
Adrien Salles, Jean-Charles Dufour, P. Hassanaly, Pierre Michel, Chloé Cabot and Julien Grosjean, Revue d'Épidémiologie et de Santé Publique, Volume 67, Issue 3, pp. S152-S153, 2019

Information and communication technologies gave rise to Web 2.0, characterized by the introduction and use of new collaborative communication tools such as blogs, wikis, RSS feeds, and social networks. As these tools were adopted, a participatory medicine developed, based on the sharing of information and experience among professionals, patients, and all other health stakeholders. Since June 2012, a medical community has been communicating on Twitter using the hashtag #DocTocToc, contributing to the emergence of e-health on this social network. The objective of this study is to analyze the main themes of the requests made via the hashtag #DocTocToc by general practitioners between June 2012 and March 2017.

Data were collected by web scraping to build a corpus of tweets whose authors were identified manually for sampling purposes, so as to keep only the tweets posted by general practitioners. A preprocessing step transformed forms that natural language processing software might not recognize. The corpus was analyzed using two approaches: a lexical approach using the Iramuteq® software, and terminological indexing using the multi-terminology concept extractor (ECMT) of the Catalogue et index des sites médicaux francophones (CISMeF).

Of the 12,716 tweets collected, 7,366 were written by general practitioners and were analyzed. The lexical approach identified two broad lexical worlds, represented as a dendrogram: one related to medico-administrative requests concerning practice management and the social care of patients, the other related to purely medical requests. The terminological indexing method highlighted the medical specialties generating requests for tele-expertise (gynecology, neurology, infectious diseases, pediatrics, cardiology, and dermatology) and allowed them to be cross-referenced with the purpose of the request (diagnostic or therapeutic).

On Twitter®, the hashtag #DocTocToc is used by general practitioners as a space for informally sharing health information, but also for handling administrative and social problems. #DocTocToc acts as a large-scale practice exchange group in which physicians rely on the opinion of their peers (Fig. 1).

Predicting musculoskeletal disorders risk using tree-based ensemble methods
Journal article
Alain Paraponaris, A. Ba, Ewen Gallic, Q. Liance and Pierre Michel, European Journal of Public Health, Volume 29, Issue Supplement_4, 2019

Musculoskeletal disorders (MSD) can cause short-term disorders and permanent disabilities which may all result in serious limitations in ac

Computerized adaptive testing with decision regression trees: an alternative to item response theory for quality of life measurement in multiple sclerosis
Journal article
Pierre Michel, Karine Baumstarck, Anderson Loundou, Badih Ghattas, Pascal Auquier and Laurent Boyer, Patient Preference and Adherence, Volume 12, pp. 1043-1053, 2018

The aim of this study was to propose an alternative approach to item response theory (IRT) in the development of computerized adaptive testing (CAT) in quality of life (QoL) for patients with multiple sclerosis (MS). This approach relied on decision regression trees (DRTs). A comparison with IRT was undertaken based on precision and validity properties.

Materials and methods:
DRT- and IRT-based CATs were applied to items from a unidimensional item bank measuring QoL related to mental health in MS. The DRT-based approach consisted of CAT simulations based on a minsplit parameter that defines the minimal size of nodes in a tree. The IRT-based approach consisted of CAT simulations based on a specified level of measurement precision. The best CAT simulation showed the lowest number of items and the best levels of precision. Validity of the CAT was examined using sociodemographic, clinical and QoL data.

CAT simulations were performed using the responses of 1,992 MS patients. The DRT-based CAT algorithm with minsplit = 10 was the most satisfactory model, superior to the best IRT-based CAT algorithm. This CAT administered an average of nine items and showed satisfactory precision indicators (R = 0.98, root mean square error [RMSE] = 0.18). The DRT-based CAT showed convergent validity as its score correlated significantly with other QoL scores and showed satisfactory discriminant validity.
Conclusion: We presented a new adaptive testing algorithm based on DRT, which has a level of performance equivalent to that of the IRT-based approach. The use of DRT is a natural and intuitive way to develop CAT, and this approach may be an alternative to IRT.
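
To make the DRT idea concrete, the Python sketch below grows a small regression tree whose internal nodes are questionnaire items and whose leaves hold score estimates, then administers it adaptively by asking one item per level. This is an illustrative sketch, not the authors' algorithm: items are assumed binary (0/1), splits minimize the within-child sum of squared errors, `minsplit` is interpreted as the smallest node size at which a split is still attempted, and all names and data are hypothetical.

```python
from itertools import product
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(responses, scores, items, minsplit):
    """Grow a regression tree; each split is one item (assumed binary 0/1)."""
    node = {"estimate": mean(scores)}
    if len(scores) < minsplit or not items:
        return node  # node too small to split: it becomes a leaf
    best = None
    for item in items:
        left = [s for r, s in zip(responses, scores) if r[item] == 0]
        right = [s for r, s in zip(responses, scores) if r[item] == 1]
        if not left or not right:
            continue  # this item does not separate the respondents
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, item)
    if best is None:
        return node
    item = best[1]
    remaining = [i for i in items if i != item]
    node["item"] = item
    node["children"] = {}
    for answer in (0, 1):
        keep = [k for k, r in enumerate(responses) if r[item] == answer]
        node["children"][answer] = build_tree(
            [responses[k] for k in keep], [scores[k] for k in keep],
            remaining, minsplit)
    return node

def administer(tree, answer_fn):
    """Ask the item at each node until a leaf; return (estimate, items asked)."""
    asked, node = [], tree
    while "item" in node:
        asked.append(node["item"])
        node = node["children"][answer_fn(node["item"])]
    return node["estimate"], asked

# Toy calibration sample (hypothetical): all 8 answer patterns on 3 items.
items = ["q1", "q2", "q3"]
responses = [dict(zip(items, p)) for p in product((0, 1), repeat=3)]
scores = [4 * r["q1"] + 2 * r["q2"] + r["q3"] for r in responses]
tree = build_tree(responses, scores, items, minsplit=4)

patient = {"q1": 1, "q2": 0, "q3": 1}
estimate, asked = administer(tree, lambda item: patient[item])
```

With `minsplit=4` the tree stops after two items for every respondent, which shows how a larger `minsplit` shortens the test at the cost of a coarser score estimate.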

Clustering based on unsupervised binary trees to define subgroups of cancer patients according to symptom severity in cancer
Journal article
Pierre Michel, Zeinab Hamidou, Karine Baumstarck, Badih Ghattas, Noémie Resseguier, Olivier Chinot, Fabrice Barlesi, Sébastien Salas, Laurent Boyer and Pascal Auquier, Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, Volume 27, Issue 2, pp. 555-565, 2018

Studies have suggested that clinicians do not feel comfortable with the interpretation of symptom severity, functional status, and quality of life (QoL). Implementation strategies of these types of measurements in clinical practice imply that consensual norms and guidelines regarding data interpretation are available. The aim of this study was to define subgroups of patients according to the levels of symptom severity using a method of interpretable clustering that uses unsupervised binary trees.

The patients were classified using a top-down hierarchical method: Clustering using Unsupervised Binary Trees (CUBT). We considered a three-group structure: "high", "moderate", and "low" level of symptom severity. The clustering tree was based on three stages using the 9-symptom scale scores of the EORTC QLQ-C30: a maximal tree was first developed by applying a recursive partitioning algorithm; the tree was then pruned using a criterion of minimal dissimilarity; finally, the most similar clusters were joined together. Inter-cluster comparisons were performed to test the sample partition and QoL data.

Two hundred thirty-five patients with different types of cancer were included. The three-cluster structure classified 143 patients with "low", 46 with "moderate", and 46 with "high" levels of symptom severity. This partition was explained by cut-off values on Fatigue and Appetite Loss scores. The three clusters consistently differentiated patients based on the clinical characteristics and QoL outcomes.

Our study suggests that CUBT is relevant to define the levels of symptom severity in cancer. This finding may have important implications for helping clinicians to interpret symptom profiles in clinical practice, to identify individuals at risk for poorer outcomes and implement targeted interventions.

Modernizing quality of life assessment: development of a multidimensional computerized adaptive questionnaire for patients with schizophrenia
Journal article
Pierre Michel, Karine Baumstarck, Christophe Lançon, Badih Ghattas, Anderson Loundou, Pascal Auquier and Laurent Boyer, Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, Volume 27, Issue 4, pp. 1041-1054, 2018

OBJECTIVE: Quality of life (QoL) is still assessed using paper-based and fixed-length questionnaires, which is one reason why QoL measurements have not been routinely implemented in clinical practice. Providing new QoL measures that combine computer technology with modern measurement theory may enhance their clinical use. The aim of this study was to develop a QoL multidimensional computerized adaptive test (MCAT), the SQoL-MCAT, from the fixed-length SQoL questionnaire for patients with schizophrenia.
METHODS: In this multicentre cross-sectional study, we collected sociodemographic information, clinical characteristics (i.e., duration of illness, the PANSS, and the Calgary Depression Scale), and quality of life (i.e., SQoL). The development of the SQoL-MCAT was divided into three stages: (1) multidimensional item response theory (MIRT) analysis, (2) multidimensional computerized adaptive test (MCAT) simulations with analyses of accuracy and precision, and (3) external validity.
RESULTS: Five hundred and seventeen patients participated in this study. The MIRT analysis found that all items displayed good fit with the multidimensional graded response model, with satisfactory reliability for each dimension. The SQoL-MCAT was 39% shorter than the fixed-length SQoL questionnaire and had satisfactory accuracy (levels of correlation >0.9) and precision (standard error of measurement <0.55 and root mean square error <0.3). External validity was confirmed via correlations between the SQoL-MCAT dimension scores and symptomatology scores.
CONCLUSION: The SQoL-MCAT is the first computerized adaptive QoL questionnaire for patients with schizophrenia. Tailored for patient characteristics and significantly shorter than the paper-based version, the SQoL-MCAT may improve the feasibility of assessing QoL in clinical practice.

Évaluation empirique d’une nouvelle méthode multivariée de sélection de variables en classification supervisée : la métrique γ [Empirical evaluation of a new multivariate variable selection method in supervised classification: the γ-metric]
Journal article
Pierre Michel, J.-F. Pons, R. Giorgi and Stéphane Delliaux, Revue d'Épidémiologie et de Santé Publique, Volume 66, Issue 3, pp. S137-S138, 2018

Introduction:
In the analysis of massive health data, it is preferable to consider only the most important variables for a given model in order to reduce computation time. For example, to characterize a patient's physiological state from medical descriptors, only the most relevant variables should be retained in order to improve clinical decision support. This approach, called variable selection, can be applied in regression or classification, in a supervised or unsupervised manner. Many methods exist, relying on different approaches or metrics with specific mathematical properties. In supervised classification, a new variable selection method based on a separability index, the γ-metric, has recently been proposed (Pons et al., 2017). The objective of this work is to study the performance of this method empirically.

Methods:
The γ-metric measures the separability between several classes of observations. It relies on computing the eigenvectors and eigenvalues of the covariance matrix of each class in order to select the subset of variables that maximizes between-class separability. We compared this metric, using cross-validation, with classical methods. All methods were applied to three benchmark medical datasets in the field of diagnosis prediction. For each dataset, we evaluated the effectiveness of this method against its competitors in terms of classification performance indices and the number of variables selected.

Results:
Table 1 contains the means of the performance indices obtained for each dataset. The cross-validation results show better performance for the γ-metric-based method on two of the three datasets used. For the data on patients with cancer, this method always outperforms its competitors in terms of performance indices and improves on the model containing the initial variables.

Conclusion:
On these empirical data, which regularly serve as a benchmark, the γ-metric performed well. These preliminary results are of interest for the future implementation of automatic diagnosis strategies based on other types of massive data, for example data from connected devices.