Anne Ruiz-Gazen
IBD Salle 21
AMU - AMSE
5-9 boulevard Maurice Bourdet
13001 Marseille
Michel Lubrano: michel.lubrano[at]univ-amu.fr
Pierre Michel: pierre.michel[at]univ-amu.fr
Combining survey sample data with big data is an important current challenge in finite population inference. Survey sample data are obtained through a probability sampling design, whereas big data usually consist of non-probability samples. Many well-known unbiased or approximately unbiased methods exist for estimating finite population parameters from a probability sample. Inference from a non-probability sample, however, is often subject to selection bias. Kim and Tam (2021) recently proposed a data integration approach that incorporates a probability sample to handle the selection bias of non-probability samples. In the first part of the presentation, we revisit their approach and study in detail the efficiency gain of some estimators when probability and non-probability samples are combined. In the second part, we focus on the case where the target variable is not observable in the big data source, while the auxiliary information present in this source is not measured in the probability sample. In this situation, new estimators can be defined by following a prediction approach. These estimators are design-based, model-based, or cosmetic. Their bias and efficiency are studied using theoretical and simulation results. The value of the new estimators is illustrated in the context of the French postal service, where the objective is to estimate monthly postal traffic by combining a survey of mail carriers' rounds with the database of automatically processed mail.
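
As a rough illustration of the selection-bias issue and of why combining the two sources can help, here is a minimal simulation sketch. It is not the estimator studied in the talk, and not necessarily the exact estimator of Kim and Tam (2021). Everything in it is an assumption made for illustration: the synthetic population, the inclusion mechanism of the big data source, the simple random sampling design, and the function name `simulate`. It covers only the first setting, where the target variable is observed in the big data source, and it assumes that membership in the big data source can be identified for every unit of the probability sample.

```python
import random


def simulate(seed=0, N=10_000, n=500):
    """Compare three estimators of a population total (illustrative only)."""
    rng = random.Random(seed)

    # Hypothetical finite population; y is the target variable.
    y = [rng.gauss(50.0, 10.0) for _ in range(N)]
    true_total = sum(y)

    # Non-probability "big data" source B: units with large y are more
    # likely to be included, which creates selection bias (assumed mechanism).
    in_B = [rng.random() < (0.8 if yi > 50.0 else 0.2) for yi in y]
    N_B = sum(in_B)
    total_B = sum(yi for yi, b in zip(y, in_B) if b)

    # Probability sample A: simple random sampling without replacement,
    # so every unit has design weight N / n.
    A = rng.sample(range(N), n)

    # (1) Naive estimator: treat B as if it were a random sample.
    naive = N * total_B / N_B

    # (2) Horvitz-Thompson estimator using the probability sample only.
    ht = (N / n) * sum(y[i] for i in A)

    # (3) Combined estimator: the total over B is known exactly; the
    # remainder is estimated from the sample units that fall outside B.
    outside = [y[i] for i in A if not in_B[i]]
    combined = total_B + (N - N_B) * sum(outside) / len(outside)

    return true_total, naive, ht, combined


true_total, naive, ht, combined = simulate()
for name, est in [("naive", naive), ("HT", ht), ("combined", combined)]:
    print(f"{name:>8}: relative error = {(est - true_total) / true_total:+.4f}")
```

In this sketch the big data source acts as a stratum that is observed completely, so the probability sample is only needed to estimate the part of the population outside it. The naive estimator stays biased however large the big data source is, because its inclusion mechanism depends on the target variable.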