Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies

Overview of attention for article published in BMC Bioinformatics, February 2017

Citations

Dimensions: 60

Readers on

Mendeley: 57
CiteULike: 1

Title	Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies
Published in	BMC Bioinformatics, February 2017
DOI	10.1186/s12859-017-1547-6
Pubmed ID	28219348
Authors	Jasmit S. Shah, Shesh N. Rai, Andrew P. DeFilippis, Bradford G. Hill, Aruni Bhatnagar, Guy N. Brock
Abstract	High-throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise for both technical and biological reasons. Typically, such MVs are substituted by a minimum value, which may lead to different results in downstream analyses. Here we present a modified version of the K-nearest neighbor (KNN) approach which accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches such as KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the limit of detection (LOD). The effectiveness of each approach was assessed by the root mean square error (RMSE) measure as well as the metabolite list concordance index (MLCI) for influence on downstream statistical testing. Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values compared to the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero values, or the LOD. MLCI values for KNN-TN and KNN-EU were roughly equivalent, and superior to the other four methods in most cases. Our findings demonstrate that KNN-TN generally has improved performance in imputing the missing values of the different datasets compared to KNN-CR and KNN-EU when values are missing at random in combination with an LOD. The results shown in this study are in the field of metabolomics, but this method could be applicable to any high-throughput technology that has missing values due to an LOD.
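To make the comparison in the abstract concrete, here is a minimal sketch of how zero, LOD, mean, and Euclidean-distance KNN imputation can be scored by RMSE on the masked entries. It is not the authors' code and does not implement KNN-TN (the truncated-normal estimation step is omitted). The toy matrix, the LOD of 1.5, the choice of rows as metabolites, and all function names are assumptions made for illustration, and the resulting numbers say nothing about the paper's findings.

```python
import math

def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed values."""
    squared = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))

def euclidean_distance(row_a, row_b):
    """Distance over columns observed in both rows, averaged per shared column."""
    shared = [(a, b) for a, b in zip(row_a, row_b) if a is not None and b is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows observing it."""
    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = [
                (euclidean_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no row observes this column
                completed[i][j] = sum(nearest) / len(nearest)
    return completed

def constant_impute(rows, constant):
    """Replace every None with one constant (zero or the LOD)."""
    return [[constant if v is None else v for v in r] for r in rows]

def mean_impute(rows):
    """Replace every None with the mean of the observed values in the same row."""
    completed = []
    for r in rows:
        observed_values = [v for v in r if v is not None]
        row_mean = sum(observed_values) / len(observed_values)
        completed.append([row_mean if v is None else v for v in r])
    return completed

# Hypothetical toy data: rows are metabolites, columns are samples.
truth = [
    [5.0, 6.0, 7.0, 8.0],
    [5.1, 6.1, 7.1, 8.1],
    [4.9, 5.9, 6.9, 7.9],
    [1.0, 2.0, 3.0, 9.0],
]
lod = 1.5  # assumed limit of detection
# One value is missing at random, one is missing because it falls below the LOD.
missing_positions = [(0, 2), (3, 0)]
observed = [list(r) for r in truth]
for i, j in missing_positions:
    observed[i][j] = None

def score(completed):
    """RMSE restricted to the entries that were masked."""
    true_values = [truth[i][j] for i, j in missing_positions]
    imputed_values = [completed[i][j] for i, j in missing_positions]
    return rmse(true_values, imputed_values)

completed_knn = knn_impute(observed, k=2)
results = {
    "zero": score(constant_impute(observed, 0.0)),
    "lod": score(constant_impute(observed, lod)),
    "mean": score(mean_impute(observed)),
    "knn_eu": score(completed_knn),
}
for name, value in results.items():
    print(f"{name}: RMSE = {value:.3f}")
```

In this toy setup the plain Euclidean KNN fills the below-LOD entry from neighbors whose observed values all sit above the LOD, which is the kind of situation a truncation-aware method such as KNN-TN is designed to address.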

Mendeley readers

The data shown below were compiled from readership statistics for 57 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	57	100%

Demographic breakdown

Readers by professional status	Count	As %
Student > Ph. D. Student	18	32%
Researcher	10	18%
Student > Master	3	5%
Student > Doctoral Student	3	5%
Student > Bachelor	2	4%
Other	10	18%
Unknown	11	19%

Readers by discipline	Count	As %
Medicine and Dentistry	8	14%
Agricultural and Biological Sciences	8	14%
Biochemistry, Genetics and Molecular Biology	6	11%
Mathematics	6	11%
Engineering	5	9%
Other	16	28%
Unknown	8	14%