Springer Series in Materials Science, Volume 225

Turab Lookman, Francis J. Alexander, Krishna Rajan (Editors)
Information Science for Materials Discovery and Design

Series editors: Robert Hull, Charlottesville, USA; Chennupati Jagadish, Canberra, Australia; Richard M. Osgood, New York, USA; Jürgen Parisi, Oldenburg, Germany; Tae-Yeon Seong, Seoul, Republic of Korea (South Korea); Shin-ichi Uchida, Tokyo, Japan; Zhiming M. Wang, Chengdu, China

The Springer Series in Materials Science covers the complete spectrum of materials physics, including fundamental principles, physical properties, materials theory and design. Recognizing the increasing importance of materials science in future device technologies, the book titles in this series reflect the state of the art in understanding and controlling the structure and properties of all important classes of materials. More information about this series at http://www.springer.com/series/856

Editors:
Turab Lookman, Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA
Francis J. Alexander, Computer and Computational Sciences Division, Los Alamos National Laboratory, Los Alamos, NM, USA
Krishna Rajan, Department of Materials Design and Innovation, University at Buffalo—The State University of New York, Buffalo, NY, USA

ISSN 0933-033X    ISSN 2196-2812 (electronic)
Springer Series in Materials Science
ISBN 978-3-319-23870-8    ISBN 978-3-319-23871-5 (eBook)
DOI 10.1007/978-3-319-23871-5
Library of Congress Control Number: 2015952059
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper. Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Preface

Accelerating materials discovery has been the theme of a number of reports from the Department of Energy's (DOE) Office of Basic Energy Sciences (BES), the National Science Foundation (NSF), the National Academies, and other government agencies and professional societies. As a driver for accelerating materials discovery, the Materials Genome Initiative, announced by the President, is part of a bold plan to boost US manufacturing over the next few decades by halving the time it takes to discover and design new materials.
In this plan, accelerating discovery relies on using large databases, computation, mathematics, and information science in the materials sciences, in a manner similar to the way they were used to make the Human Genome Initiative a success for the biological sciences. Novel approaches are therefore being called for that can explore the enormous phase space presented by complex materials and processes. If we are to achieve the desired performance gains, then we must have a predictive capability that can guide experiments and computations in the most fruitful directions by reducing the possibilities that need to be tried. Despite advances in computational and experimental techniques to generate large volumes of data to screen the vast search space, it is clear that the outstanding challenge remains to integrate information-theoretic tools and materials knowledge, in the form of constraints imposed by theory, to develop robust, predictive tools for materials design and discovery. The rapidly emerging field of materials informatics provides the critical methodology that enables the discovery, identification, and harnessing of the materials "genes" for accelerated materials discovery and design.

We provide in this book a collection of articles in this nascent field which integrates contributions from the information sciences and materials communities. The collection is partly derived from a workshop held in Santa Fe, New Mexico, February 4–7, 2014, that was organized by the editors and sponsored with support from the Centers for Nonlinear Studies and Information Science and Technology at Los Alamos National Laboratory and the National Science Foundation (Grant # 13-07811). It outlines challenges and opportunities in the use of information-theoretic tools and evaluates the state of the art on a number of materials-motivated problems. Presented are contrasting but complementary approaches, such as those based on high-throughput calculations or experiments, as well as data-driven discovery, together with the merits and challenges of machine-learning and statistical inference methods to accommodate searches within a high-dimensional feature space.

The book is organized into three parts. In the first part, following a perspective of the state of the art in materials design and discovery, Chaps. 2–6 focus largely on information-theoretic tools and how they apply to specific materials problems. Chaps. 2 and 3 discuss how aspects of decision theory within a Bayesian framework can be used for optimal experimental design. In particular, Chap. 2 discusses how to decide on the best pair of experiments for inferring the parameters of a given model, as well as how to choose an experiment to distinguish between competing models. Chapter 3 discusses strategies based on methods for global optimization for choosing the next experiment to find a material with a desired property. Proceeding from problems involving regression to those requiring classification, Chap. 4 focuses on Bayesian methods for classifying objects, especially in the limit of small samples, where classifier design procedures that work well with large samples can have problems when data are limited. The first part of this monograph concludes with Chaps. 5 and 6, which deal with different aspects of clustering. Chapter 5 considers the effectiveness of data visualization algorithms that look for groupings of features and materials.
Chapter 6 discusses how community detection, studied in statistical physics, can be used to partition a complex system into decoupled subsets at different spatial and temporal scales.

The focus of the second part of the book, Chaps. 7–12, is the application of informatics tools to materials science problems. Chapter 7 discusses how parameters in the additive manufacturing process may be constrained by combining simulations and experiments using feature selection and data-driven models. Learning from high-throughput data generated from electronic structure calculations is the emphasis of Chaps. 8–11. Techniques such as principal component analysis (PCA), support vector regression (SVR), partial least squares, and Kriging using Gaussian process modeling suggest new features and materials with specified properties. Chapter 8 shows how suitable dopants in an oxide may be identified for enhancing water-splitting processes. Applications in Chap. 9 include the discovery of cathode materials for lithium-ion batteries and thermoelectrics. Chapter 10 focuses on the layered compounds known as MAX phases, and Chap. 11 discusses ab initio methods and applied crystallography tools for descriptor development to establish structure–property relationships. Chapter 12 describes hybrid methods that integrate statistical learning techniques to extract features from the density of states for predicting elastic properties, such as the bulk modulus, in as-yet unexplored chemistries.

The third and final part, Chaps. 13 and 14, discusses high-throughput experiments, which generate large amounts of data. With appropriate characterization tools, the idea is to quickly identify the subspace of the large parameter space where a new compound with desired properties may be found. Such experiments, together with informatics tools, provide opportunities for "combinatorial materials science." Chap. 13 provides a review in the context of multifunctional materials, and Chap. 14 incorporates aspects of informatics with a focus on solar fuel applications and multicomponent oxide catalysts.

The book is aimed at an interdisciplinary audience, as the subject spans aspects of statistics, computer science, and materials science, and will be of timely appeal to those interested in learning about this emerging field. We are grateful to all the authors for their articles as well as their support of the editorial process.

Los Alamos, USA    Turab Lookman
Los Alamos, USA    Francis J. Alexander
Buffalo, NY        Krishna Rajan

Contents

Part I  Data Analytics and Optimal Learning

1  A Perspective on Materials Informatics: State-of-the-Art and Challenges  3
   T. Lookman, P.V. Balachandran, D. Xue, G. Pilania, T. Shearman, J. Theiler, J.E. Gubernatis, J. Hogden, K. Barros, E. BenNaim and F.J. Alexander
   1.1  Introduction  4
   1.2  Statistical Inference and Design: Towards Accelerated Materials Discovery  5
   1.3  Progress and Concluding Remarks  9
   References  11

2  Information-Driven Experimental Design in Materials Science  13
   R. Aggarwal, M.J. Demkowicz and Y.M. Marzouk
   2.1  Introduction  13
   2.2  The Tools of Optimal Experimental Design  15
        2.2.1  Bayesian Inference  15
        2.2.2  Information Theoretic Objectives  16
        2.2.3  Computational Considerations  18
   2.3  Examples of Optimal Experimental Design  20
        2.3.1  Film-Substrate Systems: Design for Parameter Inference  21
        2.3.2  Heterophase Interfaces: Design for Model Discrimination  29
   2.4  Outlook  37
   References  39

3  Bayesian Optimization for Materials Design  45
   Peter I. Frazier and Jialei Wang
   3.1  Introduction  45
   3.2  Bayesian Optimization  47
   3.3  Gaussian Process Regression  47
        3.3.1  Choice of Covariance Function  49
        3.3.2  Choice of Mean Function  51
        3.3.3  Inference  51
        3.3.4  Inference with Just One Observation  53
        3.3.5  Inference with Noisy Observations  54
        3.3.6  Parameter Estimation  57
        3.3.7  Diagnostics  58
        3.3.8  Predicting at More Than One Point  60
        3.3.9  Avoiding Matrix Inversion  61
   3.4  Choosing Where to Sample  61
        3.4.1  Expected Improvement  62
        3.4.2  Knowledge Gradient  65
        3.4.3  Going Beyond One-Step Analyses, and Other Methods  68
   3.5  Software  69
   3.6  Conclusion  69
   References  73

4  Small-Sample Classification  77
   Lori A. Dalton and Edward R. Dougherty
   4.1  Introduction  77
   4.2  Classification  78
   4.3  Error Estimation  82
   4.4  Validity  84
   4.5  MMSE Error Estimation  87
   4.6  Optimal Bayesian Classification  90
   4.7  The Gaussian Model  91
   4.8  Optimal Bayesian Classifier in the Gaussian Model  94
   4.9  Concluding Remarks  97
   References  99

5  Data Visualization and Structure Identification  103
   J.E. Gubernatis
   5.1  Introduction  103
   5.2  Theory  104
   5.3  Results  106
        5.3.1  The Piezo Data  107
        5.3.2  The Pls Data  108
        5.3.3  The Tree Data  108
   5.4  Concluding Remarks  109
   References  113

6  Inference of Hidden Structures in Complex Physical Systems by Multi-scale Clustering  115
   Z. Nussinov, P. Ronhovde, Dandan Hu, S. Chakrabarty, Bo Sun, Nicholas A. Mauro and Kisor K. Sahu
   6.1  The General Problem  116
   6.2  Ensemble Minimization  117
   6.3  Community Detection and Data Mining  118
   6.4  Multi-scale Community Detection  121
   6.5  Image Segmentation  123
   6.6  Community Detection Phase Diagram  126
   6.7  Casting Complex Materials and Physical Systems as Networks  128
   6.8  Summary  133
   References  135

Part II  Materials Prediction with Data, Simulations and High-throughput Calculations

7  On the Use of Data Mining Techniques to Build High-Density, Additively-Manufactured Parts  141
   Chandrika Kamath
   7.1  Introduction  141
        7.1.1  Additive Manufacturing Using Laser Powder-Bed Fusion  142
   7.2  Optimizing AM Parts for Density: The Current Approach  142
   7.3  A Data Mining Approach Combining Experiments and Simulations  144
        7.3.1  Using Simple Simulations to Identify Viable Parameters  145
        7.3.2  Using Simple Experiments to Evaluate Simulation Results  150
        7.3.3  Determining Density by Building Small Pillars  152
   7.4  Experimental Results  154
   7.5  Summary  154
   References  154

8  Optimal Dopant Selection for Water Splitting with Cerium Oxides: Mining and Screening First Principles Data  157
   V. Botu, A.B. Mhadeshwar, S.L. Suib and R. Ramprasad
   8.1  Introduction  158
   8.2  Screening Framework  160
   8.3  First Principles Studies  160
        8.3.1  Methods and Models  160
        8.3.2  Enforcing the 3-Step Criteria  161
   8.4  Data Analysis  164
        8.4.1  Principal Component Analysis  165
        8.4.2  Random Forest  166
   8.5  Summary and Outlook  168
   References  168

9  Toward Materials Discovery with First-Principles Datasets and Learning Methods  173
   Isao Tanaka and Atsuto Seko
   9.1  Introduction  173
   9.2  High Throughput Screening of DFT Data—Cathode Materials of Lithium ion Batteries  175
   9.3  Combination of DFT Data and Machine Learning I—Melting Temperatures  177
   9.4  Combination of DFT Data and Machine Learning II—Lithium ion Conducting Oxides  182
   9.5  Combination of DFT Data and Machine Learning III—Thermoelectric Materials  185
   References  186

10  Materials Informatics Using Ab initio Data: Application to MAX Phases  187
    Wai-Yim Ching
    10.1  Introduction  187
    10.2  MAX Phases: A Unique Class of Material  189
    10.3  Applications of Materials Informatics to MAX Phases  191
          10.3.1  Initial Screening and Construction of the MAX Database  191
          10.3.2  Representative Results on Mechanical Properties and Electronic Structure of MAX  192
          10.3.3  Classification of Descriptors from the Database and Correlation Among Them  197
          10.3.4  Verification of the Efficacy of the Materials Informatics Tools  198
    10.4  Further Applications of MAX Data  201
          10.4.1  Lattice Thermal Conductivity at High Temperature  201
          10.4.2  Universal Elastic Anisotropy in MAX Phases  203
    10.5  Extension to Other Materials Systems  205
          10.5.1  MAX-Related Systems, MXenes, MAX Solid Solutions, and Similar Layered Structures  205
          10.5.2  CSH-Cement Crystals  206
          10.5.3  Extension to Other Materials Systems: Bulk Metallic Glasses and High Entropy Alloys  209
    10.6  Conclusions  210
    References  211

11  Symmetry-Adapted Distortion Modes as Descriptors for Materials Informatics  213
    Prasanna V. Balachandran, Nicole A. Benedek and James M. Rondinelli
    11.1  Introduction  213
    11.2  Distortion Modes as Descriptors  214
    11.3  Perovskite Nickelates  216
          11.3.1  Statistical Correlation Analysis  217
          11.3.2  Principal Component Analysis (PCA)  218
    11.4  Summary  220
    References  221

12  Discovering Electronic Signatures for Phase Stability of Intermetallics via Machine Learning  223
    Scott R. Broderick and Krishna Rajan
    12.1  Introduction  223
    12.2  Informatics Background and Data Processing  224
    12.3  Informatics-Based Parameterization of the DOS Spectra  228
    12.4  Identifying the Bulk Modulus Fingerprint  233
    12.5  Summary  237
    References  237

Part III  Combinatorial Materials Science with High-throughput Measurements and Analysis

13  Combinatorial Materials Science, and a Perspective on Challenges in Data Acquisition, Analysis and Presentation  241
    Robert C. Pullar
    13.1  Combinatorial Materials Science—20 Years of Progress?  242
    13.2  Combinatorial Materials Synthesis  247
    13.3  High-Throughput Measurement and Analysis  254
    13.4  Data Analysis and Presentation  261
    References  267

14  High Throughput Combinatorial Experimentation + Informatics = Combinatorial Science  271
    Santosh K. Suram, Meyer Z. Pesenson and John M. Gregoire
    14.1  Tailoring Material Function Through Material Complexity: The Utility of High Throughput and Combinatorial Methods  272
    14.2  Materials Datasets as an Instance of Big Data  272
    14.3  High Throughput Experimental Pipelines: The Example of Solar Fuels Materials Discovery  275
    14.4  An Illustrative Dataset: Ni-Fe-Co-Ce Oxide Electrocatalysts for the Oxygen Evolution Reaction  276
    14.5  Automating Sample Down-Selection for Maximal Information Retention: Clustering by Composition-Property Relationships  277
          14.5.1  Down-Selection for Maximal Information Content  279
          14.5.2  Information-Theoretic Approach  280
          14.5.3  Genetic Programming Based Clustering  282
          14.5.4  Calculating Membership  283
          14.5.5  Application to a Synthetic Library  284
          14.5.6  Experimental Dataset  285
    14.6  The Simplex Sample Space and Statistical Analysis of Compositional Data  286
          14.6.1  The Closure Effects—Induced Correlation  288
          14.6.2  Illustrative Example  289
          14.6.3  Sub-Compositional Coherence  290
          14.6.4  Principled Analysis of Compositional Data  290
          14.6.5  Composition Spread and Distances  292
          14.6.6  Interpolation of Compositional Data: Composition Profiles from Sputtering  294
    14.7  Summary and Conclusions  296
    References  297

Index  301

Contributors

Raghav Aggarwal  Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
F.J. Alexander  CCS Division, Los Alamos National Laboratory, Los Alamos, USA
Prasanna V. Balachandran  Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA
K. Barros  Theoretical Division, T-1, Los Alamos National Laboratory, Los Alamos, USA
E. BenNaim  Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA
Nicole A. Benedek  Department of Materials Science and Engineering, Cornell University, Ithaca, USA
V. Botu  Department of Chemical and Biomolecular Engineering, University of Connecticut, Storrs, CT, USA
Scott R. Broderick  Department of Materials Design and Innovation, University at Buffalo—The State University of New York, Buffalo, NY, USA
S. Chakrabarty  Department of Physics, Indian Institute of Science, Bangalore, India
Wai-Yim Ching  Curators Professor of Physics, Kansas City, MO, USA
Lori A. Dalton  The Ohio State University, Columbus, OH, USA
M.J. Demkowicz  Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Edward R. Dougherty  Texas A&M University, College Station, TX, USA
Peter I. Frazier  School of Operations Research & Information Engineering, Cornell University, Ithaca, NY, USA
John M. Gregoire  Joint Center for Artificial Photosynthesis, California Institute of Technology, Pasadena, CA, USA
J.E. Gubernatis  Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA
J. Hogden  CCS Division, CCS-3, Los Alamos National Laboratory, Los Alamos, USA
Dandan Hu  Washington University in St. Louis, St. Louis, MO, USA
Chandrika Kamath  Lawrence Livermore National Laboratory, Livermore, CA, USA
T. Lookman  Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA
Y.M. Marzouk  Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA, USA
Nicholas A. Mauro  North Central College, Naperville, IL, USA
A.B. Mhadeshwar  Center for Clean Energy and Engineering, University of Connecticut, Storrs, CT, USA; Present Address: ExxonMobil Research and Engineering, Annandale, NJ, USA
Z. Nussinov  Washington University in St. Louis, St. Louis, MO, USA; Department of Condensed Matter Physics, Weizmann Institute of Science, Rehovot, Israel
Meyer Z. Pesenson  Joint Center for Artificial Photosynthesis, California Institute of Technology, Pasadena, CA, USA
G. Pilania  Materials Science Division, Los Alamos National Laboratory, Los Alamos, USA
Robert C. Pullar  Departamento de Engenharia de Materiais e Cerâmica/CICECO—Aveiro Institute of Materials, Universidade de Aveiro, Campus Universitário de Santiago, Aveiro, Portugal
Krishna Rajan  Department of Materials Design and Innovation, University at Buffalo—The State University of New York, Buffalo, NY, USA
R. Ramprasad  Institute of Materials Science, University of Connecticut, Storrs, CT, USA; Department of Materials Science and Engineering, University of Connecticut, Storrs, CT, USA
James M. Rondinelli  Department of Materials Science and Engineering, Northwestern University, Evanston, USA
P. Ronhovde  Findlay University, Findlay, OH, USA
Kisor K. Sahu  School of Minerals, Metallurgical and Materials Engineering, Indian Institute of Technology, Bhubaneswar, India
Atsuto Seko  Department of Materials Science and Engineering, Kyoto University, Kyoto, Japan
T. Shearman  Program in Applied Mathematics, University of Arizona, Tucson, USA
S.L. Suib  Department of Chemistry, University of Connecticut, Storrs, CT, USA; Institute of Materials Science, University of Connecticut, Storrs, CT, USA
Bo Sun  Washington University in St. Louis, St. Louis, MO, USA
Santosh K. Suram  Joint Center for Artificial Photosynthesis, California Institute of Technology, Pasadena, CA, USA
Isao Tanaka  Department of Materials Science and Engineering, Kyoto University, Kyoto, Japan
J. Theiler  ISR Division, Los Alamos National Laboratory, Los Alamos, USA
Jialei Wang  School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA
D. Xue  Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA

Part I  Data Analytics and Optimal Learning

Chapter 1
A Perspective on Materials Informatics: State-of-the-Art and Challenges
T. Lookman, P.V. Balachandran, D. Xue, G. Pilania, T. Shearman, J. Theiler, J.E. Gubernatis, J. Hogden, K. Barros, E. BenNaim and F.J. Alexander

Abstract  We review how classification and regression methods have been used on materials problems and outline a design loop that serves as a basis for adaptively finding materials with targeted properties.

T. Lookman (B) · P.V. Balachandran · D. Xue · J.E. Gubernatis · E. BenNaim  Theoretical Division, T-4, Los Alamos National Laboratory, Los Alamos 87545, USA; e-mail: txl@lanl.gov
P.V. Balachandran  e-mail: pbalachandran@lanl.gov
D. Xue  e-mail: xdz@lanl.gov
J.E. Gubernatis  e-mail: jg@lanl.gov
E. BenNaim  e-mail: ebn@lanl.gov
G. Pilania  Materials Science Division, MST-8, Los Alamos National Laboratory, Los Alamos 87545, USA; e-mail: gpilania@lanl.gov
T. Shearman  Program in Applied Mathematics, University of Arizona, Tucson 85721, USA; e-mail: toby.shearman@gmail.com
J. Theiler  ISR Division, Los Alamos National Laboratory, Los Alamos 87545, USA; e-mail: jt@lanl.gov
J. Hogden  CCS Division, CCS-3, Los Alamos National Laboratory, Los Alamos 87545, USA; e-mail: hogden@lanl.gov
K. Barros  Theoretical Division, T-1, Los Alamos National Laboratory, Los Alamos 87545, USA; e-mail: kbarros@lanl.gov
F.J. Alexander  CCS Division, Los Alamos National Laboratory, Los Alamos 87545, USA; fja@lanl.gov
© Springer International Publishing Switzerland 2016. T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_1

1.1 Introduction

There has been considerable interest over the last few years in accelerating the process of materials design and discovery.
The Materials Genome Initiative (MGI) [1], Integrated Computational Materials Engineering (ICME) [2] and Advanced Manufacturing [3] initiatives have spurred considerable activity and brought new researchers into the nascent field of materials informatics, which includes the accelerated design and discovery of new materials. The activity has also highlighted some of the open questions in this emerging area, and our objective here is to provide a perspective of the field in terms of general problems and information science methods that have been used to study classes of materials, and to point to some of the outstanding challenges that need to be addressed. We are guided here by our own recent work at the Los Alamos National Laboratory (LANL).

One of the earliest-studied problems in modern materials informatics relates to the classification of AB solids into their stable crystal structures, based on key attributes of the chemistry and properties of the individual A and B constituents. The emphasis was on finding features that can give rise to easily visualized two-dimensional structural maps by "drawing" boundaries between classes. The problem was first studied in the 1960s [4], but Chelikowsky and Phillips [5], studying the same problem in 1978, recognized the connections to information science. Realizing that energy differences between structures were rather small, they observed that "from the point of view of information theory, …the available structural data already contain a great deal of information: about 120 bits, in the case of the AB octet compounds. Thus one can reverse the problem, and attempt to extract from the available data quantitative rules for chemical bonding in solids." They realized that suitable combinations of orbital radii of the individual A and B atoms were appropriate features for predicting the crystal structure of the AB solids. Over the last few years, this problem has been revisited with a variety of machine learning methods (decision trees, support vector machines, gradient boosting, etc.) [6–8], and there have been a number of studies that have classified different materials classes, such as perovskites [9]. Feature selection from data remains a fundamental exercise, and here principal component analysis and correlation maps have been widely employed. Recently, high-throughput approaches have been utilized to form combinations of features from a given set, and then certain key combinations are down-selected [6].
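As a minimal sketch of this kind of feature construction and down-selection (not the specific procedure of [6]; the primary-feature names and the data below are hypothetical placeholders), one can form simple algebraic combinations of primary features and rank them by their correlation with the target property:

```python
# Sketch: build candidate feature combinations from primary features and
# down-select them by correlation with the target property. Illustrative only;
# the data here are random placeholders, not real orbital radii.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 60
primary = {  # hypothetical primary features (e.g., orbital radii of A and B atoms)
    "rs_A": rng.uniform(0.5, 2.0, n),
    "rp_A": rng.uniform(0.5, 2.0, n),
    "rs_B": rng.uniform(0.5, 2.0, n),
    "rp_B": rng.uniform(0.5, 2.0, n),
}
y = primary["rs_A"] - primary["rp_B"] + 0.1 * rng.standard_normal(n)  # toy target

candidates = dict(primary)
for (na, xa), (nb, xb) in itertools.combinations(primary.items(), 2):
    candidates[f"{na}-{nb}"] = xa - xb            # differences
    candidates[f"{na}/{nb}"] = xa / xb            # ratios
    candidates[f"|{na}-{nb}|"] = np.abs(xa - xb)  # absolute differences

# Rank candidate features by absolute Pearson correlation with the target.
ranked = sorted(candidates.items(),
                key=lambda kv: abs(np.corrcoef(kv[1], y)[0, 1]),
                reverse=True)
for name, _ in ranked[:5]:
    print(name)
```

In practice the candidate pool and the ranking criterion would be tailored to the materials class and property of interest.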
The problem of materials design is about predicting the composition and processing of materials with a desired target property, and therefore involves regression that leads to an inference model from training data. For example, for ferroelectrics one may wish to discover lead-based or lead-free piezoelectrics with a high transition temperature or high piezoelectric coefficient. For shape memory alloys, one may seek compounds with reduced dissipation or low hysteresis. Typically, such materials are found in an Edisonian fashion using intuition and time-consuming trial and error. In recent years, theory has become powerful enough to predict very accurately some material characteristics; for example, ab initio calculations predict elastic constants, inter-atomic distances, crystal structure, polarization, etc. However, the parameter space is just too large and there are too many possibilities, and even if nature rules out many of the possible combinations, the numbers are still staggering. Moreover, physical and chemical constraints make the realization of many theoretically possible materials impossible. Thus, one needs to successively improve or learn, from the available data, candidate materials for further experiments and calculations. Recently, a number of studies have utilized regression methods to predict materials with given properties. However, most research in materials design has been based on high-throughput approaches using electronic structure calculations. Typically, a large database is assembled with calculated properties and this is successively screened for materials with desired properties. High-throughput experiments have also been undertaken more recently to screen for candidate materials for further experiments [10, 11]. When it comes to multicomponent alloys or solid solutions, these methods have limitations. Moreover, very few studies have combined statistical inference with the high-throughput approach.

1.2 Statistical Inference and Design: Towards Accelerated Materials Discovery

Figure 1.1 illustrates our vision for the overall materials informatics/design problem. This shows a feedback loop that starts with the available assembled data (box 5), which may be obtained from multiple sources, including experiments or calculations. Materials knowledge (box 1) is then key in selecting the features and prescribing the constraints amongst them. Our aim is to train a statistical inference model that estimates the property (regression) or classification label with associated uncertainties (box 2). Classification models answer categorical questions: Is a compound stable? Is it a piezoelectric? What is its crystal symmetry? Regression models produce numerical estimates: What is the material's piezoelectric coefficient? What is its transition temperature? Because there usually is a limited quantity of training data, and because the space of possibilities is so high-dimensional, incorporation of domain knowledge is of potentially great value. Here, explicitly Bayesian approaches, in which this knowledge is encoded in prior probability distributions, and more traditional machine learning algorithms (such as support vector machines), in which the domain knowledge can be incorporated as constraints or folded into the kernel design, become important [12].

Fig. 1.1  Statistical inference and design: a feedback loop to find a material with a desired targeted property. Prior or domain knowledge, including features, provides input to an inference model that predicts a label or a property with uncertainty. An experimental design or decision-making module balances the trade-off between exploiting information and further exploring the high-dimensional search space where the desired material may be found. A material is suggested for experimentation or calculation and the process repeats itself, incorporating updated information

Much existing work is essentially based on going from box 1 to box 4 in Fig. 1.1. Cases in point are projects such as the Materials Project [13] and AFLOWLIB [14], which are focused on establishing databases using electronic structure calculations to make predictions. However, there are a few studies that use inference to make predictions.
Examples include predictions of melting temperature [7, 8, 15] or piezoelectrics with high transition temperatures [16]. The search for piezoelectrics serves as a good example to contrast the two approaches. Extensive ab initio calculations were performed on a chemical space represented by 63² = 3969 possible perovskite ABO3 (up to Bi, but excluding a few such as H and the inert gases) end structures [17]. The number of possibilities was filtered down to 49 by discarding compounds that are nonmetallic or whose structures have small energy barriers to distortions across the morphotropic phase boundary (MPB), according to preset values. Almost no optimization or learning tools are used other than what may be involved in seeking an optimal minimum energy solution at zero temperature. All the physics is contained in this first-principles calculation, and we are not aware that any of this group's predictions of piezoelectricity have been verified experimentally. On the other hand, the approach of Balachandran et al. [16] on the same type of problem was to focus on a given subclass of piezoelectrics (e.g. Bi-based) with known crystallographic and experimental data and use off-the-shelf inference tools to obtain candidates that have high transition temperatures and that are formable. The tools included principal component analysis (PCA) for dimensionality reduction, partial least squares (PLS) regression for predicting transition temperatures, and recursive partitioning (or decision trees) with a metric such as Shannon entropy for classification. The training data sets for PCA or regression studies were rather small (about 20 data points, 30 features), but data sets with 350 data points were also used to identify stable/formable perovskite compounds. Two new compounds were predicted, of which one has been synthesized [18], with the predicted transition temperature differing by 30–40 %. However, a key element lacking is the issue of uncertainties in predictions.

In Fig. 1.2, we demonstrate with an example where we have used bootstrap methods (i.e. sampling with replacement) to estimate prediction uncertainties. Here, we took the same Bi-based piezoelectrics data set as that utilized in the work of Balachandran et al. [16]. We generated a large number of bootstrapped samples (as opposed to using just one in the earlier work of Balachandran et al.) and utilized support vector regression (SVR) for predicting the Curie temperature (TC). Our results with uncertainties are shown in Fig. 1.2. On average, we obtained a standard deviation of 37 °C from the mean value of predicted TC. More importantly, we also predicted the TC for two new compounds, BiLuO3-PbTiO3 and BiTmO3-PbTiO3, to be 552.5 ± 79 and 564.2 ± 97 °C, respectively, with 95 % confidence. Experimentally, TC for BiLuO3-PbTiO3 was measured as 565 °C [18], in close agreement with the current results from SVR. On the other hand, PLS predicted the TC for BiLuO3-PbTiO3 to be 705 °C. The merit of this example is that it shows, in a rather modest manner, that the informatics approach, even if manual and piecemeal, is potentially capable of predicting new materials.

Fig. 1.2  Predictions using support vector regression (SVR) with uncertainties from the bootstrap method. The piezoelectric data set of Bi-based PbTiO3 solid solutions was used for machine learning. TC (in °C, y-axis) is the predicted ferroelectric Curie temperature at the morphotropic phase boundary (MPB). We use the SVR model and predicted TC for two new compounds, BiLuO3-PbTiO3 and BiTmO3-PbTiO3, to be 552.5 ± 79 and 564.2 ± 97 °C, respectively. Experimentally, TC for BiLuO3-PbTiO3 was measured as 565 °C [18]
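A minimal sketch of the bootstrap-plus-SVR procedure described above is given below; the training data are synthetic placeholders rather than the actual Bi-based piezoelectric data set, and the SVR hyperparameters are illustrative assumptions rather than tuned values:

```python
# Sketch of bootstrap-ensemble uncertainty estimation with support vector
# regression (SVR), in the spirit of the procedure described above.
# The data below are synthetic placeholders, not the piezoelectric dataset.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(20, 5))                 # 20 compounds, 5 features
y = 400 + 300 * X[:, 0] - 150 * X[:, 1] + 10 * rng.standard_normal(20)  # toy Tc
X_new = rng.uniform(0, 1, size=(2, 5))              # candidate compounds

n_boot = 500
preds = np.empty((n_boot, len(X_new)))
for b in range(n_boot):
    Xb, yb = resample(X, y, random_state=b)         # sample with replacement
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=1.0))
    model.fit(Xb, yb)
    preds[b] = model.predict(X_new)

mean, std = preds.mean(axis=0), preds.std(axis=0)
for m, s in zip(mean, std):
    print(f"predicted Tc = {m:.1f} +/- {1.96 * s:.1f} (approx. 95% interval)")
```

The spread of the ensemble predictions over the bootstrap resamples provides the uncertainty estimate attached to each candidate compound.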
A key aspect of our design loop is the uncertainty associated with the properties predicted from inference (box 2). These uncertainties play a role in the adaptive experimental design (box 3), which suggests the next material to be chosen for further experiments or calculation (box 4) by balancing the tradeoffs between "exploration and exploitation". That is, at any given stage a number of samples may be predicted to have given properties with uncertainties. The tradeoff is between exploiting the results by choosing to perform the next experiment on the material predicted to have the largest property, or further improving the model by performing the experiment or calculation on a material where the predictions have the largest uncertainties. By choosing the latter, the uncertainty in the property is expected to decrease (given the model and statistics), the model will probably improve, and this will influence the results of the next iteration in the loop.
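One common way to encode this exploitation-exploration tradeoff is the expected-improvement criterion used in EGO-type strategies; the sketch below is illustrative only, with a Gaussian-process surrogate and a randomly generated candidate set standing in for a real materials library:

```python
# Sketch: selecting the next experiment by expected improvement (EI), which
# balances exploiting high predicted values against exploring candidates with
# high predictive uncertainty. Surrogate and candidate set are placeholders.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1, size=(8, 3))              # measured materials
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1]   # measured property (toy)
X_cand = rng.uniform(0, 1, size=(200, 3))             # unmeasured candidates

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=0.3),
                              alpha=1e-4, normalize_y=True)
gp.fit(X_train, y_train)
mu, sigma = gp.predict(X_cand, return_std=True)

best = y_train.max()                                  # best property measured so far
z = (mu - best) / np.maximum(sigma, 1e-12)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement

next_idx = int(np.argmax(ei))
print("next material to measure:", next_idx, "EI =", ei[next_idx])
```

In practice the surrogate model, the acquisition criterion, and the candidate set would all be replaced by problem-specific choices.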
While there is a considerable literature on error estimation methodologies, accurate and reliable error estimation with limited data is harder than simple prediction, and there is an even stronger case for incorporating domain knowledge [8, 19, 20]. Extracting measures of confidence, while at the same time encoding prior knowledge, is not an easy task, but recent research in cancer genomics has demonstrated that increasing confidence in classification analysis built on small databases benefits significantly from using prior knowledge [21, 22]. Prior domain knowledge constrains statistical outcomes by producing classifiers that are superior to those designed from data alone. How to use prior knowledge in classification and regression is a problem not only for materials and cancer genomics but for machine learning generally. Developing ways of constructing and using prior domain knowledge will distinguish the materials machine learning approach to classification and regression. The lesson learned from high-throughput genomics concerning classification is that, in high-dimensional, small-sample settings, model-free classification is virtually impossible. The reason is that the salient property of any classifier is its error rate, because the error rate quantifies its predictive capacity, which is the essential issue pertaining to scientific validity. Since the error rate must be estimated, there must be an estimation procedure and, with small samples, this procedure must be applied to the same data as that used for designing the classifier. In cancer genomics, Dalton and Dougherty [19, 20] addressed the problem by formulating error estimation as an optimization problem in a model-based framework, which leads to a minimum-mean-square-error (MMSE) estimate of the classifier error. They formulate a prior probability distribution over a class of possible distributional models governing the features to be measured and the possible decisions to be made, each such model being known as a feature-label distribution. They then design a classifier from the data and an optimal MMSE error estimate is derived from the data. How well this approach will work for materials problems remains an open question.

In Figs. 1.3 and 1.4 we provide more details of our loop. Figure 1.3 shows how the loop would actually work in practice, and some of the algorithms that may be used as part of the statistical inference and design tools are shown in greater detail in Fig. 1.4. The green entries emphasize algorithms that can be utilized today and the red entries represent areas requiring further study and development. Design algorithms include well-known exploitation-exploration strategies such as efficient global optimization (EGO) [23], and the closely related knowledge gradient (KG) [24] based on single-step look-ahead.

Fig. 1.3  The design loop in practice showing different stages of machine learning and adaptive design strategies with an iterative feedback loop. For completeness, we have also included experiments (synthesis and characterization), which are vital for validation and feedback. KG, EGO and MOCU stand for knowledge gradient, efficient global optimization and mean objective cost of uncertainty, respectively

Fig. 1.4  A sub-component of our adaptive design loop showing the synergy between statistical models (box 2), experimental design (box 3) and validation (typically via experimental synthesis or simulation, as shown in box 4). Statistical models use the available data to fit a regression model (f) along with an uncertainty measure (e). The experimental design component then evaluates the tradeoff between exploitation and exploration and suggests the "best" material (yi) for validation. Here the term "best" need not correspond to a material with the optimal response; alternatively, it refers to the choice of a material that would reduce the overall uncertainty in our model. Different statistical learning (including Bayesian learning) and adaptive design methods are given

1.3 Progress and Concluding Remarks

Our work at LANL has involved studying a number of materials problems along the lines of the approach described. These include problems involving classification learning and regression, which essentially involve an inner loop of Fig. 1.1 with boxes 2, 4 and 5. We have examined the role of features in classifying AB octet solids [8] and perovskites [9], as well as predicting new ductile RM intermetallics, where R and M are rare earth and transition metal elements, respectively [25]. These studies have suggested new features that led to better classification as well as new materials. In the case of RM intermetallics, we have shown that machine learning methods naturally uncover functional forms that mimic the most frequently used features in the literature, thereby providing a mathematical basis for feature set construction without a priori assumptions [25]. Our classification models (Fig. 1.5), which use orbital radii as features, predicted that ScCo, ScIr, and YCd should be ductile, whereas each was previously proposed to be brittle.

Fig. 1.5  Classification learning using decision trees to predict whether a given RM intermetallic, where R and M are rare earth and transition metal elements, respectively, is brittle or ductile. (a) Decision tree that uses the orbital radii as features and (b) decision tree that uses the principal components (RM-PC2 and RM-PC4), which automatically extract features in the form of linear combinations of orbital radii. For example, RM-PC2 is defined as −0.70 r_p^M + 0.08 r_s^M − 0.71 r_d^M. Features r_p^M, r_s^M and r_d^M are the p-, s- and d-orbital radii of atom M, respectively
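A minimal sketch of this kind of classification learning is shown below; the feature values and brittle/ductile labels are synthetic placeholders rather than the actual RM intermetallic data, so the tree it prints is only illustrative of the form of Fig. 1.5:

```python
# Sketch of decision-tree classification on orbital-radii-type features,
# analogous in spirit to Fig. 1.5. Feature values and brittle/ductile labels
# are synthetic placeholders, not the RM intermetallic dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 100
X = rng.uniform(0.3, 2.0, size=(n, 3))             # columns: rs_M, rp_M, rd_M (toy)
labels = np.where(X[:, 0] + 0.5 * X[:, 1] > 1.8, "ductile", "brittle")

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(tree, X, labels, cv=5)    # rough accuracy estimate
tree.fit(X, labels)

print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
print(export_text(tree, feature_names=["rs_M", "rp_M", "rd_M"]))
```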
These results show it is possible to design targeted mechanical properties in intermetallic compounds, which has significant implications for next-generation multi-component alloy discovery. Our on-going work on multi-objective regression includes predicting functional polymers with large band gaps, as well as large dielectric constants, for energy storage applications. Similarly, we are also performing high-throughput density functional theory (DFT) calculations to generate large data sets, which are subsequently mined using machine learning methods to identify new and previously unexplored candidate water-splitting compounds for catalysis.

In the area of adaptive design, our focus has been on demonstrating the feedback loop of Figs. 1.1 or 1.3 with tight coupling to an "oracle", which can be experiments (synthesis and characterization) or calculations. Specific materials studies include discovering new low thermal dissipation shape memory alloys, as well as Pb-free piezoelectric solid solutions, starting from experimental data on specific multicomponent systems. The search spaces can be well defined; for example, they can be a factor of 10⁵ greater than the size of the training data. In addition, extensive databases from ab initio calculations become invaluable in benchmarking the various algorithms. For example, elastic moduli data for the hexagonal layered M₂AX phases consist of a library of 240 compounds. The ab initio data of the elastic constants and moduli were taken from the literature [26], with results well calibrated to experiments. In the M₂AX phases, X atoms reside in the edge-connected M octahedral cages and the A atoms reside in slightly larger right prisms [27]. These M₂AX phases represent a unique family of materials with layered crystal structure and both metallic- and ceramic-like properties. We used orbital radii of M, A, and X atoms from the Waber-Cromer scale [28] as features, which include the s-, p-, and d-orbital radii for M, while the s- and p-orbital radii were used for A and X atoms. With the M₂AX data, we benchmarked our adaptive design strategy, i.e. explored different training set sizes, regressors, regressor/optimization combinations, etc., and uncovered invaluable guidelines that were eventually useful for real materials design problems.

Implementing the loop using simulation codes allows us to optimize the use of these codes in seeking a well-defined set of parameters or constraints for given targeted outcomes. For example, an industry-standard code for simulating semiconducting materials is APSYS (Advanced Physical Models of Semiconductor Devices). It is based on 2D/3D finite element analysis of the electrical, optical and thermal properties of compound semiconductor devices, with silicon as a special case, with an emphasis on band structure engineering and quantum mechanical effects. Inclusion of various optical modules allows one to configure applications involving photosensitive or light emitting diodes (LEDs).
We have recently been using APSYS to investigate how to optimize the LED structure (number of quantum wells, indium concentration) of GaAs-based systems for the highest internal quantum efficiencies at high currents.

In summary, the use of classification and regression methods, in combination with optimization strategies, has the potential to impact discovery and design in materials science. What is needed is to establish how these tools perform on an array of materials classes with differing physics in order to distill some guiding principles for use by the materials community at large.

Acknowledgments  We acknowledge funding support from a Laboratory Directed Research and Development (LDRD) DR (#20140013DR) at the Los Alamos National Laboratory (LANL).

References

1. Materials Genome Initiative for Global Competitiveness (2011)
2. S.R. Kalidindi, M. De Graef, Materials data science: current status and future outlook. Ann. Rev. Mater. Res. 45(1), 171–193 (2015)
3. T.D. Wall, J.M. Corbett, C.W. Clegg, P.R. Jackson, R. Martin, Advanced manufacturing technology and work design: towards a theoretical framework. J. Organ. Behav. 11(3), 201–219 (1990)
4. E. Mooser, W.B. Pearson, On the crystal chemistry of normal valence compounds. Acta Crystallogr. 12, 1015–1022 (1959)
5. J.R. Chelikowsky, J.C. Phillips, Quantum-defect theory of heats of formation and structural transition energies of liquid and solid simple metal alloys and compounds. Phys. Rev. B 17, 2453–2477 (1978)
6. L.M. Ghiringhelli, J. Vybiral, S.V. Levchenko, C. Draxl, M. Scheffler, Big data of materials science: critical role of the descriptor. Phys. Rev. Lett. 114, 105503 (2015)
7. Y. Saad, D. Gao, T. Ngo, S. Bobbitt, J.R. Chelikowsky, W. Andreoni, Data mining for materials: computational experiments with AB compounds. Phys. Rev. B 85, 104104 (2012)
8. G. Pilania, J.E. Gubernatis, T. Lookman, Structure classification and melting temperature prediction of octet AB solids via machine learning. Phys. Rev. B 91, 124301 (2015)
9. G. Pilania, P.V. Balachandran, J.E. Gubernatis, T. Lookman, Predicting the formability of ABO3 perovskite solids: a machine learning study. Acta Crystallogr. B 71, 507–513 (2015)
10. S.M. Senkan, High-throughput screening of solid-state catalyst libraries. Nature 394(6691), 350–353 (1998)
11. H. Koinuma, I. Takeuchi, Combinatorial solid-state chemistry of inorganic materials. Nat. Mater. 3, 429–438 (2004)
12. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer, New York, 2008)
13. A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K.A. Persson, Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1(1) (2013)
14. S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, O. Levy, AFLOWLIB.ORG: a distributed materials property repository from high-throughput ab initio calculations. Comput. Mater. Sci. 58, 227–235 (2012)
15. A. Seko, T. Maekawa, K. Tsuda, I. Tanaka, Machine learning with systematic density-functional theory calculations: application to melting temperatures of single- and binary-component solids. Phys. Rev. B 89, 054303 (2014)
16. P.V. Balachandran, S.R. Broderick, K. Rajan, Identifying the inorganic gene for high-temperature piezoelectric perovskites through statistical learning. Proc. R. Soc. A: Math. Phys. Eng. Sci. 467(2132), 2271–2290 (2011)
17. R. Armiento, B. Kozinsky, M. Fornari, G. Ceder, Screening for high-performance piezoelectrics using high-throughput density functional theory. Phys. Rev. B 84, 014103 (2011)
18. W. Hu, Experimental search for high Curie temperature piezoelectric ceramics with combinatorial approaches. Ph.D. dissertation, Iowa State University (2011)
19. L.A. Dalton, E.R. Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework—Part I: discrete and Gaussian models. Pattern Recognit. 46(5), 1301–1314 (2013)
20. L.A. Dalton, E.R. Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework—Part II: properties and performance analysis. Pattern Recognit. 46(5), 1288–1300 (2013)
21. K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci, B.K. Mallick, Gene selection: a Bayesian variable selection approach. Bioinformatics 19(1), 90–97 (2003)
22. E.R. Dougherty, A. Zollanvari, U.M. Braga-Neto, The illusion of distribution-free small-sample classification in genomics. Curr. Genomics 12(5), 333–341 (2011)
23. D.R. Jones, M. Schonlau, W.J. Welch, Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)
24. W. Powell, I. Ryzhov, Optimal Learning, Wiley Series in Probability and Statistics (Wiley, Hoboken, 2013)
25. P.V. Balachandran, J. Theiler, J.M. Rondinelli, T. Lookman, Materials prediction via classification learning. Sci. Rep. 5, 13285 (2015)
26. M.F. Cover, O. Warschkow, M.M.M. Bilek, D.R. McKenzie, A comprehensive survey of M2AX phase elastic properties. J. Phys.: Condens. Matter 21(30), 305403 (2009)
27. M.W. Barsoum, M. Radovic, Elastic and mechanical properties of the MAX phases. Ann. Rev. Mater. Res. 41, 195–227 (2011)
28. J.T. Waber, D.T. Cromer, Orbital radii of atoms and ions. J. Chem. Phys. 42(12), 4116–4123 (1965)

Chapter 2
Information-Driven Experimental Design in Materials Science
R. Aggarwal, M.J. Demkowicz and Y.M. Marzouk

Abstract  Optimal experimental design (OED) aims to maximize the value of experiments and the data they produce. OED ensures efficient allocation of limited resources, especially when numerous repeated experiments cannot be performed. This chapter presents a fully Bayesian and decision theoretic approach to OED—accounting for uncertainties in models, model parameters, and experimental outcomes, and allowing optimality to be defined according to a range of possible experimental goals. We demonstrate this approach on two illustrative problems in materials research. The first example is a parameter inference problem. Its goal is to determine a substrate property from the behavior of a film deposited thereon. We design experiments to yield maximal information about the substrate property using only two measurements. The second example is a model selection problem. We design an experiment that optimally distinguishes between two models for helium trapping at interfaces. In both instances, we provide model-based justifications for why the selected experiments are optimal. Moreover, both examples illustrate the utility of reduced-order or surrogate models in optimal experimental design.

R. Aggarwal · M.J. Demkowicz (B)  Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA; e-mail: demkowicz@mit.edu
Y.M. Marzouk  Department of Aeronautics and Astronautics, Room 37-451, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA; e-mail: ymarz@mit.edu
© Springer International Publishing Switzerland 2016. T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_2

2.1 Introduction

Experiments are essential prerequisites of all scientific research. They are the basis for developing and refining mathematical models of physical reality. Experimental data are used to infer model parameters, to improve the accuracy of model-based predictions, to discriminate among competing models, to assess model validity, and
to improve design and decision-making under uncertainty. Yet experimental observations can be difficult, time-consuming, and expensive to acquire. Maximizing the value of experimental observations—i.e., designing experiments to be optimal by some appropriate measure—is therefore a critical task. Experimental design encompasses questions of where and when to measure, which variables to interrogate, and what experimental conditions to employ.

Conventional experimental design methods, such as factorial and composite designs, are largely used as heuristics for exploring the relationship between input factors and response variables. By contrast, optimal experimental design uses a concrete hypothesis—expressed as a quantitative model—to guide the choice of experiments for a particular purpose, such as parameter inference, prediction, or model discrimination. Optimal design has seen extensive development for linear models (where the measured quantities depend linearly on the model parameters) endowed with Gaussian distributions [5]. Extensions to nonlinear models are often based on linearization and Gaussian approximations [15, 21, 36], as analytical results are otherwise impractical or impossible to obtain. With advances in computational power, however, optimal experimental design for nonlinear systems can now be tackled directly using numerical simulation [48, 49, 64, 65, 84, 89, 93, 96].

This chapter will present an overview of model-based optimal experimental design, connecting this approach to illustrative applications in materials science—a field replete with potential applications for optimal experimentation. We will take a fully Bayesian and decision-theoretic approach. In this formulation, one first defines the utility of an experiment and then, taking into account uncertainties in both the parameter values and the observations, chooses experiments by maximizing an expected utility. We will define these utilities according to information theoretic considerations, reflecting the particular experimental goals at hand. The evaluation and optimization of information theoretic design criteria, in particular those that invoke complex physics-based models, requires the synthesis of several computational tools. These include: (1) statistical estimators of expected information gain; (2) efficient optimization methods for stochastic or noisy objectives (since expected utilities are typically evaluated with Monte Carlo methods); and (3) reduced-order or surrogate models that can accelerate the estimation of information gain. For a simple film-substrate system, we will present an example of such a reduced-order model, derived from physical scaling principles and an "offline" set of detailed/full model simulations. This is but one example; reduced-order models constructed through a variety of techniques have practical use in a wide range of optimal experimental design applications [48].
The rest of this chapter is organized as follows. Section 2.2 will present the foundational tools of optimal experimental design, beginning with Bayesian inference and proceeding to discuss several information theoretic design criteria. It will also discuss the computational challenges presented by this formulation. Section 2.3 will illustrate the information-driven approach with two examples: optimal design for parameter inference, in the context of a film-substrate system; and optimal design for model selection, in the context of heterophase interfaces in layered metal composites. Section 2.4 will discuss open questions and topics of ongoing research.

2.2 The Tools of Optimal Experimental Design

We will formulate our experimental design criteria in a Bayesian setting. Bayesian statistics offers a foundation for inference from noisy, indirect, and incomplete data; a mechanism for incorporating multiple heterogeneous sources of information; and a complete assessment of uncertainty in parameters, models, and predictions. The Bayesian approach also provides natural links to decision theory, which we will exploit below.

2.2.1 Bayesian Inference

The essence of the Bayesian paradigm is to describe uncertainty or lack of knowledge probabilistically. This idea applies to model parameters, to observations, and even to competing models. For simplicity, we first describe the case of parameter inference. Let θ ∈ Θ ⊆ R^n represent the parameters of a given model. We describe our state of knowledge about these parameters with a prior probability density p(θ). (For the remainder of this article, we assume that all parameter and data probability distributions have densities with respect to Lebesgue measure.) We would like to update our knowledge about θ by performing an experiment at conditions η ∈ H ⊆ R^d. η is therefore our vector of experimental design parameters. This experiment will yield observations y ∈ Y ⊆ R^m. The relationship between the model parameters, experimental conditions, and observations is captured by the likelihood function p(y|θ, η), i.e., the probability density of the observations given a particular choice of θ, η.

The likelihood naturally incorporates a physical model of the experiment. For instance, one often has a computational model G(θ, η) that predicts the quantity being measured by a proposed experiment. This prediction may be imperfect, and is almost always corrupted by some observational errors. A simple likelihood then results from the additive model y = G(θ, η) + ε, where ε is a random variable representing measurement and model errors. If ε is Gaussian with mean zero and variance σ², and independent of θ and η, then we have the Gaussian likelihood p(y|θ, η) ∼ N(G(θ, η), σ²). More complex likelihoods describe signal-dependent noise, or include more sophisticated representations of model error (e.g., the discrepancy models of [53]). Putting these ingredients together via Bayes' rule, we obtain the posterior probability density p(θ|y, η) of the parameters:

p(\theta \mid y, \eta) = \frac{p(y \mid \theta, \eta)\, p(\theta)}{p(y \mid \eta)},   (2.1)

where we have assumed (quite reasonably) that the prior knowledge on the parameters is independent of the experimental design. The posterior density describes the state of knowledge about the parameters θ after conditioning on the result of the experiment.
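To make the mechanics of (2.1) concrete, the following minimal sketch evaluates a posterior on a parameter grid for a hypothetical scalar model G(θ, η) with additive Gaussian noise; the forward model, noise level, and prior used here are illustrative assumptions, not quantities taken from this chapter.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical forward model G(theta, eta): chosen only to illustrate (2.1).
# theta is the unknown parameter; eta is the controllable experimental condition.
def G(theta, eta):
    return np.exp(-eta * theta)            # e.g., a decay observed at "time" eta

sigma = 0.05                               # assumed observation noise standard deviation
theta_grid = np.linspace(0.0, 5.0, 2001)   # discretized parameter space Theta
prior = norm.pdf(theta_grid, loc=2.0, scale=1.0)   # assumed prior p(theta)
prior /= np.trapz(prior, theta_grid)

# Simulate one experiment at design eta with a "true" parameter value
rng = np.random.default_rng(0)
theta_true, eta = 1.7, 1.0
y = G(theta_true, eta) + sigma * rng.standard_normal()

# Bayes' rule (2.1) on the grid: posterior is proportional to likelihood times prior
likelihood = norm.pdf(y, loc=G(theta_grid, eta), scale=sigma)
posterior = likelihood * prior
posterior /= np.trapz(posterior, theta_grid)       # normalization = evidence p(y|eta)

print("posterior mean of theta:", np.trapz(theta_grid * posterior, theta_grid))
```

Repeating this update for different designs η is the basic operation underlying the design criteria that follow.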
The design criteria described below will formalize the intuitive idea of choosing values of η to make the posterior distribution of θ as "informed" as possible.

Many problems, whether in materials science or other domains, do not have parameter inference as an end goal. Rather than learning about parameters that appear in a single fixed model of interest, one may wish to collect data that help discriminate among competing models. For instance, different hypothesized physical mechanisms may lead to different models of a phenomenon. In this context, the Bayesian approach involves characterizing a posterior probability distribution over models. Let the model space M consist of an enumerable number of competing models M_i, i ∈ {1, 2, . . .}. Let each model M_i be endowed with parameters θ_i ∈ Θ_i ⊆ R^{n_i}. Then Bayes' rule writes the posterior probability of a model M_i as:

P(M_i \mid y, \eta) = \frac{p(y \mid M_i, \eta)\, P(M_i)}{p(y \mid \eta)},   (2.2)

where the marginal likelihood of each model (i.e., p(y|M_i, η) for the ith model) is obtained by averaging the likelihood over the prior distribution on the model's parameters:

p(y \mid M_i, \eta) = \int_{\Theta_i} p(y \mid \theta_i, \eta, M_i)\, p(\theta_i \mid M_i)\, d\theta_i.   (2.3)

Each model has its own parameters θ_i and its own prior p(θ_i|M_i). The marginal likelihood incorporates an automatic Occam's razor that penalizes unnecessary model complexity [8, 66]. The effective use of the posterior distribution over models P(M_i|y, η) can then depend on the goals at hand. For instance, one may wish to know which model is best supported by the data; in this case, one simply selects the model with the highest posterior probability, thus performing Bayesian model selection. Alternatively, if the end goal is to make a prediction that accounts for model uncertainty, one can perform Bayesian model averaging [46] by taking a linear combination of predictions from each model, weighed according to the posterior model probabilities.

2.2.2 Information Theoretic Objectives

Following a decision theoretic approach, Lindley [63] suggests that an objective for experimental design should have the following general form:

U(\eta) = \int_{Y} \int_{\Theta} u(\eta, y, \theta)\, p(\theta, y \mid \eta)\, d\theta\, dy,   (2.4)

where u(η, y, θ) is a utility function and U(η) is the expected utility. The utility function u should be chosen to reflect the usefulness of an experiment at conditions η, given a particular value of the parameters θ and a particular outcome y. Since we do not know the precise value of θ and we cannot know the outcome of the experiment before it is performed, we obtain U by taking the expectation of u over the joint distribution of θ and y; hence the name 'expected' utility.

The choice of utility function u reflects the purpose of the experiment. To accommodate nonlinear models and avoid restrictive distributional assumptions on the parameters or model predictions, we advocate the use of utility functions that reflect the gain in Shannon information in quantities of interest [42]. For instance, if the object of the experiment is parameter inference, then a useful utility function is the relative entropy or Kullback-Leibler (KL) divergence from the posterior to the prior:

u(\eta, y, \theta) = u(\eta, y) = D_{\mathrm{KL}}\!\left( p(\theta \mid y, \eta) \,\|\, p(\theta) \right) = \int_{\Theta} p(\theta \mid y, \eta) \log \frac{p(\theta \mid y, \eta)}{p(\theta)}\, d\theta.   (2.5)

Taking the expectation of this quantity over the prior predictive of the data, as in (2.4), yields a U equal to the expected information gain in θ. This quantity is equivalent to the mutual information [26] between the data and the parameters, I(y; θ).
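For a linear-Gaussian model the utility (2.5) and its expectation are available in closed form, which provides a useful end-to-end check of the quantities defined above. The sketch below is our own illustration (the model y = ηθ + ε, the prior, and the noise level are assumptions, not quantities from this chapter): it averages the realized KL divergence over prior-predictive draws and compares the result with the closed-form expected information gain.

```python
import numpy as np

def kl_gauss(mu1, var1, mu0, var0):
    """KL( N(mu1, var1) || N(mu0, var0) ): the utility (2.5) for Gaussian densities."""
    return 0.5 * (np.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

# Illustrative linear-Gaussian experiment: y = eta*theta + eps,
# prior theta ~ N(mu0, tau0^2), noise eps ~ N(0, sigma^2)  (all values assumed).
mu0, tau0, sigma, eta = 0.0, 1.0, 0.1, 0.5

def posterior(y):
    """Conjugate Gaussian posterior after observing y at design eta."""
    var1 = 1.0 / (1.0 / tau0**2 + eta**2 / sigma**2)
    mu1 = var1 * (mu0 / tau0**2 + eta * y / sigma**2)
    return mu1, var1

# Expected information gain: average the realized KL over prior-predictive draws of y
rng = np.random.default_rng(1)
theta = rng.normal(mu0, tau0, size=20000)
y = eta * theta + rng.normal(0.0, sigma, size=theta.size)
gains = np.array([kl_gauss(*posterior(yi), mu0, tau0**2) for yi in y])

print("Monte Carlo expected information gain:", gains.mean())
print("closed form 0.5*log(1 + eta^2 tau0^2 / sigma^2):",
      0.5 * np.log(1.0 + eta**2 * tau0**2 / sigma**2))
```

Both numbers agree up to Monte Carlo error, and they equal the mutual information I(y; θ) for this design, as noted in the text.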
Inferring parameters may not be the true object of an experiment, however. For many experiments, the goal is to improve predictions of some quantity Q. This quantity may depend strongly on some model parameters and weakly on others. Moreover, some model parameters might simply be "knobs" without a strict physical interpretation or meaning. In this setting, we can put u(η, y, θ) = u(η, y) equal to the Kullback-Leibler divergence evaluated from the posterior predictive distribution, p(Q|y, η) = ∫ p(Q|θ) p(θ|y, η) dθ, to the prior predictive distribution, p(Q) = ∫ p(Q|θ) p(θ) dθ. Taking the expectation of this utility function over the data yields U(η) = I(Q; y|η), that is, the conditional mutual information between data and predictions. This quantity implicitly incorporates an information theoretic "forward" sensitivity analysis, as the experiments that are most informative about Q will automatically constrain the directions in the parameter space that strongly influence Q.

As mentioned above, another common experimental goal is model discrimination. From the Bayesian perspective, we wish to maximize the relative entropy between the posterior and prior distributions over models:

u(\eta, y) = \sum_{i} P(M_i \mid y, \eta) \log \frac{P(M_i \mid y, \eta)}{P(M_i)}.   (2.6)

Moving from this utility to an expected utility requires integrating over the prior predictive distribution of the data, as specified in (2.4). Since the utility function u here does not depend on the parameters θ, we simply have U(η) = ∫_Y u(η, y) p(y|η) dy. Because we are now considering multiple competing models, however, the prior predictive distribution is itself a mixture of the prior predictive distribution of each model:

p(y \mid \eta) = \sum_{i} P(M_i)\, p(y \mid M_i, \eta) = \sum_{i} P(M_i) \int_{\Theta_i} p(y \mid \theta_i, \eta, M_i)\, p(\theta_i \mid M_i)\, d\theta_i.   (2.7)

The resulting expected information gain in model space favors designs that are expected to focus the posterior distribution onto fewer models [75]. In more intuitive terms, we will be driven to test where we know the least and where we also expect to learn the most.

2.2.3 Computational Considerations

Evaluating expected information gain. Except in special cases (e.g., linear-Gaussian models), the expected utilities described above cannot be evaluated in closed form. Instead, the integrals in these expressions must be approximated numerically. Note that, even in the simplest case of parameter inference—with utility given by (2.5)—evaluating the posterior density of the parameters requires calculating the posterior normalizing constant, which (like the posterior distribution itself) is a function of the data y and the design parameters η. In this situation, it is convenient to rewrite the expected information gain in the parameters θ as follows:

U(\eta) = \int_{Y} \int_{\Theta} p(\theta \mid y, \eta) \log \frac{p(\theta \mid y, \eta)}{p(\theta)}\, d\theta\; p(y \mid \eta)\, dy
        = \int_{Y} \int_{\Theta} \log \frac{p(y \mid \theta, \eta)}{p(y \mid \eta)}\; p(y \mid \theta, \eta)\, p(\theta)\, d\theta\, dy
        = \int_{Y} \int_{\Theta} \left\{ \log p(y \mid \theta, \eta) - \log p(y \mid \eta) \right\} p(y \mid \theta, \eta)\, p(\theta)\, d\theta\, dy,   (2.8)

where the second equality is due to the application of Bayes' rule to the quantities both inside and outside the logarithm. Introducing Monte Carlo approximations of the evidence p(y|η) and the outer integrals, we obtain the nested Monte Carlo estimator proposed by Ryan [84]:

U(\eta) \approx \hat{U}_{N,M}(\eta) := \frac{1}{N} \sum_{i=1}^{N} \left\{ \log p\!\left(y^{(i)} \mid \theta^{(i)}, \eta\right) - \log\!\left( \frac{1}{M} \sum_{j=1}^{M} p\!\left(y^{(i)} \mid \tilde{\theta}^{(i,j)}, \eta\right) \right) \right\}.   (2.9)

Here {θ^(i)} and {θ̃^(i,j)}, i = 1 . . . N, j = 1 . . . M, are independent samples from the prior p(θ), and each y^(i) is an independent sample from the likelihood p(y|θ^(i), η), for i = 1 . . . N.
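The estimator (2.9) is straightforward to implement. The sketch below applies it to a toy nonlinear model with a standard normal prior; the forward model and noise level are illustrative assumptions, not quantities from this chapter, and a single shared set of inner prior samples is reused across the outer loop for brevity (the estimator as written in (2.9) draws fresh inner samples for each i).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Toy nonlinear forward model and noise level (illustrative assumptions)
def G(theta, eta):
    return np.sin(eta * theta) + 0.5 * theta**3 * eta**2

sigma = 0.1                      # observation noise standard deviation

def log_lik(y, theta, eta):      # log p(y | theta, eta) for the additive Gaussian model
    return norm.logpdf(y, loc=G(theta, eta), scale=sigma)

def eig_hat(eta, N=1000, M=1000, seed=0):
    """Nested Monte Carlo estimator of the expected information gain, cf. (2.9)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(N)                      # outer prior samples theta^(i)
    y = G(theta, eta) + sigma * rng.standard_normal(N)  # y^(i) ~ p(y | theta^(i), eta)
    theta_tilde = rng.standard_normal(M)                # inner prior samples (shared across i)
    # log of the inner evidence estimate: log[(1/M) sum_j p(y^(i) | theta_tilde^(j), eta)]
    log_evid = logsumexp(log_lik(y[:, None], theta_tilde[None, :], eta), axis=1) - np.log(M)
    return np.mean(log_lik(y, theta, eta) - log_evid)

# The noisy estimate can then be maximized over the design space, e.g. on a coarse grid
# or with a derivative-free stochastic optimizer.
for eta in (0.1, 0.5, 1.0, 2.0):
    print(f"eta = {eta:4.1f}   U_hat = {eig_hat(eta):.3f}")
```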
The variance of this estimator is approximately A(η)/N + B(η)/(N M), and its bias is (to leading order) C(η)/M [84], where A, B, and C are terms that depend only on the distributions at hand. The estimator Û N ,M is thus biased for finite M, but asymptotically unbiased. 2 Information-Driven Experimental Design in Materials Science 19 Analogous, though more complex, Monte Carlo estimators can be derived for the expected information gain in some predictions Q, or for the expected information gain in the model indicator Mi . Optimization approaches. Regardless of the particular utility function u used to define U , selecting an optimal experimental design requires solving an optimization problem of the form: max U (η). (2.10) η∈H Using the Monte Carlo approaches described above, only noisy estimates (e.g., Û N ,M ) of the objective function U are available. Hence, the optimal design problem becomes a stochastic optimization problem, typically over a continuous design space H. Many algorithms have been devised to solve continuous optimization problems with stochastic objectives. While some do not require the direct evaluation of gradients (e.g., Nelder-Mead [76], Kiefer-Wolfowitz [54], and simultaneous perturbation stochastic approximation [90]), other algorithms can use gradient evaluations to great advantage. Broadly, these algorithms involve either stochastic approximation (SA) [56] or sample average approximation (SAA) [87], where the latter approach must also invoke a gradient-based deterministic optimization algorithm. SA requires an unbiased estimator of the gradient of the objective, computed anew at each optimization iteration. SAA approaches, on the other hand, “freeze” the randomness in the objective and solve the resulting deterministic optimization problem, the solution of which yields an estimate of the solution of (2.10) [6]. Hybrids of the two approaches are possible as well. [49] presents a systematic comparison of SA and SAA approaches in the context of optimal experimental design, where SAA is coupled with a BFGS quasi-Newton method for deterministic optimization. An alternative approach to the optimization problem (2.10) involves constructing and optimizing Gaussian process models of U (η), again from noisy evaluations. As presented in [96], this approach generalizes the EGO (efficient global optimization) algorithm of [51] by choosing successive evaluation points η according to an expected quantile improvement criterion [80]. Surrogate models. An efficient optimization approach is only one part of the computational toolbox for optimal experimental design. Evaluating estimators such as Û N ,M (η) (2.9) for even a single value of η can be computationally taxing when the likelihood p(y|θ, η) contains a computationally intensive model G(θ, η)—a situation that occurs very often in physical systems, including in materials science. As a result, considerable effort has gone into the development of reduced-order or “surrogate” models, designed to serve as computationally inexpensive replacements for G. Useful surrogate models can take many different forms. [34] categorizes surrogates into three different classes: data-fit models, reduced-order models, and hierarchical models. Data-fit models are typically generated using interpolation or regression of the input-output relationship induced by the high-fidelity model G(θ, η), based on evaluations of G at selected input values (θ(i) , η (i) ). This class includes 20 R. Aggarwal et al. 
polynomial chaos expansions that are constructed non-intrusively [41, 57, 100] and, more broadly, interpolation or pseudospectral approximation with standard basis functions on (adaptive) sparse grids [24, 40, 101]. Gaussian process emulators [53, 99], widely used in the statistics community, fall into this category as well. Indeed, the systematic and efficient construction of data-fit surrogates, particularly for highdimensional input spaces, has been the focus of a vast body of work in computational mathematics and statistics over the past decade. While many of these methods are used in forward uncertainty propagation (e.g., the solution of PDEs with random input data), recent work [48] has employed sparse grid polynomial surrogates specifically for the case of optimal Bayesian experimental design. Reduced-order models are commonly derived using a projection framework; that is, the governing equations of the forward model are projected onto a subspace of reduced dimension. This reduced subspace is defined via a set of basis vectors, which, for general nonlinear problems, can be calculated via the proper orthogonal decomposition (POD) [47, 81, 88] or with reduced basis methods [43, 77]. For both approaches, the empirical basis is pre-constructed using full forward problem simulations or “snapshots.” Systematic projection-based model reduction schemes for parameter-dependent models have also seen extensive development in recent years [17, 22]. To our knowledge, such reduction schemes have not yet been used for optimal experimental design, but in principle they are directly applicable. Hierarchical surrogate models span a range of physics-based models of lower accuracy and reduced computational cost. Hierarchical surrogates are derived from higher-fidelity models using approaches such as simplifying physics assumptions, coarser grids, alternative basis expansions, and looser residual tolerances. These approaches may not be particularly systematic, in that their success and applicability are strongly problem-dependent, but they can be quite powerful in certain cases. One of the examples in the next section will use a reduced order model derived from a combination of simplifying physics assumptions and fits to simulation data from a higher-fidelity model. 2.3 Examples of Optimal Experimental Design In this section, we present two examples of Bayesian experimental design in materials-related applications. The first illustrates experimental design for parameter estimation in a simple substrate-film model. This example also demonstrates the usefulness of reduced-order models in accelerating the design process. The second example is concerned with experimental design for model selection. It will illustrate this process using competing models of impurity precipitation at heterophase interfaces. 2 Information-Driven Experimental Design in Materials Science 21 2.3.1 Film-Substrate Systems: Design for Parameter Inference A classical application of Bayesian methods to physical modeling involves inferring the properties of the interior of an object from observations of its surface, e.g., of the mantle or core of the Earth from observations at the Earth’s crust [16, 44]. In the context of materials science, similar problems arise when observing the surface of a material and trying to infer the subsurface properties. One example of such a problem involves observing a thin film deposited on a heterogeneous substrate. 
The heterogeneity of the substrate—e.g., in temperature [58], local chemistry [3], or topography [14]—induces some corresponding heterogeneity in the film—e.g., melting [58], condensation [3], or buckling [14]. The goal is to deduce information about the substrate from the behavior of the film. We have recently developed a convenient model for studying the inference of substrate properties from film behavior [2]. Figure 2.1 shows a film deposited on a substrate. Though the substrate is not directly observable, we would like to infer its properties from the behavior of the film deposited above. In the present example, we will use this simple model to demonstrate aspects of Bayesian experimental design. Our objective will be to choose experiments that provide maximal information about a parameter of interest for a fixed number of allowed experiments.

Fig. 2.1 A film deposited on top of a substrate. The substrate is not directly observable, but some of its properties may be inferred from the behavior of the film

2.3.1.1 Physical Background

In our model problem, the substrate is described by a non-uniform scalar field T(x, y) on a two-dimensional spatial domain, (x, y) ∈ Ω := [0, L_D] × [0, L_D]. In other words, T(x, y) describes the variation of the substrate property T over a square domain. Realizations of the substrate are random, and hence we model T(x, y) as a zero-mean Gaussian random field with a squared exponential covariance kernel [82]. One of the key parameters of this covariance kernel is the characteristic length scale ℓs, which describes the scale over which spatial variations in the random field occur. When ℓs is large, realizations of the substrate field have a relatively coarse structure, while smaller values of ℓs produce realizations with more fine-scale variation.

The film deposited on the substrate is a two-component mixture represented by an order parameter field c(x, y, t). The order parameter takes values in the range [−1, 1], where c = −1 and c = 1 represent single-component phases and c = 0 represents a uniformly mixed phase. The behavior of the film is modeled by the Cahn-Hilliard equation [18]:

\frac{\partial c}{\partial t} = \nabla^2\!\left( \frac{\partial g}{\partial c} - \varepsilon^2 \nabla^2 c \right),   (2.11)

where

g(c, T(x, y)) = \frac{c^4}{4} + T(x, y)\,\frac{c^2}{2}   (2.12)

is a substrate-dependent energy potential function. The two components of the film separate in regions of the substrate where T(x, y) < 0 and mix in regions where T(x, y) > 0. Hence, the substrate field can be thought of as a difference from some critical temperature, where temperatures above the critical value promote phase mixing while those below the critical value promote phase separation. The parameter ε in (2.11) governs the thickness of the interface between separated phases. Films with larger values of ε have thicker interfaces between their phase-separated regions than films with lower values of ε.

We model the time evolution of an initially uniform film c(x, y, t = 0) = 0 deposited on a substrate by solving the Cahn-Hilliard equation using Eyre's method for time discretization [35]. We find that the order parameter field c converges to a static configuration in the long time limit for any combination of ℓs and ε. A detailed description of the model implementation and analysis of the time-dependence of c is given in [2]. For the purpose of the example presented here, it suffices to know that the converged order parameter field has a characteristic length scale of its own, which we call ℓ∞.
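As a small aside, the substrate model itself is easy to reproduce. The sketch below (our own illustration) draws one realization of the zero-mean Gaussian random field T(x, y) with a squared exponential covariance on a coarse grid; the grid resolution and the specific kernel normalization are assumptions, since the chapter does not list them.

```python
import numpy as np

L_D, n, ell_s = 1.0, 40, 0.2     # domain size, grid points per side, substrate length scale

x = np.linspace(0.0, L_D, n)
X, Y = np.meshgrid(x, x, indexing="ij")
pts = np.column_stack([X.ravel(), Y.ravel()])        # all grid coordinates, shape (n*n, 2)

# Squared exponential covariance k(r) = exp(-r^2 / (2 ell_s^2)) between all grid points
r2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-r2 / (2.0 * ell_s**2)) + 1e-10 * np.eye(n * n)   # small jitter for stability

# Draw T ~ N(0, K); smaller ell_s produces finer-scale variation, as described in the text
rng = np.random.default_rng(0)
T = (np.linalg.cholesky(K) @ rng.standard_normal(n * n)).reshape(n, n)
print("substrate field shape:", T.shape, " sample std:", round(float(T.std()), 2))
```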
Figure 2.2 illustrates converged order parameter fields of films with two different values of ε (ε = 0.02 and ε = 0.04) deposited on substrates with two different values of ℓs (ℓs = 0.77 and ℓs = 0.13). For both substrates, we observe that increasing the value of ε increases the value of ℓ∞. Yet the behavior of the film on the two substrates is qualitatively different. For the substrate with ℓs = 0.77, the thickness of interfaces between phase-separated parts of the film is sufficiently small for fluctuations in c to be correlated with fluctuations in T. By contrast, no direct correlation of this sort exists for the substrate with ℓs = 0.13, because its characteristic length is smaller than the thickness of interfaces between phase-separated parts of the films in Fig. 2.2e, f. Instead, the fluctuations in c for these films reflect a local spatial average of T over a length scale that depends on ε.

Fig. 2.2 Substrate fields with (a) ℓs = 0.77 and (d) ℓs = 0.13. Plots b and c show converged order parameter distributions for films deposited on the substrate in a with ε = 0.02 and ε = 0.04 respectively. Similarly, plots e and f show converged order parameter distributions for films deposited on the substrate in d with ε = 0.02 and ε = 0.04 respectively. The converged length scale ℓ∞ is indicated for each film

The value of ε determines how ℓ∞ changes with ℓs. For example, in films with ε = 0.02, reducing the value of ℓs from 0.77 to 0.13 reduces ℓ∞ from 1.05 to 0.83. However, the opposite effect is observed for ε = 0.04, where reducing ℓs from 0.77 to 0.13 increases ℓ∞ from 1.35 to 2.93. These observations show that ℓ∞, ℓs, and ε are related, albeit in a non-trivial way.

Our goal is to infer the substrate length scale ℓs from the value of ℓ∞ of a film of known ε, deposited on the substrate. In this context, ℓ∞ is the data obtained from an experiment, ℓs is the value to be inferred, and ε is a parameter of the experiment that we control (e.g., by manipulating the chemical composition of the film). In previous work, we showed how to perform this inference and how to improve it by performing multiple measurements of ℓ∞ using films with different ε values [2]. In the experimental design problem described here, we would like to choose optimal values of ε that lead to the most efficient inference of ℓs.

For any given ℓs and ε, ℓ∞ may be obtained by solving the Cahn-Hilliard equation for the time evolution of the film on the substrate. This calculation does not call for extraordinary computational resources; indeed, it can be performed in roughly 100 s on a modern workstation. In Bayesian experimental design, however, this calculation would have to be carried out many millions of times. The potential computational effort of this approach is compounded by the stochasticity of T(x, y); to evaluate the likelihood function for any given value of ℓs, we must account for many possible substrate field realizations. Therefore, to make optimal experimental design tractable, we construct a "reduced order model" (ROM) relating ℓ∞, ℓs, and ε. We use a relation of the form

\ell_\infty \;=\; \overbrace{\underbrace{f(\varepsilon, \ell_s)}_{\text{deterministic term}} \;+\; \underbrace{\gamma(\varepsilon, \ell_s)}_{\text{random term}}}^{\text{reduced order model}}.   (2.13)

The deterministic term captures the average response of the film/substrate system, and the random term captures the inherent stochasticity of the film/substrate system and any systematic error in the deterministic term.
The stochasticity of the film/substrate system is due to the random nature of the substrate field and the initial condition of the Cahn-Hilliard equation, among other factors [2]. The proposed ROM can be simplified using the Buckingham Pi theorem [102]. Since ε, ℓs, and ℓ∞ all have dimensions of length, we can form two Pi groups: (ℓ∞/ℓs) and (ℓs/ε). The ROM may then be simplified to

\frac{\ell_\infty}{\ell_s} = F\!\left(\frac{\ell_s}{\varepsilon}\right) + \Gamma\!\left(\frac{\ell_s}{\varepsilon}\right).   (2.14)

To obtain the form of F(ℓs/ε) and Γ(ℓs/ε), we carried out multiple runs of the Cahn-Hilliard model, with values of ℓs sampled over [0.1, 1] and values of ε sampled over [0.01, 0.1]. Figure 2.3a plots ℓ∞/ℓs as a function of ℓs/ε, confirming that these quantities lie on a single curve, on average. However, there is a spread about this curve as well. This is caused by the stochastic nature of the relation between ℓ∞/ℓs and ℓs/ε, and justifies the random term in the ROM. The exact forms of F(ℓs/ε) and Γ(ℓs/ε) are then:

F\!\left(\frac{\ell_s}{\varepsilon}\right) = a + \frac{b}{\left(\ell_s/\varepsilon - 1\right)^{c}},   (2.15)

\Gamma\!\left(\frac{\ell_s}{\varepsilon}\right) \sim N\!\left(0,\; \sigma^2\!\left(\frac{\ell_s}{\varepsilon}\right)\right),   (2.16)

with parameters of the mean term F obtained by least squares fitting: a = 1.05, b = 79.51, c = 1.54. The dependence of σ² on (ℓs/ε) is captured nonparametrically using Gaussian process regression [82], as shown in Fig. 2.3b. Details of the derivation of the ROM can be found in [2].

Fig. 2.3 a A plot of ℓ∞/ℓs against ℓs/ε. b A plot of the non-stationary variance of the random term Γ(ℓs/ε)

To perform inference, we use the Cahn-Hilliard model as a proxy for a physical experiment. We generate multiple realizations of substrates with the same value of ℓs. Then, using each substrate as an input, we run the Cahn-Hilliard model, which also requires ε as a parameter. Given one or more choices for ε and the values of ℓ∞ thus obtained, we infer the value of ℓs using the ROM. Inference may be conducted using one or multiple (ℓ∞, ε) pairs. To infer ℓs in a Bayesian setting, we need to calculate the likelihood p(ℓ∞|ℓs, ε). This can be done using the ROM as follows:

p(\ell_\infty \mid \ell_s, \varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma(\ell_s/\varepsilon)} \exp\!\left( -\frac{\left( \ell_\infty/\ell_s - F(\ell_s/\varepsilon) \right)^2}{2\,\sigma^2(\ell_s/\varepsilon)} \right).   (2.17)

Since runs of the Cahn-Hilliard equation are conditionally independent given ℓs and ε, the likelihood for multiple (ℓ∞, ε) pairs can be found using the product rule

p\!\left(\ell_{\infty,1:n} \mid \ell_s, \varepsilon_{1:n}\right) = \prod_{i} p\!\left(\ell_{\infty,i} \mid \ell_s, \varepsilon_i\right).   (2.18)

Finally, the posterior density is calculated using Bayes' rule

p(\ell_s \mid \ell_{\infty,1:n}, \varepsilon_{1:n}) = \frac{p(\ell_{\infty,1:n} \mid \ell_s, \varepsilon_{1:n})\, p(\ell_s)}{\int p(\ell_{\infty,1:n} \mid \ell_s, \varepsilon_{1:n})\, p(\ell_s)\, d\ell_s}.   (2.19)

We use a truncated Jeffreys prior [50] for ℓs

p(\ell_s) \propto \ln(1/\ell_s), \quad \ell_s \in [0.1, 1].   (2.20)

The prior density is set to zero outside the range [0.1, 1]. This restriction is imposed for reasons of computational convenience and may easily be relaxed.

Fig. 2.4 a Posterior probability densities for different numbers of (ℓ∞, ε) pairs. With the inclusion of ever more data, uncertainty in the posterior on ℓs decreases steadily. b Posterior variance and error in posterior mean for different numbers of (ℓ∞, ε) pairs. Both error and variance decrease with increasing numbers of data points

The results of an iterative inference process that incorporates successive (ℓ∞, ε) pairs are shown in Fig. 2.4a. Here, the true value of the substrate length scale (i.e., the value used to generate the data) is ℓs = 0.4. Values of ε are selected by sampling uniformly in log-space over the interval [0.01, 0.1]. The probability density marked '0' (i.e., with zero data points) is the prior.
The posterior probability density with one data point (marked '1') is bimodal, but the bimodality of the posterior vanishes with two or more data points. As additional (ℓ∞, ε) pairs are introduced, the peak in the posterior moves towards the true value of ℓs = 0.4. Any number of point estimates of ℓs may be calculated from the posterior, such as the mean, median, or mode, but the posterior probability density itself gives a full characterization of the uncertainty in ℓs. As an example, we have plotted in Fig. 2.4b both the posterior variance (a measure of uncertainty) and the absolute difference between the posterior mean and the true value of ℓs (a measure of error) for different numbers of data points. As more data are used in the inference problem, both the posterior variance and the error in the posterior mean decrease. Note that the ultimate convergence of the posterior mean towards the true value of ℓs, as the number of data points approaches infinity, is a more subtle issue; it is related to the frequentist properties of this Bayesian estimator, here in the presence of model error. For a fuller discussion of this topic, see [2].

2.3.1.2 Bayesian Experimental Design

Thus far, we have described a model problem wherein the characteristic length scale ℓs of a substrate is inferred from the behavior of films with known values of ε, deposited on the substrate. In the preceding calculations, we chose ε randomly from a distribution. Since ε is in fact an experimental parameter that we can control, this choice is equivalent to performing experiments at random. Now we would like to consider a more focused experimental campaign, choosing values of ε to maximize the information gained with each experiment.

Fig. 2.5 Map of expected information gain U(ε1, ε2) in the substrate length scale parameter ℓs, as a function of experimental design parameters ε1 and ε2. The three experiments discussed in the text are marked with red squares

In the language of Sect. 2.2, we will take our utility function u to be the Kullback-Leibler (KL) divergence from the posterior to the prior (2.5). The expected utility (2.4) will represent expected information gain in the parameter ℓs. To connect the present problem to the general formulation of Sect. 2.2, note that ℓ∞ is the experimental data y, ℓs is the parameter θ to be inferred, and ε is the experimental parameter η over which we will optimize the expected utility. The expected KL divergence from posterior to prior is estimated via the Monte Carlo estimator in (2.9). To perform the calculation, we need to be able to sample ℓs^(i) and ℓ̃s^(i,j) from the prior p(ℓs), and ℓ∞^(i) from the likelihood p(ℓ∞|ℓs^(i), ε). The length scales ℓs can be sampled from the truncated Jeffreys prior using a standard inverse CDF transformation [83]. The observation ℓ∞ is Gaussian given ε and ℓs, and can be sampled by evaluating (2.14) with distributional information given in (2.15)–(2.16).

We will use this formulation to design an optimal experiment consisting of two measurements. In other words, two films with independently controlled values of ε will be deposited on substrates with the same value of ℓs, and the two values of ℓ∞ generated will be used for inference. The values of ε will be restricted to the design range [0.01, 0.095]. As before, this restriction is not essential and is easily relaxed.
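A map such as Fig. 2.5 can be approximated directly from the ROM with the nested Monte Carlo estimator (2.9). The sketch below is our own illustration of that calculation: it uses the fitted mean F from (2.15) but replaces the Gaussian-process model of σ(ℓs/ε) with an assumed constant spread, reuses one set of inner prior samples across the outer loop, and treats the ROM as a Gaussian distribution over ℓ∞ directly. It should therefore be read as a qualitative reconstruction rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

a, b, c = 1.05, 79.51, 1.54      # fitted parameters of the deterministic term F in (2.15)
sigma_rom = 0.5                  # assumed constant spread of the random term
                                 # (the chapter fits sigma(ell_s/eps) nonparametrically)

def F(x):                        # x = ell_s / eps
    return a + b / (x - 1.0) ** c

# Truncated Jeffreys prior p(ell_s) ~ ln(1/ell_s) on [0.1, 1], sampled by a gridded inverse CDF
grid = np.linspace(0.1, 1.0, 4001)
cdf = np.cumsum(np.log(1.0 / grid))
cdf /= cdf[-1]
def sample_prior(n, rng):
    return np.interp(rng.random(n), cdf, grid)

def log_lik(ell_inf, ell_s, eps):
    # ROM read as a distribution over ell_inf: mean ell_s*F(ell_s/eps), spread ell_s*sigma_rom
    return norm.logpdf(ell_inf, loc=ell_s * F(ell_s / eps), scale=ell_s * sigma_rom)

def eig(eps_pair, N=500, M=500, seed=0):
    """Nested Monte Carlo estimate of U(eps1, eps2) for a two-measurement design."""
    rng = np.random.default_rng(seed)
    ls = sample_prior(N, rng)            # outer prior samples of ell_s
    lt = sample_prior(M, rng)            # inner prior samples (shared across the outer loop)
    log_num = np.zeros(N)
    log_inner = np.zeros((N, M))
    for eps in eps_pair:                 # both measurements share the same substrate ell_s
        y = ls * F(ls / eps) + ls * sigma_rom * rng.standard_normal(N)
        log_num += log_lik(y, ls, eps)
        log_inner += log_lik(y[:, None], lt[None, :], eps)
    log_evid = logsumexp(log_inner, axis=1) - np.log(M)
    return float(np.mean(log_num - log_evid))

for pair in [(0.025, 0.025), (0.08, 0.08), (0.01, 0.095)]:
    print(pair, round(eig(pair), 2))
```

Scanning eps_pair over a grid of (ε1, ε2) values yields a map analogous to Fig. 2.5, with the caveat that the assumed constant σ changes the numerical values relative to the chapter's results.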
Figure 2.5 shows the resulting Monte Carlo estimates of expected information gain U(ε1, ε2). Because the ordering of the experiments is immaterial, the map of the expected information gain is symmetric about the ε1 = ε2 line, aside from Monte Carlo estimation error. We draw attention to three points marked by squares in Fig. 2.5. The first is at (ε1, ε2) = (0.025, 0.025), where U(ε1, ε2) = 0.49; it is near the minimum of the expected utility function. This point corresponds to the least useful pair of experiments. The second is at (ε1, ε2) = (0.01, 0.095), with U(ε1, ε2) = 2.9; it is the maximum of the expected utility map and is expected to yield the most informative experiments. The point (ε1, ε2) = (0.08, 0.08), where U(ε1, ε2) = 2.0, lies midway between these extremes: it is expected to be more informative than the first design but less informative than the second.

To illustrate how the three (ε1, ε2) pairs highlighted above yield different expected utilities, we carry out the corresponding inferences of ℓs following the procedure described in Sect. 2.3.1.1. To simulate each experiment, we fix ℓs and the desired value of ε, then generate a converged order parameter length scale ℓ∞ by generating a realization of the substrate and simulating the Cahn-Hilliard equation. Given the data ℓ∞,1 and ℓ∞,2 corresponding to (ε1, ε2), we evaluate the corresponding posterior density and calculate the actual KL divergence from posterior to prior, DKL( p(ℓs|ℓ∞,1:2, ε1:2) ‖ p(ℓs) ). The results of these three experiments are summarized in Fig. 2.6a. As expected, the second experiment, performed at (ε1, ε2) = (0.01, 0.095), is the most informative, and has a large information gain of DKL = 2.10 nats.1 The first experiment is the least informative, with a small information gain of DKL = 0.99 nats. The third experiment, with DKL = 1.44 nats, lies in between. The actual values of DKL are different from their expected values because the expected information gains are calculated by averaging over all possible prior values of ℓs and all possible experimental outcomes, whereas the actual values are calculated only for particular ℓs and ℓ∞ values, given ε. However, the values of DKL follow the same trend as their expectations.

Fig. 2.6 a Experiments corresponding to the three (ε1, ε2) pairs indicated in Fig. 2.5. The posterior densities from the three experiments are marked #1, #2, and #3. b ℓ∞ = ℓs F(ℓs/ε) versus ℓs for ε = 0.01, 0.025, 0.095

To better understand why these experiments produce different values of the information gain, Fig. 2.6b plots the ROM mean prediction ℓ∞ = ℓs F(ℓs/ε) as a function of ℓs for ε = 0.01, 0.025, and 0.095. We observe that ℓ∞ is not very sensitive to variations in ℓs for ε = 0.025. This explains why an experiment with (ε1, ε2) = (0.025, 0.025) is not particularly informative. On the other hand, ℓ∞ is sensitive to variations in ℓs for ε = 0.095 and ε = 0.01. Additionally, ℓ∞ is a decreasing function of ℓs for ε = 0.095, and an increasing function for ε = 0.01. The complementarity of these trends makes the experiment (ε1, ε2) = (0.01, 0.095) especially useful.

We can also compare the optimal experiment to the random experiments shown in Fig. 2.4. The information gained in the optimal experiment (DKL = 2.10 nats), with two values of ε, is comparable to the information gained from the experiment with eight randomly selected values of ε (DKL = 2.29 nats).
Hence by using optimal Bayesian experimental design in this example, we are able to reduce the experimental effort over a random strategy by roughly a factor of four! This reduction is especially valuable when experiments are difficult or expensive to conduct. 2.3.2 Heterophase Interfaces: design for model discrimination As noted in Sect. 2.2.1, experiments often yield data that may be explained by multiple models. Additional measurements may then be required to determine which of many possible models is best supported by the data. In such situations, it is desirable to determine which further experiments are likely to distinguish between alternative models most efficiently. Naturally, this guidance is needed before the additional work is actually carried out. Determining which experiments are most informative for distinguishing between alternative models is the goal of Bayesian experimental design for model selection [63, 75]. This capability is especially useful when the experiments are very resource-intensive and brute force data acquisition over a wide parameter range is not feasible. This section will illustrate Bayesian experimental design for model selection on an example taken from investigations of heterophase interfaces in layered metal composites. 1A nat is a unit of information, analogous to a bit, but with a natural logarithm rather than a base two logarithm in (2.5). 30 R. Aggarwal et al. Fig. 2.7 A Cu-Nb multilayer composite synthesized by PVD [31] 2.3.2.1 Physical Background Figure 2.7 shows a multilayer composite of two transition metals—copper (Cu) and niobium (Nb)—created by physical vapor deposition (PVD) [74]. In this synthesis technique, atoms impinge upon a flat substrate, adhere to it, and aggregate into crystalline layers. By alternating the elements being deposited—e.g., first Cu, then Nb, then Cu again, and so on—layered composites such as the one in Fig. 2.7 may be made. The thickness of each layer may be controlled by changing the total deposition time for each element. Many multilayer compositions besides Cu/Nb have been synthesized this way, including Cu/V (V: vanadium) [37, 105], Cu/Mo (Mo: molybdenum) [62], Ag/V (Ag: Silver) [97, 98], Al/Nb (Al: Aluminum) [37, 61, 62], Fe/W (Fe: Iron, W: Tungsten) [60], and others [11, 12, 86]. Layered composites are ideal for studying the properties of heterophase interfaces. In Fig. 2.7, each pair of adjacent Cu and Nb layers forms one Cu-Nb interface. The total amount of interface area per unit volume of the material may be changed by adjusting the thickness of the layers. For composites where all the individual layers have identical thickness l, the volume of material corresponding to interface area A is V = A × l. Thus, the interface area per unit volume is A/V = 1/l: as the layers are made thinner, A/V rises and the influence of interfaces on the physical properties of the composite as a whole increases. For l in the nanometer range, i.e., l  10 nm, interfaces dominate the behavior of the multilayer composite, leading to enhanced strength [73], resistance to radiation [72], and increased fatigue life [95]. In multilayer composites, all of these desirable properties are due to the influence 2 Information-Driven Experimental Design in Materials Science 31 of interfaces. Thus, considerable effort continues to be invested into elucidating the structure and properties of individual interfaces [10, 19, 71]. 
The present example will consider the relationship between the structure of metalmetal heterophase interfaces and trapping of helium (He) impurities. Implanted He is a major concern for the performance of materials in nuclear energy applications [92, 106]. Trapping and stable storage of these impurities at interfaces is one way of mitigating the deleterious effects of implanted He [32, 78]. The influence of interfaces on He may be clearly seen in multilayer composites. Experiments carried out on CuNb [29], Cu-Mo [62], and Cu-V [38] multilayers synthesized by PVD show that implanted He is preferentially trapped at the interfaces. Moreover, not all interfaces are equally effective at trapping He: the maximum concentration c of interfacial He—expressed as the number of He atoms per unit interface area—that may be trapped at an interface before detectable He precipitates form differs from interface to interface [32]. Figure 2.8 plots c for Cu-Nb, Cu-Mo, and Cu-V interfaces as a function of one parameter: the interface “misfit” m. Cu, Nb, Mo, and V all have cubic crystal structures: Cu is face-centered cubic (fcc) while Nb, Mo, and V are body-centered cubic (bcc). Thus, all three composites used in Fig. 2.8 are made up of alternating fcc (Cu) and bcc (Nb, Mo, or V) layers. The edge length of a single cubic unit cell in fcc or bcc crystals is the lattice parameter, afcc or abcc . The misfit m is defined as m = abcc /afcc . Intuitively, m measures the mismatch in inter-atomic spacing in the adjacent crystals that make up an interface. According to Fig. 2.8, the ability of interfaces to trap He, as measured by c, depends on the misfit: c = c(m). A simple model that may be proposed based on this data is that the relationship between c and m is linear: c = α0 + α1 m. Indeed, a linear fit represents the available data reasonably well. Its most apparent drawback Fig. 2.8 Maximum interfacial He impurity concentration, c, plotted as a function of misfit, m 32 R. Aggarwal et al. is that it predicts negative c values for m  0.83. Thus, a better model might be c = α|m − m 0 |. This model predicts that c drops to zero as m decreases to m 0 and begins to rise again as m is further reduced below m 0 . Physically, this model may be rationalized by stating that at some special value of misfit, m 0 , the atomic matching between adjacent layers is especially good, leading to few sites at the interface where He impurities may be trapped. As m departs from m 0 in either direction, the atomic matching becomes worse, giving rise to more He trapping sites and therefore higher c. The structure of fcc/bcc interfaces—including the degree of atomic matching— may be investigated in detail by constructing atomic-level models [30, 33]. Figure 2.9 shows such a model of the terminal atomic planes of Cu and Nb found at PVD Cu-Nb interfaces. The pattern of overlapping atoms from the two planes contains sites where a Cu atom is nearly on top of a Nb atom. Such sites are thought to be preferential locations for trapping of He impurities [52]. They arise from the geometrical interference of the overlapping atomic arrangements in the adjacent crystalline layers and, in that sense, may be viewed as analogous to features in a Moiré pattern. The density and distribution of He trapping sites of the type shown in Fig. 2.9 may be computed directly for any given fcc/bcc interface as a function of the geometry of the interface using the well-known O-lattice theory [13]. 
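For concreteness, the misfit values of the three composites plotted in Fig. 2.8 can be estimated from tabulated lattice parameters. The following short sketch is our own illustration; the nominal room-temperature lattice parameters used here are not given in the chapter and are assumed values.

```python
# Misfit m = a_bcc / a_fcc for the composites discussed in the text,
# using nominal room-temperature lattice parameters in Angstrom (assumed values).
a_fcc_Cu = 3.615
a_bcc = {"Nb": 3.301, "Mo": 3.147, "V": 3.030}

for metal, a in a_bcc.items():
    print(f"Cu-{metal}: m = a_bcc/a_fcc = {a / a_fcc_Cu:.3f}")
```

Under these assumed values, the three measured interfaces span m of roughly 0.84 to 0.91, so an additional experiment at appreciably lower or higher misfit probes a regime where the models introduced below are constrained only by extrapolation.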
In PVD Cu/Nb, Cu/Mo, and Cu/V composites, the relative orientation of the adjacent crystals is identical. Thus, differences in the geometry of Cu-Nb, Cu-Mo, and Cu-V interfaces arise solely from differences in the lattice parameters of the adjacent crystals, as described by the misfit parameter, m. The areal density of He trapping sites for these interfaces is therefore only a function of m and may be written as

f(m) = \frac{(4m - 3)\left(\sqrt{3}\, m - \sqrt{2}\right)}{2\sqrt{3}\, a_{\mathrm{Cu}}^{2}\, m^{2}}.   (2.21)

Fig. 2.9 Left A Cu-Nb bilayer. Right The terminal Cu and Nb planes that meet at the Cu-Nb interface. He trapping occurs at sites where a Cu atom is nearly on top of a Nb atom. One such site is shown with the dashed circle

Using this expression, we propose a second model for the dependence of c on m, namely: c = β f(m). Here, the proportionality constant β determines the number of He atoms that may be trapped at a single site of the type illustrated in Fig. 2.9. The best fit for this second model is plotted in Fig. 2.8. Both this model and the previously described linear model fit the available experimental data reasonably well. Moreover, both predict c values of zero for m ≈ 0.82–0.83. However, unlike the linear model, c = β f(m) predicts that c is also zero at m ≈ 0.75.

We wish to determine what additional experimental data will help distinguish between the two models described above. However, since measuring even a single value of c requires considerable resources, our goal is to limit the additional data to just one (c, m) pair. In an experiment, we may select m by choosing to synthesize an fcc/bcc multilayer composite of specific composition. In other words, we control m. However, we do not know c in advance. In this context, our goal is to determine what one value of m is most likely to distinguish between the two models, regardless of the c value actually found in the subsequent experiment. In the following section, we will apply Bayesian experimental design to address this challenge. In addition to the two models described above, we will also consider a third model encapsulating the hypothesis that c does not depend on m at all: c = γ = constant.

2.3.2.2 Bayesian Experimental Design for Model Selection

As described in Sect. 2.2, the goal of experimental design is to maximize the expectation of some measure of information. In the present example, we will maximize the expected KL divergence, as applied to model discrimination, described in (2.6) and (2.7). In this context, m is the experimental parameter η that we control; c is the observed data y; M1, M2, and M3 are the competing models; (α, m0) are the parameters θ1 of model M1; β is the parameter θ2 of model M2; and γ is the parameter θ3 of model M3. The expected KL divergence can be computed by combining (2.6) and (2.7); this requires knowing both the prior p(Mi) and the posterior p(Mi|c, m) for each model. We use a flat or "indifference" prior over models, p(Mi) = 1/3. The posterior model probabilities are calculated from Bayes' rule as given in (2.2). Evaluating Bayes' rule in this setting requires that we calculate the marginal likelihood for each model and proposed experiment, p(c|Mi, m), as shown in (2.3). We now detail this procedure.

The previous section identified three models connecting c and m. They are:

M_1: \quad c = \alpha\, |m - m_0| + \epsilon_1,   (2.22)
M_2: \quad c = \beta\, f(m) + \epsilon_2,   (2.23)
M_3: \quad c = \gamma + \epsilon_3,   (2.24)

where εi ∼ N(0, σε²).
In addition to specifying the functional form of each model, each expression above also contains an additive noise term εi. This term is a random variable that describes uncertainty in the measured c, i.e., due to the observational process itself. For simplicity, we assume the observational error variance σε² to be known. The model parameters α, m0, β, and γ are endowed with priors that reflect our state of knowledge after performing the three experiments shown in Fig. 2.8, before beginning the current experimental design problem. These priors are taken to be Gaussian. In other words, we suppose that they are the result of Bayesian linear regression with Gaussian priors or improper uniform priors; the posterior following the three previous experiments becomes the prior for the current experimental design problem. We denote the current prior means by ᾱ, m̄0, β̄, and γ̄, and the current prior standard deviations as σα, σm0, σβ, and σγ. Given these assumptions, we can express the probability density of the observation c for each parameterized model as:

p(c \mid m, \alpha, m_0, M_1) = \frac{1}{\sqrt{2\pi}\,\sigma_\epsilon} \exp\!\left( -\frac{\left(c - \alpha|m - m_0|\right)^2}{2\sigma_\epsilon^2} \right),   (2.25)

p(c \mid m, \beta, M_2) = \frac{1}{\sqrt{2\pi}\,\sigma_\epsilon} \exp\!\left( -\frac{\left(c - \beta f(m)\right)^2}{2\sigma_\epsilon^2} \right),   (2.26)

p(c \mid m, \gamma, M_3) = \frac{1}{\sqrt{2\pi}\,\sigma_\epsilon} \exp\!\left( -\frac{\left(c - \gamma\right)^2}{2\sigma_\epsilon^2} \right).   (2.27)

Each of these densities is normal with mean given by the model and variance σε². For fixed m and c, these densities can be viewed as the likelihood functions for the corresponding model parameters, i.e., α and m0 for model 1, β for model 2, and γ for model 3. To obtain the marginal likelihoods p(c|m, Mi), we marginalize out these parametric dependencies as follows:

p(c \mid m, M_1) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} p(c \mid m, \alpha, m_0, M_1)\, p(\alpha)\, p(m_0)\, d\alpha\, dm_0,   (2.28)

p(c \mid m, M_2) = \int_{-\infty}^{\infty} p(c \mid m, \beta, M_2)\, p(\beta)\, d\beta,   (2.29)

p(c \mid m, M_3) = \int_{-\infty}^{\infty} p(c \mid m, \gamma, M_3)\, p(\gamma)\, d\gamma.   (2.30)

Here, p(α), p(m0), p(β), and p(γ) denote the Gaussian prior probability densities described above, e.g.,

p(\alpha) = \frac{1}{\sqrt{2\pi}\,\sigma_\alpha} \exp\!\left( -\frac{(\alpha - \bar{\alpha})^2}{2\sigma_\alpha^2} \right), \quad \text{etc.}   (2.31)

In the expressions for p(c|m, Mi), integration over α, β, and γ can be performed analytically, e.g.,

p(c \mid m, M_2) = \frac{1}{\sqrt{2\pi\left( f(m)^2 \sigma_\beta^2 + \sigma_\epsilon^2 \right)}} \exp\!\left( -\frac{\left(c - \bar{\beta} f(m)\right)^2}{2\left( f(m)^2 \sigma_\beta^2 + \sigma_\epsilon^2 \right)} \right).   (2.32)

The integral over m0 in the expression for p(c|m, M1) must be found numerically, however. In the present example, this integral is easily computed using standard numerical quadrature. If the integral had been too high dimensional, however, then a Monte Carlo scheme might be used instead [83]. We carry out these calculations using prior parameters listed in Table 2.1. The experimental uncertainty was set to σε = 2.5/nm², following [29].

Table 2.1 Prior model parameters
Model   Parameter (prior mean)   Standard deviation
M1      ᾱ ≈ 94/nm²               σα ≈ 0.49/nm²
        m̄0 ≈ 0.83                σm0 ≈ 0.62
M2      β̄ ≈ 26/nm²               σβ ≈ 4.2/nm²
M3      γ̄ ≈ 4.5/nm²              σγ = 2.0/nm²

To calculate the expected information gain U in the model indicator, as a function of the m value for a single additional experiment, we first substitute the prior and posterior model probabilities calculated above into (2.6). Then we take the expectation of this utility over the prior predictive distribution, as in (2.7), by integrating over the data c. More explicitly, we calculate:

U(m) = \int u(m, c)\, p(c \mid m)\, dc,   (2.33)

where the utility is

u(m, c) = \sum_{i=1}^{3} P(M_i \mid c, m) \log \frac{P(M_i \mid c, m)}{P(M_i)},

and the design-dependent prior predictive probability density is

p(c \mid m) = \sum_{i=1}^{3} P(M_i)\, p(c \mid m, M_i).
The integral in (2.33) formally is taken over (−∞, ∞), since this is the range of the prior predictive. Negative values of c are not physical, of course, but they are exceedingly rare: the mean predictions of models 1 and 2 are necessarily positive, and the Gaussian prior on γ in model 3 is almost entirely supported above zero. The Gaussian measurement noise ε can also lead to negative c values, but it too has a relatively small variance.

Figure 2.10 plots U(m) computed using all three models. For comparison, the figure also shows U(m) found using only models 1 and 2, i.e., excluding the constant model c = γ. Values of m that maximize U(m) are the best choices for an experiment to distinguish between models.

Fig. 2.10 Expected information gain for model discrimination U(m)

When all three models are considered, U(m) is greatest for high misfit, i.e., m ≈ 0.95. By contrast, when only models 1 and 2 are considered, U(m) is least in the high m limit. The reason for this difference is clear from comparing Fig. 2.10 with Fig. 2.8: models 1 and 2 predict comparable c at high m while model 3 predicts a markedly lower c. Thus, when all three models are considered, the value of U(m) is high for m ≈ 0.95 because a measurement at that m value makes it possible to distinguish models 1 and 2 from model 3. By contrast, when only models 1 and 2 are considered, U(m) is least at high m because a measurement in that m range has limited value for distinguishing between models 1 and 2.

Putting aside model 3, U(m) predicts greatest utility for an experiment carried out in the range 0.74 < m < 0.84, i.e., in the vicinity of the minima of function f(m). To understand the reason for the significance of this m range, it is important to realize that the plots in Fig. 2.8 only show a single realization of models 1 and 2, namely those corresponding to α = ᾱ, m0 = m̄0, and β = β̄ (the prior means on the parameters). Since we assume that α, m0, and β are normally distributed, many other realizations of these models are possible. Figure 2.11 shows 100 different realizations of models 1 and 2 obtained by drawing α, m0, and β at random from their prior distributions.

Fig. 2.11 100 different realizations of models 1 (c = α|m − m0|) and 2 (c = β f(m)) obtained by drawing parameters α, m0, and β at random from their prior distributions. The thick lines plot the realizations of the models at the prior mean values of these parameters

Figure 2.11 makes clear an important distinction between models 1 and 2 that is not apparent in Fig. 2.8: model 1 exhibits an extreme sensitivity to its fitting parameters within the range of uncertainty of those parameters. In particular, the minimum in c predicted by model 1 may occur at many different m values. By contrast, model 2 is relatively less sensitive to its fitting parameters, especially for 0.74 < m < 0.84. Unlike model 1, the locations of its minima are fixed. Thus, measuring a low c value for 0.74 < m < 0.84 has the potential to exclude a large number of realizations of model 1, while measuring a high c value in that range essentially excludes model 2. Bayesian methods naturally capture this subtle aspect of experimental design without any special prior analysis of the competing models.
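The entire calculation behind Fig. 2.10 fits in a few dozen lines. The sketch below is our own reconstruction under stated assumptions: it uses the prior parameters of Table 2.1, σε = 2.5/nm², an assumed nominal Cu lattice parameter, and the form of f(m) as reconstructed in (2.21); the α integral in (2.28) is done analytically, the m0 integral by Gaussian-weighted quadrature, and U(m) of (2.33) is evaluated by quadrature over c. It is intended to reproduce the qualitative behavior described above rather than the chapter's exact numbers.

```python
import numpy as np
from scipy.stats import norm

a_Cu = 0.3615                    # assumed Cu lattice parameter in nm (not given in the chapter)
sig_eps = 2.5                    # observational noise sigma_eps, in 1/nm^2
abar, s_a = 94.0, 0.49           # model 1 prior: slope alpha
m0bar, s_m0 = 0.83, 0.62         # model 1 prior: zero-crossing m0
bbar, s_b = 26.0, 4.2            # model 2 prior: beta
gbar, s_g = 4.5, 2.0             # model 3 prior: gamma
P0 = np.array([1/3, 1/3, 1/3])   # indifference prior over models

def f(m):                        # areal trap-site density, as reconstructed in (2.21)
    return (4*m - 3) * (np.sqrt(3)*m - np.sqrt(2)) / (2*np.sqrt(3) * a_Cu**2 * m**2)

def marginal_likelihoods(c, m):
    """p(c | m, M_i), i = 1..3, for an array of c values; cf. (2.28)-(2.32)."""
    c = np.asarray(c, dtype=float)
    p2 = norm.pdf(c, loc=bbar * f(m), scale=np.sqrt(f(m)**2 * s_b**2 + sig_eps**2))
    p3 = norm.pdf(c, loc=gbar, scale=np.sqrt(s_g**2 + sig_eps**2))
    # Model 1: alpha integrates analytically; m0 handled by Gaussian-weighted quadrature
    m0 = np.linspace(m0bar - 4*s_m0, m0bar + 4*s_m0, 801)
    w = norm.pdf(m0, m0bar, s_m0)
    w /= w.sum()
    sc = np.sqrt((m - m0)**2 * s_a**2 + sig_eps**2)
    p1 = (norm.pdf(c[:, None], loc=abar * np.abs(m - m0), scale=sc) * w).sum(axis=1)
    return np.vstack([p1, p2, p3])

def U(m, c_grid=np.linspace(-30.0, 300.0, 3001)):
    """Expected information gain (2.33) in the model indicator, by quadrature over c."""
    lik = np.clip(marginal_likelihoods(c_grid, m), 1e-300, None)
    evid = (P0[:, None] * lik).sum(axis=0)            # prior predictive p(c|m)
    post = P0[:, None] * lik / evid                   # posterior model probabilities
    u = (post * np.log(post / P0[:, None])).sum(axis=0)
    return float(np.trapz(u * evid, c_grid))

for m in (0.75, 0.80, 0.85, 0.90, 0.95):
    print(f"m = {m:.2f}   U(m) = {U(m):.3f}")
```

The two-model variant discussed above corresponds to restricting P0 and the likelihood stack to M1 and M2 before evaluating U(m).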
2.4 Outlook The examples presented here demonstrate how the formalism of optimal Bayesian experimental design, coupled with information theoretic objectives, can be adapted to different questions of optimal data collection in materials science. In one example, we seek the best pair of experiments for inferring the parameters of a given model. In another example, we seek the single experiment that can best distinguish between competing models, where each model has a distinct form and a distinct set of uncertain parameters. Though simple, these examples also demonstrate the use of key computational and analytical tools, including the Monte Carlo estimation of expected information gain (in the nonlinear and non-Gaussian setting of our first example) and the use of reduced-order models (also in the first example). Reduced order models (ROMs) are increasingly being recognized as crucial to materials science, especially computational materials design [39, 70, 94, 104]. The reason for their utility is clear: strictly speaking, the complete set of degrees of freedom describing a material is the complete set of positions and types of all its constituent atoms. This set defines a design space far too vast to explore. Even if mesoscale entities such as crystal defects (e.g., dislocations [45] or interfaces [91]) or microstructure [1] are used to define the degrees of freedom for design, the resulting design space may nevertheless remain too vast to examine comprehensively. Therefore, it is crucial to identify only those degrees of freedom that significantly affect properties of interest (e.g., those affecting performance metrics in a design) and create a ROM to connect the two. Yet while formal methods of model order 38 R. Aggarwal et al. reduction are well established in many fields of science and engineering (as reviewed in Sect. 2.2.3), the automated and systematic construction of ROMs in materials contexts is in an early stage of development. In practice, most ROMs in materials science are constructed “manually.” The inherently collective and multiscale character of many materials-related phenomena calls for the development and validation of new methods of automatic model order reduction to address materials-specific challenges. Surrogate or reduced-order models are also essential to making Bayesian inference computationally tractable, particularly inference with computationally intensive physics-based models. Indeed, the past several years have seen a steady stream of developments in model reduction for Bayesian inference, mostly in the applied mathematics, computational science, and statistics communities. These include many types of prior-based ROMs [67–69], posterior-focused approximations [59] and projection-based reduced-order models [27], hierarchical surrogates [23], and numerous other approaches [25, 85]. The utility of ROMs is increasingly being recognized in materials-related inference problems as well. For instance, the model film-substrate problem described in Sect. 2.3.1 relies on a ROM to circumvent computationally expensive forward problem evaluations, thereby making rapid inference of substrate properties tractable [2]. Earlier examples of this approach in materials science problems include [28], which, starting with what is effectively a ROM for the energytemperature relation, inferred the melting point of Ti2 GaN. Data-fit surrogates constructed based on existing literature may also serve an analogous purpose to a ROM. 
Using such a surrogate, [103] modeled the creep rupture life of Ni-base superalloys. As with reduced-order modeling, the usefulness of Bayesian approaches is becoming better recognized within the materials community. They can be applied to parameter inference and model inference, as demonstrated here, but also to problems involving prediction under uncertainty. For example, [55] used Bayesian inference to assess the uncertainty of cluster expansion methods for computing the internal energies of alloys. These authors point out that cluster expansions are themselves a kind of surrogate model—i.e., a ROM—and that uncertainty quantification should, among other goals, assess how well the surrogate reproduces the output of a more computationally expensive reference model. Despite growing interest in Bayesian methods within the materials community, there are fewer examples of their application to experimental design. An early (yet very recent) effort is [4], which applies information-theoretic criteria and Bayesian methods to stress-strain response and texture evolution in polycrystalline solids. Nevertheless, opportunities for expanded application of optimal Bayesian experimental design abound in materials-related work. In particular, detailed and resource-intensive experiments such as those described in Sect. 2.3.2 are poised to benefit from it immensely. One potential hurdle to widespread adoption is the up-front investment of effort currently needed to understand and implement the associated mathematical formalism. Thus, expanded availability of user-friendly, well-documented, and multi-functional software [79] is likely to accelerate the adoption and integration of Bayesian experimental design into mainstream materials research. Finally, we emphasize that optimal experimental design itself—not limited to the materials science context—is the topic of much current research. This research focuses both on questions of formulation and on computational methodology. Examples of the latter include developing reduced-order or multi-fidelity models tailored to the needs of stochastic optimization, or devising more efficient estimators of expected information gain using importance sampling, high-dimensional kernel density estimators, and other approaches. An interesting foundational challenge, on the other hand, involves understanding and accounting for model error or misspecification in optimal design. If the model relating parameters of interest to experimental observables is incomplete or under-resolved, how useful—or close to optimal—are experiments designed according to this relationship? When a convergent sequence of models of differing fidelity is available (as in the ROM setting), this question is more tractable. But if all available models are inadequate, many questions remain open. One promising approach to this challenge uses nonparametric statistical models, perhaps formulated in a hierarchical Bayesian manner, to account for interactions and inputs missing from the current model of the experiment. Sequential experimental design is also useful in this context, as successive batches of experiments can help uncover unmodeled mismatch between a model and physical reality. Sequential experimental design is useful much more broadly as well.
Recall that in all the examples of this chapter, we designed a single batch of experiments all at once: even if the batch contained multiple experiments, we chose the design parameters before performing any of the experiments. Sequential design, in contrast, allows information from each experiment to influence the design of the next. The most widely used sequential approaches are greedy, in which one designs the next batch of experiments as if it were the final batch, using the current state of knowledge as the prior distribution, with design criteria similar to those used here. But greedy approaches are sub-optimal in general, as they do not account for the information to be gained from future experiments. An optimal approach can instead be obtained by formulating sequential experimental design as a problem of dynamic programming [7, 9, 20]. Making this dynamic programming approach computationally tractable, outside of specialized settings, remains a significant challenge.

References

1. B.L. Adams, S.R. Kalidindi, D.T. Fullwood, Microstructure Sensitive Design for Performance Optimization (Butterworth-Heinemann, Newton, 2012)
2. R. Aggarwal, M. Demkowicz, Y. Marzouk, Bayesian inference of substrate properties from film behavior. Model. Simul. Mater. Sci. Eng. 23, 015009 (2015)
3. J. Aizenberg, A. Black, G. Whitesides, Controlling local disorder in self-assembled monolayers by patterning the topography of their metallic supports. Nature 394, 868–871 (1998)
4. S. Atamturktur, J. Hegenderfer, B. Williams, C. Unal, Selection criterion based on an exploration-exploitation approach for optimal design of experiments. J. Eng. Mech. 141 (2014)
5. A.C. Atkinson, A.N. Donev, Optimum Experimental Designs, Oxford Statistical Science Series (Oxford University Press, Oxford, 1992)
6. G. Bayraksan, D.P. Morton, Assessing solution quality in stochastic programs via sampling. INFORMS Tutor. Oper. Res. 5, 102–122 (2009)
7. I. Ben-Gal, M. Caramanis, Sequential DOE via dynamic programming. IIE Trans. 34, 1087–1100 (2002)
8. J. Berger, L. Pericchi, Objective Bayesian methods for model selection: introduction and comparison, in Model Selection, IMS Lecture Notes—Monograph Series, ed. by P. Lahiri (2001), pp. 135–207
9. D.P. Bertsekas, Dynamic Programming and Optimal Control, 3rd edn. (Athena Scientific, Belmont, 2007)
10. I. Beyerlein, M. Demkowicz, A. Misra, B. Uberuaga, Defect-interface interactions. Progr. Mater. Sci. (2015)
11. D. Bhattacharyya, N. Mara, P. Dickerson, R. Hoagland, A. Misra, Transmission electron microscopy study of the deformation behavior of Cu/Nb and Cu/Ni nanoscale multilayers during nanoindentation. J. Mater. Res. 24, 1291–1302 (2009)
12. D. Bhattacharyya, N. Mara, P. Dickerson, R. Hoagland, A. Misra, Compressive flow behavior of Al-TiN multilayers at nanometer scale layer thickness. Acta Mater. 59, 3804–3816 (2011)
13. W. Bollmann, Crystal Defects and Crystalline Interfaces (Springer, Berlin, 1970)
14. N. Bowden, S. Brittain, A. Evans, J. Hutchinson, G. Whitesides, Spontaneous formation of ordered structures in thin films of metals supported on an elastomeric polymer. Nature 393, 146–149 (1998)
15. G.E.P. Box, H.L. Lucas, Design of experiments in non-linear situations. Biometrika 46, 77–90 (1959)
16. T. Bui-Thanh, O. Ghattas, J. Martin, G. Stadler, A computational framework for infinite-dimensional Bayesian inverse problems part I: the linearized case, with application to global seismic inversion. SIAM J. Sci. Comput. 35, A2494–A2523 (2013)
17. T. Bui-Thanh, K.
Willcox, O. Ghattas, Model reduction for large-scale systems with high-dimensional parametric input space. SIAM J. Sci. Comput. 30, 3270–3288 (2008)
18. J. Cahn, J. Hilliard, Free energy of a nonuniform system. I. Interfacial free energy. J. Chem. Phys. 28, 258–267 (1958)
19. P.R. Cantwell, M. Tang, S.J. Dillon, J. Luo, G.S. Rohrer, M.P. Harmer, Grain boundary complexions. Acta Mater. 62, 1–48 (2014)
20. B.P. Carlin, J.B. Kadane, A.E. Gelfand, Approaches for optimal sequential decision analysis in clinical trials. Biometrics, pp. 964–975 (1998)
21. K. Chaloner, I. Verdinelli, Bayesian experimental design: a review. Stat. Sci. 10, 273–304 (1995)
22. S. Chaturantabut, D.C. Sorensen, Nonlinear model reduction via discrete empirical interpolation. SIAM J. Sci. Comput. 32, 2737–2764 (2010)
23. J.A. Christen, C. Fox, MCMC using an approximation. J. Comput. Graph. Stat. 14, 795–810 (2005)
24. P. Conrad, Y.M. Marzouk, Adaptive Smolyak pseudospectral approximations. SIAM J. Sci. Comput. 35, A2643–A2670 (2013)
25. P. Conrad, Y.M. Marzouk, N. Pillai, A. Smith, Accelerating asymptotically exact MCMC for computationally intensive models via local approximations. J. Am. Stat. Assoc. submitted (2014). arXiv:1402.1694
26. T.M. Cover, J.A. Thomas, Elements of Information Theory, 2nd edn. (Wiley, Hoboken, 2006)
27. T. Cui, Y.M. Marzouk, K. Willcox, Data-driven model reduction for the Bayesian solution of inverse problems. Int. J. Numer. Methods Eng. 102, 966–990 (2015)
28. S. Davis et al., Bayesian inference as a tool for analysis of first-principles calculations of complex materials: an application to the melting point of Ti2GaN. Model. Simul. Mater. Sci. Eng. 21, 075001 (2013)
29. M. Demkowicz, D. Bhattacharyya, I. Usov, Y. Wang, M. Nastasi, A. Misra, The effect of excess atomic volume on He bubble formation at fcc-bcc interfaces. Appl. Phys. Lett. 97, 161903 (2010)
30. M. Demkowicz, R. Hoagland, Structure of Kurdjumov-Sachs interfaces in simulations of a copper-niobium bilayer. J. Nucl. Mater. 372, 45–52 (2008)
31. M. Demkowicz, R. Hoagland, B. Uberuaga, A. Misra, Influence of interface sink strength on the reduction of radiation-induced defect concentrations and fluxes in materials with large interface area per unit volume. Phys. Rev. B 84, 104102 (2011)
32. M. Demkowicz, A. Misra, A. Caro, The role of interface structure in controlling high helium concentrations. Curr. Opin. Solid State Mater. Sci. 16, 101–108 (2012)
33. M.J. Demkowicz, J. Wang, R.G. Hoagland, Interfaces between dissimilar crystalline solids. Dislocat. Solids 14, 141–205 (2008)
34. M. Eldred, S. Giunta, S. Collis, Second-order corrections for surrogate-based optimization with model hierarchies, in AIAA Paper 2004-4457, Proceedings of the 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference (2004)
35. D. Eyre, An unconditionally stable one-step scheme for gradient systems. Unpublished manuscript, University of Utah, Salt Lake City, June (1998)
36. I. Ford, D.M. Titterington, K. Christos, Recent advances in nonlinear experimental design. Technometrics 31, 49–60 (1989)
37. E. Fu, N. Li, A. Misra, R. Hoagland, H. Wang, X. Zhang, Mechanical properties of sputtered Cu/V and Al/Nb multilayer films. Mater. Sci. Eng.: A 493, 283–287 (2008). Mechanical Behavior of Nanostructured Materials, a Symposium Held in Honor of Carl Koch at the TMS Annual Meeting 2007, Orlando, Florida
38. E. Fu, A. Misra, H. Wang, L. Shao, X.
Zhang, Interface enabled defects reduction in helium ion irradiated Cu/V nanolayers. J. Nucl. Mater. 407, 178–188 (2010)
39. L.D. Gabbay, S. Senturia, Computer-aided generation of nonlinear reduced-order dynamic macromodels. I. Non-stress-stiffened case. Microelectromech. Syst. J. 9, 262–269 (2000)
40. T. Gerstner, M. Griebel, Dimension-adaptive tensor-product quadrature. Computing 71, 65–87 (2003)
41. R. Ghanem, P. Spanos, Stochastic Finite Elements: A Spectral Approach (Springer, Berlin, 1991)
42. J. Ginebra, On the measure of the information in a statistical experiment. Bayesian Anal. 2, 167–212 (2007)
43. M. Grepl, Y. Maday, N. Nguyen, A. Patera, Efficient reduced-basis treatment of nonaffine and nonlinear partial differential equations. Math. Model. Numer. Anal. (M2AN) 41, 575–605 (2007)
44. G.E. Hilley, R. Bürgmann, P.-Z. Zhang, P. Molnar, Bayesian inference of plastosphere viscosities near the Kunlun fault, northern Tibet. Geophys. Res. Lett. 32 (2005)
45. J. Hirth, J. Lothe, Theory of Dislocations (Wiley, New York, 1992)
46. J.A. Hoeting, D. Madigan, A.E. Raftery, C.T. Volinsky, Bayesian model averaging: a tutorial. Stat. Sci. 14, 382–417 (1999)
47. P. Holmes, J. Lumley, G. Berkooz, Turbulence, Coherent Structures, Dynamical Systems and Symmetry (Cambridge University Press, Cambridge, 1996)
48. X. Huan, Y.M. Marzouk, Simulation-based optimal Bayesian experimental design for nonlinear systems. J. Comput. Phys. 232, 288–317 (2013)
49. X. Huan, Y.M. Marzouk, Gradient-based stochastic optimization methods in Bayesian experimental design. Int. J. Uncertain. Quantif. 4, 479–510 (2014)
50. H. Jeffreys, An invariant form for the prior probability in estimation problems, in Proceedings of the Royal Society (1946)
51. D.R. Jones, M. Schonlau, W.J. Welch, Efficient global optimization of expensive black-box functions. J. Global Optim. 13, 455–492 (1998)
52. A. Kashinath, A. Misra, M. Demkowicz, Stable storage of helium in nanoscale platelets at semicoherent interfaces. Phys. Rev. Lett. 110, 086101 (2013)
53. M.C. Kennedy, A. O’Hagan, Bayesian calibration of computer models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63, 425–464 (2001)
54. J. Kiefer, J. Wolfowitz, Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23, 462–466 (1952)
55. J. Kristensen, N.J. Zabaras, Bayesian uncertainty quantification in the evaluation of alloy properties with the cluster expansion method. Comput. Phys. Commun. 185, 2885–2892 (2014)
56. H. Kushner, G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Applications of Mathematics (Springer, Berlin, 2003)
57. O.P. Le Maître, O.M. Knio, Spectral Methods for Uncertainty Quantification: With Applications to Computational Fluid Dynamics (Springer, Berlin, 2010)
58. J. Lewandowski, A. Greer, Temperature rise at shear bands in metallic glasses. Nat. Mater. 5, 15–18 (2006)
59. J. Li, Y.M. Marzouk, Adaptive construction of surrogates for the Bayesian solution of inverse problems. SIAM J. Sci. Comput. 36, A1163–A1186 (2014)
60. N. Li, E. Fu, H. Wang, J. Carter, L. Shao, S. Maloy, A. Misra, X. Zhang, He ion irradiation damage in Fe/W nanolayer films. J. Nucl. Mater. 389, 233–238 (2009)
61. N. Li, M. Martin, O. Anderoglu, A. Misra, L. Shao, H. Wang, X. Zhang, He ion irradiation damage in Al/Nb multilayers. J. Appl. Phys. 105, 123522 (2009)
62. N. Li, J. Wang, J. Huang, A. Misra, X.
Zhang, In situ TEM observations of room temperature dislocation climb at interfaces in nanolayered Al/Nb composites. Scripta Mater. 63, 363–366 (2010)
63. D.V. Lindley, Bayesian Statistics, A Review (Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1972)
64. T.J. Loredo, Rotating stars and revolving planets: Bayesian exploration of the pulsating sky, in Bayesian Statistics 9: Proceedings of the Ninth Valencia International Meeting, Oxford University Press (2010), pp. 361–392
65. T.J. Loredo, D.F. Chernoff, Bayesian adaptive exploration, in Statistical Challenges of Astronomy (Springer, Berlin, 2003), pp. 57–69
66. D.J. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge University Press, Cambridge, 2003)
67. Y.M. Marzouk, H.N. Najm, Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. J. Comput. Phys. 228, 1862–1902 (2009)
68. Y.M. Marzouk, H.N. Najm, L.A. Rahn, Stochastic spectral methods for efficient Bayesian solution of inverse problems. J. Comput. Phys. 224, 560–586 (2007)
69. Y.M. Marzouk, D. Xiu, A stochastic collocation approach to Bayesian inference in inverse problems. Commun. Comput. Phys. 6, 826–847 (2009)
70. J.E. Mehner, L.D. Gabbay, S.D. Senturia, Computer-aided generation of nonlinear reduced-order dynamic macromodels. II. Stress-stiffened case. Microelectromech. Syst. J. 9, 270–278 (2000)
71. Y. Mishin, M. Asta, J. Li, Atomistic modeling of interfaces and their impact on microstructure and properties. Acta Mater. 58, 1117–1151 (2010)
72. A. Misra, M. Demkowicz, X. Zhang, R. Hoagland, The radiation damage tolerance of ultra-high strength nanolayered composites. JOM 59, 62–65 (2007)
73. A. Misra, J. Hirth, R. Hoagland, Length-scale-dependent deformation mechanisms in incoherent metallic multilayered composites. Acta Mater. 53, 4817–4824 (2005)
74. T.E. Mitchell, Y.C. Lu, A.J.G. Jr., M. Nastasi, H. Kung, Structure and mechanical properties of copper/niobium multilayers. J. Am. Ceram. Soc. 80, 1673–1676 (1997)
75. J.I. Myung, M.A. Pitt, Optimal experimental design for model discrimination. Psychol. Rev. 116, 499–518 (2009)
76. J.A. Nelder, R. Mead, A simplex method for function minimization. Comput. J. 7, 308–313 (1965)
77. A. Noor, J. Peters, Reduced basis technique for nonlinear analysis of structures. AIAA J. 18, 455–462 (1980)
78. G. Odette, M. Alinger, B. Wirth, Recent developments in irradiation-resistant steels. Annu. Rev. Mater. Res. 38, 471–503 (2008)
79. M. Parno, P. Conrad, A. Davis, Y.M. Marzouk, MIT uncertainty quantification (MUQ) library. http://bitbucket.org/mituq/muq
80. V. Picheny, D. Ginsbourger, Y. Richet, G. Caplin, Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55, 2–13 (2013)
81. Z.-Q. Qu, Model Order Reduction Techniques with Applications in Finite Element Analysis (Springer Science & Business Media, Berlin, 2004)
82. C. Rasmussen, C. Williams, Gaussian Processes for Machine Learning (The MIT Press, Cambridge, 2006)
83. C.P. Robert, G. Casella, Monte Carlo Statistical Methods (Springer, Berlin, 2004)
84. K.J. Ryan, Estimating expected information gains for experimental designs with application to the random fatigue-limit model. J. Comput. Graph. Stat. 12, 585–603 (2003)
85. C. Schwab, A.M. Stuart, Sparse deterministic approximation of Bayesian inverse problems. Inv. Prob. 28, 045003 (2012)
86. S. Shao, H. Zbib, I.
Mastorakos, D. Bahr, The void nucleation strengths of the Cu-Ni-Nb-based nanoscale metallic multilayers under high strain rate tensile loadings. Comput. Mater. Sci. 82, 435–441 (2014)
87. A. Shapiro, Asymptotic analysis of stochastic programs. Ann. Oper. Res. 30, 169–186 (1991)
88. L. Sirovich, Turbulence and the dynamics of coherent structures. Part 1: coherent structures. Q. Appl. Math. 45, 561–571 (1987)
89. A. Solonen, H. Haario, M. Laine, Simulation-based optimal design using a response variance criterion. J. Comput. Graph. Stat. 21, 234–252 (2012)
90. J.C. Spall, An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Tech. Dig. 19, 482–492 (1998)
91. A. Sutton, R. Balluffi, Interfaces in Crystalline Materials, Monographs on the Physics and Chemistry of Materials (Clarendon Press, Oxford, 1995)
92. H. Ullmaier, The influence of helium on the bulk properties of fusion reactor structural materials. Nucl. Fus. 24, 1039 (1984)
93. J. van den Berg, A. Curtis, J. Trampert, Optimal nonlinear Bayesian experimental design: an application to amplitude versus offset experiments. Geophys. J. Int. 155, 411–421 (2003)
94. A. Vattré, N. Abdolrahim, K. Kolluri, M. Demkowicz, Computational design of patterned interfaces using reduced order models. Sci. Rep. 4 (2014)
95. Y.-C. Wang, A. Misra, R. Hoagland, Fatigue properties of nanoscale Cu/Nb multilayers. Scripta Mater. 54, 1593–1598 (2006)
96. B.P. Weaver, B.J. Williams, C.M. Anderson-Cook, D.M. Higdon, Computational enhancements to Bayesian design of experiments using Gaussian processes. Bayesian Anal. (2015)
97. Q. Wei, N. Li, N. Mara, M. Nastasi, A. Misra, Suppression of irradiation hardening in nanoscale V/Ag multilayers. Acta Mater. 59, 6331–6340 (2011)
98. Q. Wei, A. Misra, Transmission electron microscopy study of the microstructure and crystallographic orientation relationships in V/Ag multilayers. Acta Mater. 58, 4871–4882 (2010)
99. B. Williams, D. Higdon, J. Gattiker, L. Moore, M. McKay, S. Keller-McNulty, Combining experimental data and computer simulations, with an application to flyer plate experiments. Bayesian Anal. 1, 765–792 (2006)
100. D. Xiu, Efficient collocational approach for parametric uncertainty analysis. Commun. Comput. Phys. 2, 293–309 (2007)
101. D. Xiu, J.S. Hesthaven, High-order collocation methods for differential equations with random inputs. SIAM J. Sci. Comput. 27, 1118–1139 (2005)
102. L. Yarin, The Pi-Theorem: Applications to Fluid Mechanics and Heat and Mass Transfer, vol. 1 (Springer, Berlin, 2012)
103. Y. Yoo, C. Jo, C. Jones, Compositional prediction of creep rupture life of single crystal Ni base superalloy by Bayesian neural network. Materials Science and Engineering, pp. 22–29 (2001)
104. D. Yuryev, M. Demkowicz, Computational design of solid-state interfaces using O-lattice theory: an application to mitigating helium-induced damage. Appl. Phys. Lett. 105, 221601 (2014)
105. X. Zhang, E. Fu, A. Misra, M. Demkowicz, Interface-enabled defect reduction in He ion irradiated metallic multilayers. JOM 62, 75–78 (2010)
106. S. Zinkle, N. Ghoniem, Operating temperature windows for fusion reactor structural materials. Fusion Eng. Des. 51, 55–71 (2000)

Chapter 3 Bayesian Optimization for Materials Design
Peter I. Frazier and Jialei Wang

Abstract We introduce Bayesian optimization, a technique developed for optimizing time-consuming engineering simulations and for fitting machine learning models on large datasets.
Bayesian optimization guides the choice of experiments during materials design and discovery to find good material designs in as few experiments as possible. We focus on the case where materials designs are parameterized by a low-dimensional vector. Bayesian optimization is built on a statistical technique called Gaussian process regression, which allows predicting the performance of a new design based on previously tested designs. After providing a detailed introduction to Gaussian process regression, we describe two Bayesian optimization methods: expected improvement, for design problems with noise-free evaluations; and the knowledge-gradient method, which generalizes expected improvement and may be used in design problems with noisy evaluations. Both methods are derived using a value-of-information analysis, and enjoy one-step Bayes-optimality.

3.1 Introduction

In materials design and discovery, we face the problem of choosing the chemical structure, composition, or processing conditions of a material to meet design criteria. The traditional approach is to use iterative trial and error, in which we (1) choose some material design that we think will work well based on intuition, past experience, or theoretical knowledge; (2) synthesize and test the material in physical experiments; and (3) use what we learn from these experiments in choosing the material design to try next. This iterative process is repeated until some combination of success and exhaustion is achieved. While trial and error has been extremely successful, we believe that mathematics and computation together promise to accelerate the pace of materials discovery, not by changing the fundamental iterative nature of materials design, but by improving the choices that we make about which material designs to test, and by improving our ability to learn from previous experimental results. In this chapter, we describe a collection of mathematical techniques, based on Bayesian statistics and decision theory, for augmenting and enhancing the trial and error process. We focus on one class of techniques, called Bayesian optimization (BO), or Bayesian global optimization (BGO), which use machine learning to build a predictive model of the underlying relationship between the design parameters of a material and its properties, and then use decision theory to suggest which design or designs would be most valuable to try next. The most well-developed Bayesian optimization methods assume that (1) the material is described by a vector of continuous variables, as is the case, e.g., when choosing ratios of constituent compounds, or choosing a combination of temperature and pressure to use during manufacture; (2) we have a single measure of quality that we wish to make as large as possible; and (3) the constraints that define which materials designs are feasible are all known, so that any requirements not known in advance are incorporated into the quality measure.
There is also a smaller body of work on problems that go beyond these assumptions, either by considering discrete design decisions (such as small molecule design), multiple competing objectives, or by explicitly allowing unknown constraints. Bayesian optimization was pioneered by [1], with early development through the 1970s and 1980s by Mockus and Zilinskas [2, 3]. Development in the 1990s was marked by the popularization of Bayesian optimization by Jones, Schonlau, and Welch, who, building on previous work by Mockus, introduced the Efficient Global Optimization (EGO) method [4]. This method became quite popular and well-known in engineering, where it has been adopted for design applications involving time-consuming computer experiments, within a broader set of methods designed for optimization of expensive functions [5]. In the 2000s, development of Bayesian optimization continued in statistics and engineering, and the 2010s have seen additional development from the machine learning community, where Bayesian optimization is used for tuning hyperparameters of computationally expensive machine learning models [6]. Other introductions to Bayesian optimization may be found in the tutorial article [7] and textbooks [8, 9], and an overview of the history of the field may be found in [10]. We begin in Sect. 3.2 by introducing the precise problem considered by Bayesian optimization. We then describe in Sect. 3.3 the predictive technique used by Bayesian optimization, which is called Gaussian process (GP) regression. We then show, in Sect. 3.4, how Bayesian optimization recommends which experiments to perform. In Sect. 3.5 we provide an overview of software packages, both freely available and commercial, that implement the Bayesian optimization methods described in this chapter. We offer closing remarks in Sect. 3.6.

3.2 Bayesian Optimization

Bayesian optimization considers materials designs parameterized by a d-dimensional vector x. We suppose that the space of materials designs in which x takes values is a known set A ⊆ R^d. For example, x = (x(1), . . . , x(d)) could give the ratio of each of d different constituents mixed together to create some aggregate material. In this case, we would choose A to be the set A = {x : Σ_{i=1}^{d} x(i) = 1}. As another example, setting d = 2, x = (x(1), x(2)) could give the temperature (x(1)) and pressure (x(2)) used in material processing. In this case, we would choose A to be the rectangle bounded by the experimental setup's minimum and maximum achievable temperature, Tmin and Tmax, on one axis, and the minimum and maximum achievable pressure on the other. As a final example, we could let x = (x(1), . . . , x(d)) be the temperatures used in some annealing schedule, assumed to be decreasing over time. In this case, we would set A to be the set {x : Tmax ≥ x(1) ≥ · · · ≥ x(d) ≥ Tmin}. Let f(x) be the quality of the material with design parameter x. The function f is unknown, and observing f(x) requires synthesizing material design x and observing its quality in a physical experiment. We would like to find a design x for which f(x) is large. That is, we would like to solve

max_{x ∈ A} f(x).   (3.1)

This is challenging because evaluating f(x) is typically expensive and time-consuming. While the time and expense depend on the setting, synthesizing and testing a new material design could easily take days or weeks of effort and thousands of dollars of materials.
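As a concrete illustration of these three choices of A, the sketch below encodes each as a simple membership test. The temperature and pressure bounds are hypothetical values, not ones prescribed by the chapter, and the mixture example additionally assumes nonnegative ratios.

```python
import numpy as np

def in_simplex(x, tol=1e-9):
    """Mixture ratios: components assumed nonnegative and summing to one."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= -tol) and abs(x.sum() - 1.0) <= tol)

def in_box(x, t_range=(300.0, 900.0), p_range=(1.0, 50.0)):
    """Temperature/pressure rectangle; bounds here are placeholder values."""
    (t_lo, t_hi), (p_lo, p_hi) = t_range, p_range
    return t_lo <= x[0] <= t_hi and p_lo <= x[1] <= p_hi

def in_annealing_schedule(x, t_min=300.0, t_max=900.0):
    """Decreasing temperature schedule bounded by Tmin and Tmax."""
    x = np.asarray(x, dtype=float)
    return bool(x[0] <= t_max and x[-1] >= t_min and np.all(np.diff(x) <= 0))
```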
In Bayesian optimization, we use mathematics to build a predictive model for the function f based on observations of previous materials designs, and then use this predictive model to recommend a materials design that would be most valuable to test next. We first describe this predictive model in Sect. 3.3; it is built using a machine learning technique called Gaussian process regression. We then describe, in Sect. 3.4, how this predictive model is used to recommend which design to test next.

3.3 Gaussian Process Regression

The predictive piece of Bayesian optimization is based on a machine learning technique called Gaussian process regression. This technique is a Bayesian version of a frequentist technique called kriging, introduced in the geostatistics literature by South African mining engineer Daniel Krige [11], and popularized later by Matheron and colleagues [12], as described in [13]. A modern monograph on Gaussian process regression is [14], and a list of software implementing Gaussian process regression may be found at [15]. In Gaussian process regression, we seek to predict f(x) based on observations at previously evaluated points, call them x1, . . . , xn. We first treat the case where f(x) can be observed exactly, without noise, and then later treat noise in Sect. 3.3.5. In this noise-free case, our observations are yi = f(xi) for i = 1, . . . , n. Gaussian process regression is a Bayesian statistical method, and in Bayesian statistics we perform inference by placing a so-called prior probability distribution on unknown quantities of interest. The prior probability distribution is often called, more simply, the prior distribution or, even more simply, the prior. This prior distribution is meant to encode our intuition or domain expertise regarding which values of the unknown quantity of interest are most likely. We then use Bayes rule, together with any data observed, to calculate a posterior probability distribution on these unknowns. For a broader introduction to Bayesian statistics, see the textbook [16] or the research monograph [17]. In Gaussian process regression, if we wish to predict the value of f at a single candidate point x∗, it is sufficient to consider our unknowns to be the values of f at the previously evaluated points, x1, . . . , xn, and the new point x∗ at which we wish to predict. That is, we take our unknown quantity of interest to be the vector (f(x1), . . . , f(xn), f(x∗)). We then take our data, which is f(x1), . . . , f(xn), and use Bayes rule to calculate a posterior probability distribution on the full vector of interest, (f(x1), . . . , f(xn), f(x∗)), or, more simply, just on f(x∗). To calculate the posterior, we must first specify the prior, which Gaussian process regression assumes to be multivariate normal. It calculates the mean vector of this multivariate normal prior distribution using a function, called the mean function and written here as μ0(·), which takes a single x as an argument. It applies this mean function to each of the points x1, . . . , xn, x∗ to create an (n + 1)-dimensional column vector. Gaussian process regression creates the covariance matrix of the multivariate normal prior distribution using another function, called the covariance function or covariance kernel and written here as Σ0(·, ·), which takes a pair of points x, x′ as arguments. It applies this covariance function to every pair of points in x1, . . . ,
xn, x∗ to create an (n + 1) × (n + 1) matrix. Thus, Gaussian process regression sets the prior probability distribution to

[f(x1), . . . , f(xn), f(x∗)]ᵀ ∼ Normal( [μ0(x1), . . . , μ0(xn), μ0(x∗)]ᵀ,
  [[Σ0(x1, x1), · · · , Σ0(x1, xn), Σ0(x1, x∗)],
   · · ·
   [Σ0(xn, x1), · · · , Σ0(xn, xn), Σ0(xn, x∗)],
   [Σ0(x∗, x1), · · · , Σ0(x∗, xn), Σ0(x∗, x∗)]] ).   (3.2)

The subscript “0” in μ0 and Σ0 indicates that these functions are relevant to the prior distribution, before any data has been collected. We now discuss how the mean and covariance functions are chosen, focusing on the covariance function first because it tends to be more important in getting good results from Gaussian process regression.

3.3.1 Choice of Covariance Function

In choosing the covariance function Σ0(·, ·), we wish to satisfy two requirements. The first is that it should encode the belief that points x and x′ near each other tend to have more similar values of f(x) and f(x′). To accomplish this, we want the covariance matrix in (3.2) to have entries that are larger for pairs of points that are closer together, and closer to 0 for pairs of points that are further apart. The second is that the covariance function should always produce positive semidefinite covariance matrices in the multivariate normal prior. That is, if Σ is the covariance matrix in (3.2), then we require that aᵀΣa ≥ 0 for all column vectors a (where a is assumed to have the appropriate length, n + 1). This requirement is necessary to ensure that the multivariate normal prior distribution is a well-defined probability distribution, because if θ is multivariate normal with mean vector μ and covariance matrix Σ, then the variance of a · θ is aᵀΣa, and we require variances to be non-negative. Several covariance functions satisfy these two requirements. The most commonly used is called the squared exponential, or Gaussian kernel, and is given by

Σ0(x, x′) = α exp( − Σ_{i=1}^{d} βi (xi − x′i)² ).   (3.3)

This kernel is parameterized by d + 1 parameters: α and β1, . . . , βd. The parameter α > 0 controls how much overall variability there is in the function f. We observe that under the prior, the variance of f(x) is Var(f(x)) = Cov(f(x), f(x)) = α. Thus, when α is large, we are encoding in our prior distribution that f(x) is likely to take a larger range of values. The parameter βi > 0 controls how quickly the function f varies with the i-th component of x. For example, consider the relationship between some point x and another point x′ = x + [1, 0, . . . , 0]. When β1 is small (close to 0), the covariance between f(x) and f(x′) is α exp(−β1) ≈ α, giving a correlation between f(x) and f(x′) of nearly 1. This reflects a belief that f(x) and f(x′) are likely to be very similar, and that learning the value of f(x) will also teach us a great deal about f(x′). In contrast, when β1 is large, the covariance between f(x) and f(x′) is nearly 0, giving a correlation between f(x) and f(x′) that is also nearly 0, reflecting a belief that f(x) and f(x′) are unrelated to each other, and that learning something about f(x) will teach us little about f(x′).

Going beyond the squared exponential kernel There are several other possibilities for the covariance kernel beyond the squared exponential kernel, which encode different assumptions about the underlying behavior of the function f.
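Before turning to those alternatives, a minimal sketch of how the squared exponential kernel (3.3) might be assembled into a covariance matrix; the parameter values in the usage example are illustrative only, not values prescribed by the chapter.

```python
import numpy as np

def squared_exponential(X1, X2, alpha, beta):
    """Squared exponential (Gaussian) kernel of (3.3).

    X1: (n, d) and X2: (m, d) arrays of design points; alpha > 0 sets the
    overall variability of f; beta is a length-d array of positive parameters
    controlling how quickly f varies in each coordinate. Returns the (n, m)
    matrix with entries alpha * exp(-sum_i beta_i * (x_i - x'_i)**2).
    """
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    sq_diff = (X1[:, None, :] - X2[None, :, :]) ** 2          # pairwise squared differences
    return alpha * np.exp(-np.einsum('ijk,k->ij', sq_diff, np.asarray(beta, float)))

# Illustrative use: draw one realization of f at a few points from the prior
# (constant mean zero; alpha and beta chosen arbitrarily for the example).
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
K = squared_exponential(X, X, alpha=1.0, beta=[10.0])
sample = np.random.default_rng(0).multivariate_normal(np.zeros(len(X)), K)
```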
One particularly useful generalization of the squared exponential covariance kernel is the Matérn covariance kernel, which allows more flexibility in modeling the smoothness of f. Before describing this kernel, let r = ( Σ_{i} βi (xi − x′i)² )^{1/2} be the Euclidean distance between x and x′, but where we have altered the length scale in each dimension by the strictly positive parameter βi. Then the squared exponential covariance kernel can be written as Σ0(x, x′) = α exp(−r²). With this notation, the Matérn covariance kernel is

Σ0(x, x′) = α (2^{1−ν} / Γ(ν)) (√(2ν) r)^ν K_ν(√(2ν) r),

where K_ν is the modified Bessel function. If we take the limit as ν → ∞, we obtain the squared exponential kernel ([14], Sect. 4.2, p. 85). The Matérn covariance kernel is useful because it allows modeling the smoothness of f in a more flexible way, as compared with the squared exponential kernel. Under the squared exponential covariance kernel, the function f is infinitely mean-square differentiable,¹ which may not be an appropriate assumption in many applications. In contrast, under the Matérn covariance kernel, f is k-times mean-square differentiable if and only if ν > k. Thus, we can model a function that is twice differentiable but no more by choosing ν = 5/2, and a function that is once differentiable but no more by choosing ν = 3/2. While the squared exponential and Matérn covariance kernels allow modeling a wide range of behaviors, and together represent a toolkit that will handle a wide variety of applications, there are other covariance kernels. For a thorough discussion of these, see Chap. 4 of [14]. Both the Matérn and squared exponential covariance kernels require choosing parameters. While it certainly is possible to choose the parameters α and βi (and ν in the case of Matérn) based on one's intuition about f, and what kinds of variability f is likely to have in a particular application, it is more common to choose these parameters (especially α and βi) adaptively, so as to best fit previously observed points. We discuss this further below in Sect. 3.3.6. First, however, we discuss the choice of the mean function.

¹ Being "mean-square differentiable" at x in the direction given by the unit vector ei means that the limit lim_{δ→0} (f(x + δei) − f(x))/δ exists in mean square. Being "k-times mean-square differentiable" is defined analogously.

3.3.2 Choice of Mean Function

We now discuss choosing the mean function μ0(·). Perhaps the most common choice is to simply set the mean function equal to a constant, μ. This constant must be estimated, along with parameters of the covariance kernel such as α and βi, as discussed in Sect. 3.3.6. Beyond this simple choice, if one believes that there will be trends in f that can be described in a parametric way, then it is useful to include trend terms in the mean function. This is accomplished by choosing

μ0(x) = μ + Σ_{j=1}^{J} γj Ψj(x),

where the Ψj(·) are known functions, and the γj ∈ R, along with μ ∈ R, are parameters that must be estimated. A common choice for the Ψj, if one chooses to include them, are polynomials in x up to some small order. For example, if d = 2, so x is two-dimensional, then one might include all polynomials up to second order, Ψ1(x) = x1, Ψ2(x) = x2, Ψ3(x) = (x1)², Ψ4(x) = (x2)², Ψ5(x) = x1 x2, setting J = 5. One recovers the constant mean function by setting J = 0.

3.3.3 Inference

Given the prior distribution (3.2) on f(x1), . . . , f(xn)
, f (xn ), f (x ∗ ), and given (noise-free) observations of f (x1 ), . . . , f (xn ), the critical step in Gaussian process regression is calculating the posterior distribution on f (x ∗ ). We rely on the following general result about conditional probabilities and multivariate normal distributions. Its proof, which may be found in the Derivations and Proofs section, relies on Bayes rule and algebraic manipulation of the probability density of the multivariate normal distribution. Proposition 1 Let θ be a k-dimensional multivariate normal random column vector, with mean vector μ and covariance matrix Σ. Let k1 ≥ 1, k2 ≥ 1 be two integers summing to k. Decompose θ, μ and Σ as  θ[1] , θ[2]   θ= μ=  μ[1] , μ[2]  Σ=  Σ[1,1] Σ[1,2] , Σ[2,1] Σ[2,2] so that θ[i] and μ[i] are ki -column vectors, and Σ[i, j] is a ki × k j matrix, for each i, j = 1, 2. 52 P.I. Frazier and J. Wang If Σ1,1 and Σ2,2 are invertible, then, for any u ∈ Rk1 , the conditional distribution of θ[2] given that θ[1] = u is multivariate normal with mean −1 (u − μ[1] ) μ[2] + Σ[2,1] Σ[1,1] and covariance matrix −1 Σ[1,2] . Σ[2,2] − Σ[2,1] Σ[1,1] We use this proposition to calculate the posterior distribution on f (x ∗ ), given f (x1 ), . . . , f (xn ). Before doing so, however, we first introduce some additional notation. We let y1:n indicate the column vector [y1 , . . . , yn ]T , and we let x1:n indicate the sequence of vectors (x1 , . . . , xn ). We let f (x1:n ) = [ f (x1 ), . . . , f (xn )]T , and similarly for other functions of x, such as μ0 (·). We introduce similar additional notation  for  pairs of points x, x , so that Σ(x1:n , x1:n ) is the matrix  functions that take Σ0 (x1 ,x1 ) ··· Σ0 (x1 ,xn ) .. . .. . .. . , Σ0 (x ∗ , x1:n ) is the row vector [Σ0 (x ∗ , x1 ), . . . , Σ0 (x ∗ , Σ0 (xn ,x1 ) ··· Σ0 (xn ,xn ) xn )], and Σ0 (x1:n , x ∗ ) is the column vector [Σ0 (x1 , x ∗ ), . . . , Σ0 (xn , x ∗ )]T . This notation allows us to rewrite (3.2) as       μ0 (x1:n ) Σ0 (x1:n , x1:n ) Σ0 (x1:n , x ∗ ) y1:n = Normal , . f (x ∗ ) μ0 (x ∗ ) Σ0 (x ∗ , x1:n ) Σ0 (x ∗ , x ∗ ) (3.4) We now examine this expression in the context of Proposition 1. We set θ[1] = f (x1:n ), θ[2] = f (x ∗ ), μ[1] = μ0 (x1:n ), μ[2] = μ0 (x ∗ ), Σ[1,1] = Σ0 (x1:n , x1:n ), Σ[1,2] = Σ0 (x1:n , x ∗ ), Σ[2,1] = Σ0 (x ∗ , x1:n ), and Σ[2,2] = Σ0 (x ∗ , x ∗ ). Then, applying Proposition 1, we see that the posterior distribution on f (x ∗ ) given observations yi = f (xi ), i = 1, . . . , n is normal, with a mean μn (x ∗ ) and variance σn2 (x ∗ ) given by, μn (x ∗ ) = μ0 (x ∗ ) + Σ0 (x ∗ , x1:n )Σ0 (x1:n , x1:n )−1 ( f (x1:n ) − μ0 (x1:n )), (3.5) σn2 (x ∗ ) (3.6) ∗ ∗ ∗ −1 ∗ = Σ0 (x , x ) − Σ0 (x , x1:n )Σ0 (x1:n , x1:n ) Σ0 (x1:n , x ). The invertibility of Σ0 (x1:n , x1:n ) (and also Σ0 (x ∗ , x ∗ )) required by Proposition 1 depends on the covariance kernel and its parameters (typically called hyperparameters), but this invertibility typically holds as long as these hyperparameters satisfy mild non-degeneracy conditions, and the x1:n are distinct, i.e., that we have not measured the same point more than once. For example, under the squared exponential covariance kernel, invertibility holds as long as α > 0 and the x1:n are distinct. If we have measured a point multiple times, then we can safely drop all but one of the measurements, here where observations are noise-free. Below, we treat the case where observations are noisy, and in this case including multiple measurements of the same point is perfectly reasonable and does not cause issues. 
Fig. 3.1 Illustration of Gaussian process regression with noise-free evaluations. The circles show previously evaluated points, (xi, f(xi)). The solid line shows the posterior mean, μn(x), as a function of x, which is an estimate of f(x), and the dashed lines show a Bayesian credible interval for each f(x), calculated as μn(x) ± 1.96σn(x). Although this example shows f taking a scalar input, Gaussian process regression can be used for functions with vector inputs

Figure 3.1 shows the output from Gaussian process regression. In the figure, circles show points (xi, f(xi)), the solid line shows μn(x∗) as a function of x∗, and the dashed lines are positioned at μn(x∗) ± 1.96σn(x∗), forming a 95 % Bayesian credible interval for f(x∗), i.e., an interval in which f(x∗) lies with posterior probability 95 %. (A credible interval is the Bayesian version of a frequentist confidence interval.) Because observations are noise-free, the posterior mean μn(·) interpolates the observations (xi, f(xi)).

3.3.4 Inference with Just One Observation

The expressions (3.5) and (3.6) are complex, and perhaps initially difficult to assimilate. To give more intuition about them, and also to support some additional analysis below in Sect. 3.4, it is useful to consider the simplest case, when we have just a single measurement, n = 1. In this case, all matrices in (3.5) and (3.6) are scalars, Σ0(x∗, x1) = Σ0(x1, x∗), and the expressions (3.5) and (3.6) can be rewritten as

μ1(x∗) = μ0(x∗) + (Σ0(x∗, x1) / Σ0(x1, x1)) (f(x1) − μ0(x1)),   (3.7)
σ1²(x∗) = Σ0(x∗, x∗) − Σ0(x∗, x1)² / Σ0(x1, x1).   (3.8)

Intuition about the expression for the posterior mean We first examine (3.7). We see that the posterior mean of f(x∗), μ1(x∗), which we can think of as our estimate of f(x∗) after observing f(x1), is obtained by taking our original estimate of f(x∗), μ0(x∗), and adding to it a correction term. This correction term is itself the product of two quantities: the error f(x1) − μ0(x1) in our original estimate of f(x1), and the quantity Σ0(x∗, x1)/Σ0(x1, x1). Typically, Σ0(x∗, x1) will be positive, and hence also Σ0(x∗, x1)/Σ0(x1, x1). (Recall that Σ0(x1, x1) is a variance, so it is never negative.) Thus, if f(x1) is bigger than expected, f(x1) − μ0(x1) will be positive, and our posterior mean μ1(x∗) will be larger than our prior mean μ0(x∗). In contrast, if f(x1) is smaller than expected, f(x1) − μ0(x1) will be negative, and our posterior mean μ1(x∗) will be smaller than our prior mean μ0(x∗). We can examine the quantity Σ0(x∗, x1)/Σ0(x1, x1) to understand the effect of the position of x∗ relative to x1 on the magnitude of the correction to the posterior mean. Notice that x∗ enters this expression only through the numerator. If x∗ is close to x1, then Σ0(x∗, x1) will be large under the squared exponential and most other covariance kernels, and positive values of f(x1) − μ0(x1) will also cause a strong positive change in μ1(x∗) relative to μ0(x∗). If x∗ is far from x1, then Σ0(x∗, x1) will be close to 0, and f(x1) − μ0(x1) will have little effect on μ1(x∗).

Intuition about the expression for the posterior variance Now we examine (3.8).
We see that the variance of our belief about f(x∗) under the posterior, σ1²(x∗), is smaller than its value under the prior, Σ0(x∗, x∗). Moreover, when x∗ is close to x1, Σ0(x∗, x1) will be large, and the reduction in variance from prior to posterior will also be large. Conversely, when x∗ is far from x1, Σ0(x∗, x1) will be close to 0, and the variance under the posterior will be similar to its value under the prior. As a final remark, we can also rewrite the expression (3.8) in terms of the squared correlation under the prior, Corr(f(x∗), f(x1))² = Σ0(x∗, x1)² / (Σ0(x∗, x∗) Σ0(x1, x1)) ∈ [0, 1], as

σ1²(x∗) = Σ0(x∗, x∗) (1 − Corr(f(x∗), f(x1))²).

We thus see that the reduction in variance of the posterior distribution depends on the squared correlation under the prior, with larger squared correlation implying a larger reduction.

3.3.5 Inference with Noisy Observations

The previous section assumed that f(x∗) can be observed exactly, without any error. When f(x∗) is the outcome of a physical experiment, however, our observations are obscured by noise. Indeed, if we were to synthesize and test the same material design x∗ multiple times, we might observe different results. To model this situation, Gaussian process regression can be extended to allow observations of the form

y(xi) = f(xi) + εi,

where we assume that the εi are normally distributed with mean 0 and constant variance, λ², with independence across i. In general, the variance λ² is unknown; we treat it as a hyperparameter of our model and estimate it along with all the other parameters of our model, as discussed below in Sect. 3.3.6. These assumptions of constant variance (called homoscedasticity) and independence make the analysis significantly easier, although they are often violated in practice. Experimental conditions that tend to violate these assumptions are discussed below, as are versions of GP regression that can be used when they are violated.

Analysis of independent homoscedastic noise To perform inference under independent homoscedastic noise, and calculate a posterior distribution on the value of the function f(x∗) at a given point x∗, our first step is to write down the joint distribution of our observations y1, . . . , yn and the quantity we wish to predict, f(x∗), under the prior. That is, we write down the distribution of the vector [y1, . . . , yn, f(x∗)]. We first observe that [y1, . . . , yn, f(x∗)] is the sum of [f(x1), . . . , f(xn), f(x∗)] and another vector, [ε1, . . . , εn, 0]. The first vector has a multivariate normal distribution given by (3.4). The second vector is independent of the first and is also multivariate normal, with a mean vector that is identically 0, and covariance matrix diag(λ², . . . , λ², 0). The sum of two independent multivariate normal random vectors is itself multivariate normal, with a mean vector and covariance matrix given, respectively, by the sums of the mean vectors and covariance matrices of the summands. This gives the distribution of [y1, . . . , yn, f(x∗)] as

[y1:n; f(x∗)] ∼ Normal( [μ0(x1:n); μ0(x∗)], [[Σ0(x1:n, x1:n) + λ²In, Σ0(x1:n, x∗)]; [Σ0(x∗, x1:n), Σ0(x∗, x∗)]] ),   (3.9)

where In is the n-dimensional identity matrix. As we did in Sect. 3.3.3, we can use Proposition 1 with the above expression to compute the posterior on f(x∗) given y1:n.
We obtain

μn(x∗) = μ0(x∗) + Σ0(x∗, x1:n) [Σ0(x1:n, x1:n) + λ²In]⁻¹ (y1:n − μ0(x1:n)),   (3.10)
σn²(x∗) = Σ0(x∗, x∗) − Σ0(x∗, x1:n) [Σ0(x1:n, x1:n) + λ²In]⁻¹ Σ0(x1:n, x∗).   (3.11)

If we set λ² = 0, so there is no noise, then we recover (3.5) and (3.6).

Fig. 3.2 Illustration of Gaussian process regression with noisy evaluations. As in Fig. 3.1, the circles show previously evaluated points, (xi, yi), where yi is f(xi) perturbed by constant-variance independent noise. The solid line shows the posterior mean, μn(x), as a function of x, which is an estimate of the underlying function f, and the dashed lines show a Bayesian credible interval for f, calculated as μn(x) ± 1.96σn(x)

Figure 3.2 shows an example of a posterior distribution calculated with Gaussian process regression with noisy observations. Notice that the posterior mean no longer interpolates the observations, and the credible interval has a strictly positive width at points where we have measured. Noise prevents us from observing function values exactly, and so we remain uncertain about the function value at points we have measured.

Going beyond homoscedastic independent noise Constant variance is violated if the experimental noise differs across materials designs, which occurs most frequently when noise arises during the synthesis of the material itself, rather than during the evaluation of a material that has already been created. Some work has been done to extend Gaussian process regression to flexibly model heteroscedastic noise (i.e., noise whose variance changes) [18–21]. The main idea in much of this work is to use a second Gaussian process to model the changing variance across the input domain. Much of this work assumes that the noise is independent and Gaussian, though [21] considers non-Gaussian noise. Independence is most typically violated, in the context of physical experiments, when the synthesis and evaluation of multiple materials designs is done together, and the variation in some shared component simultaneously influences these designs, e.g., through variation in the temperature while the designs are annealing together, or through variation in the quality of some constituent used in synthesis. We are aware of relatively little work modeling dependent noise in the context of Gaussian process regression and Bayesian optimization, with one exception being [22].

3.3.6 Parameter Estimation

The mean and covariance functions contain several parameters. For example, if we use the squared exponential kernel, a constant mean function, and observations have independent homoscedastic noise, then we must choose or estimate the parameters μ, α, β1, . . . , βd, λ. These parameters are typically called hyperparameters because they are parameters of the prior distribution. (λ² is actually a parameter of the likelihood function, but it is convenient to treat it together with the parameters of the prior.) While one may simply choose these hyperparameters directly, based on intuition about the problem, a more common approach is to choose them adaptively, based on data. To accomplish this, we write down an expression for the probability of the observed data y1:n in terms of the hyperparameters, marginalizing over the uncertainty on f(x1:n).
Then, we optimize this expression over the hyperparameters to find settings that make the observed data as likely as possible. This approach to setting hyperparameters is often called empirical Bayes, and it can be seen as an approximation to full Bayesian inference. We detail this approach for the squared exponential kernel with a constant mean function. Estimation for other kernels and mean functions is similar. Using the probability distribution of y1:n from (3.9), and neglecting constants, the natural logarithm of this probability, log p(y1:n | x1:n) (called the "log marginal likelihood"), can be calculated as

−(1/2) (y1:n − μ)ᵀ [Σ0(x1:n, x1:n) + λ²In]⁻¹ (y1:n − μ) − (1/2) log |Σ0(x1:n, x1:n) + λ²In|,

where | · | applied to a matrix indicates the determinant. To find the hyperparameters that maximize this log marginal likelihood (the neglected constant does not affect the location of the maximizer), we take partial derivatives with respect to each hyperparameter. We use them to maximize over μ and σ² := α + λ² analytically, and then use gradient-based optimization to maximize over the remaining hyperparameters. Taking a partial derivative with respect to μ, setting it to zero, and solving for μ, we get that the value of μ that maximizes the marginal likelihood is

μ̂ = Σ_{i=1}^{n} ( [Σ0(x1:n, x1:n) + λ²In]⁻¹ y1:n )_i  /  Σ_{i,j=1}^{n} ( [Σ0(x1:n, x1:n) + λ²In]⁻¹ )_{ij}.

Define R as the matrix with components

R_{ij} = 1 if i = j,   and   R_{ij} = g exp( − Σ_{k=1}^{d} βk (x_{i,k} − x_{j,k})² ) if i ≠ j,

where g = α/σ² and x_{i,k} denotes the k-th component of xi. Then Σ0(x1:n, x1:n) + λ²In = σ²R, and μ̂ can be written in terms of R as μ̂ = Σ_{i=1}^{n} (R⁻¹ y1:n)_i / Σ_{i,j=1}^{n} (R⁻¹)_{ij}. The log marginal likelihood (still neglecting constants) becomes

log p(y1:n | x1:n) ∼ −(1/2) (y1:n − μ̂)ᵀ (σ²R)⁻¹ (y1:n − μ̂) − (1/2) log |σ²R|.

Taking the partial derivative with respect to σ², and noting that μ̂ does not depend on σ², we solve for σ² and obtain

σ̂² = (1/n) (y1:n − μ̂)ᵀ R⁻¹ (y1:n − μ̂).

Substituting this estimate, the log marginal likelihood becomes

log p(y1:n | x1:n) ∼ −(n/2) log( |R|^{1/n} (1/n) (y1:n − μ̂)ᵀ R⁻¹ (y1:n − μ̂) ).   (3.12)

The expression (3.12) cannot in general be optimized analytically. Instead, one typically optimizes it numerically using a first- or second-order optimization algorithm, such as Newton's method or gradient descent, obtaining estimates for β1, . . . , βd and g. These estimates are in turn substituted to provide an estimate of R, from which estimates μ̂ and σ̂² may be computed. Finally, using σ̂² and the estimated value of g, we may estimate α and λ.

3.3.7 Diagnostics

When using Gaussian process regression, or any other machine learning technique, it is advisable to check the quality of the predictions, and to assess whether the assumptions made by the method are met. One way to do this is illustrated by Fig. 3.3, which comes from a simulation of blood flow near the heart, based on [23], for which we get exact (not noisy) observations of f(x). This plot is created with a technique called leave-one-out cross validation. In this technique, we iterate through the datapoints x1:n, y1:n, and for each i ∈ {1, . . . , n}, we train a Gaussian process regression model on all of the data except (xi, yi), and then use it, together with xi, to predict what the value yi should be. We obtain from this a posterior mean (the prediction), call it μ−i(xi), and also a posterior standard deviation, call it σ−i(xi).
When calculating these estimates, it is best to separately re-estimate the hyperparameters each time, leaving out the data (xi, yi). We then calculate a 95 % credible interval μ−i(xi) ± 2σ−i(xi), and create Fig. 3.3 by plotting "Predicted" versus "Actual", where the "Actual" coordinate (on the x-axis) is yi, and the "Predicted" value (on the y-axis) is pictured as an error bar centered at μ−i(xi) with half-width 2σ−i(xi).

Fig. 3.3 Diagnostic plot for Gaussian process regression, created with leave-one-out cross validation. For each point in our dataset, we hold that point (xi, yi) out, train on the remaining points, calculate a 95 % credible interval for yi, and plot this credible interval as an error bar whose x-coordinate is the actual value yi. If Gaussian process regression is working well, 95 % of the error bars will intersect the diagonal line Predicted = Actual

If the uncertainty estimates outputted by Gaussian process regression are behaving as anticipated, then approximately 95 % of the credible intervals will intersect the diagonal line Predicted = Actual. Moreover, if Gaussian process regression's predictive accuracy is high, then the credible intervals will be short, and their centers will be close to this same line Predicted = Actual. This idea may be extended to noisy function evaluations, under the assumption of independent homoscedastic noise. To handle the fact that the same point may be sampled multiple times, let m(x) be the number of times that a point x ∈ {x1, . . . , xn} was sampled, and let y(x) be the average of the observed values at this point. By holding out all m(xi) samples of xi and training Gaussian process regression on the remaining data, we obtain a normal posterior distribution on f(xi) that has mean μ−i(xi) and standard deviation σ−i(xi). Since y(xi) is then the sum of f(xi) and normally distributed noise with mean 0 and variance λ²/m(xi), the resulting distribution of y(xi) is normal with mean μ−i(xi) and standard deviation (σ−i²(xi) + λ²/m(xi))^{1/2}. From this, a 95 % credible interval for y(xi) is μ−i(xi) ± 2 (σ−i²(xi) + λ²/m(xi))^{1/2}. We would plot Predicted versus Observed by putting this credible interval along the y-axis at x-coordinate y(xi). If Gaussian process regression is working well, then approximately 95 % of these credible intervals will intersect the line Predicted = Observed. For Gaussian process regression to best support Bayesian optimization, it is typically most important to have good uncertainty estimates, and relatively less important to have high predictive accuracy. This is because Bayesian optimization uses Gaussian process regression as a guide for deciding where to sample, and so if Gaussian process regression reports that there is a great deal of uncertainty at a particular location and thus low predictive accuracy, Bayesian optimization can choose to sample at this location to improve accuracy. Thus, Bayesian optimization has a recourse for dealing with low predictive accuracy, as long as the uncertainty is accurately reported. In contrast, if Gaussian process regression estimates poor performance at a location that actually has near-optimal performance, and also provides an inappropriately low error estimate, then Bayesian optimization may not sample there within a reasonable timeframe, and thus may never correct the error.
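Before discussing possible fixes, a minimal sketch of the leave-one-out check described above. The fit_gp routine and its predict method are hypothetical placeholders for whatever Gaussian process implementation (with hyperparameter re-estimation) the reader uses; this is not code from the chapter.

```python
import numpy as np

def loo_diagnostics(X, y, fit_gp):
    """Leave-one-out cross validation for a GP model.

    X: (n, d) array of inputs; y: (n,) array of observations.
    fit_gp(X_train, y_train) is assumed (hypothetically) to return an object
    with a predict(x) method giving (posterior mean, posterior standard
    deviation), re-estimating hyperparameters on each fold.
    """
    n = len(y)
    mu_loo, sd_loo = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                  # hold out point i
        model = fit_gp(X[keep], y[keep])
        mu_loo[i], sd_loo[i] = model.predict(X[i])
    lo, hi = mu_loo - 2 * sd_loo, mu_loo + 2 * sd_loo
    coverage = np.mean((y >= lo) & (y <= hi))     # fraction of intervals containing y_i
    return mu_loo, sd_loo, coverage               # coverage should be near 0.95
```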
If either the uncertainty is incorrectly estimated, or the predictive accuracy is unsatisfactorily low, then the most common “fixes” employed are to adopt a different covariance kernel, or to transform the objective function f . If the√ objective function is known to be non-negative, then the transformations log( f ) and f are convenient for optimization because they are both strictly increasing, and so do not change the set of maximizers (or minimizers). If f is not non-negative, but is bounded below by some other known quantity a, then one may first shift f upward by a. 3.3.8 Predicting at More Than One Point Below, to support the development of the knowledge-gradient method in Sects. 3.4.2 and 3.6, it will be useful to predict the value of f at multiple points, x1∗ , . . . , xk∗ , with noise. To do so, we could certainly apply (3.10) and (3.11) separately for each x1∗ , . . . , xk∗ , and this would provide us with both an estimate (the posterior mean) and a measure of the size of the error in this estimate (the posterior variance) associated with each f (xi∗ ). It would not, however, quantify the relationship between the errors at several different locations. For this, we must perform the estimation jointly. ∗ )], which is, As we did in Sect. 3.3.5, we begin with our prior on [y1:n , f (x1:k       ∗ y1:n ) μ0 (x1:n ) Σ0 (x1:n , x1:n ) + λ2 In Σ0 (x1:n , x1:k ∼ Normal , , ∗ ∗ ∗ ∗ ∗ f (x1:k ) ) , x1:n ) Σ0 (x1:k , x1:k ) μ0 (x1:k Σ0 (x1:k ∗ We then use Proposition 1 to compute the posterior on f (x1:k ) given f (x1:n ), which ∗ ∗ ∗ , x1:k ) is multivariate normal with mean vector μn (x1:k ) and covariance matrix Σn (x1:k given by, % &−1 ∗ ∗ ∗ μn (x1:k ) = μ0 (x1:k ) + Σ0 (x1:k , x1:n ) Σ0 (x1:n , x1:n ) + λ2 In (y1:n − μ0 (x1:n )), % ∗ ∗ ∗ ∗ ∗ Σn (x1:k , x1:k ) = Σ0 (x1:k , x1:k ) − Σ0 (x1:k , x1:n ) Σ0 (x1:n , x1:n ) + λ2 In &−1 (3.13) ∗ Σ0 (x1:n , x1:k ). (3.14) We see that setting k = 1 provides the expressions (3.10) and (3.11) from Sect. 3.3.5. 3 Bayesian Optimization for Materials Design 61 3.3.9 Avoiding Matrix Inversion The expressions (3.10) and (3.11) for the posterior mean and variance in the noisy case, and also (3.7) and (3.8) in the noise-free case, include a matrix inversion term. Calculating this matrix inversion is slow and can be hard to accomplish accurately in practice, due to the finite precision of floating point implementations. Accuracy is especially an issue when Σ has terms that are close to 0, which arises when data points are close together. In practice, rather than calculating a matrix inverse directly, it is typically faster and more accurate to use a mathematically equivalent algorithm, which performs a Cholesky decomposition and then solves a linear system. This algorithm is described below, and is adapted from Algorithm 2.1 in Sect. 2.3 of [14]. This algorithm also computes the log marginal likelihood required for estimating hyperparameters in Sect. 3.3.6. Algorithm 1 Implementation using Cholesky decomposition Require: x1:n (inputs), y1:n (responses), Σ0 (x, x  ) (covariance function), λ2 (variance of noise), x ∗ (test input).   1: L = Cholesky Σ0 (x1:n , x1:n ) + λ2 In 2: δ = L T \ (L\ (y1:n − μ0 (x1:n ))) 3: μn (x ∗ ) = μ0 (x ∗ ) + Σ0 (x ∗ , x1:n )δ 4: v = L\Σ0 (x1:n , x ∗ ) 5: σn2 (x ∗ ) = Σ0 (x ∗ , x ∗ ) − v T v 6: log p(y1:n | x1:n ) = − 21 (y1:n − μ0 (x1:n ))T α − Σi log L ii − n2 log 2π 7: return μn (x ∗ ) (mean), σn2 (x ∗ ) (variance), log p(y1:n | x1:n ) (log marginal likelihood). 
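A direct transcription of Algorithm 1 into NumPy/SciPy might look as follows. This is a minimal sketch under the notation above: Sigma0 and mu0 are assumed to be user-supplied callables for the prior covariance and mean functions, and the vector written α in step 6 of the algorithm is the δ computed in step 2 (the name α follows [14]).

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict_cholesky(X, y, x_star, Sigma0, mu0, lam2):
    # Steps follow Algorithm 1; Sigma0 and mu0 are callables for the prior
    # covariance and mean functions, and lam2 is the noise variance.
    n = len(y)
    L = cholesky(Sigma0(X, X) + lam2 * np.eye(n), lower=True)             # step 1
    resid = y - mu0(X)
    delta = solve_triangular(L.T, solve_triangular(L, resid, lower=True),
                             lower=False)                                 # step 2
    k_star = Sigma0(x_star[None, :], X).ravel()
    mean = mu0(x_star[None, :])[0] + k_star @ delta                       # step 3
    v = solve_triangular(L, k_star, lower=True)                           # step 4
    var = Sigma0(x_star[None, :], x_star[None, :])[0, 0] - v @ v          # step 5
    logml = (-0.5 * resid @ delta                                         # step 6
             - np.sum(np.log(np.diag(L)))
             - 0.5 * n * np.log(2.0 * np.pi))
    return mean, var, logml

Once the O(n³) Cholesky factor is computed, each triangular solve costs O(n²), which is why this route is both faster and more numerically stable than forming the inverse explicitly.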
3.4 Choosing Where to Sample Being able to infer the value of the objective function f (x) at unevaluated points based on past data x1:n ,y1:n is only one part of finding good designs. The other part is using this ability to make good decisions about where to direct future sampling. Bayesian optimization methods addresses this by using a measure of the value of the information that would be gained by sampling at a point. Bayesian optimization methods then choose the point to sample next for which this value is largest. A number of different ways of measuring the value of information have been proposed. Here, we describe two in detail, expected improvement [2, 4], and the knowledge gradient [24, 25], and then survey a broader collection of design criteria. 62 P.I. Frazier and J. Wang 3.4.1 Expected Improvement Expected improvement, as it was first proposed, considered only the case where measurements are free from noise. In this setting, suppose we have taken n measurements at locations x1:n and observed y1:n . Then f n∗ = max f (xi ) i=1,...,n is the best value observed so far. Suppose we are considering evaluating f at a new point x. After this evaluation, the best value observed will be ∗ = max( f (x), f n∗ ), f n+1 and the difference between these values, which is the improvement due to sampling, is ∗ − f n∗ = max( f (x) − f n∗ , 0) = ( f (x) − f n∗ )+ , f n+1 where a + = max(a, 0) indicates the positive part function. Ideally, we would choose x to make this improvement as large as possible. Before actually evaluating f (x), however, we do not know what this improvement will be, so we cannot implement this strategy. However, we do have a probability distribution on f (x), from Gaussian process regression. The expected improvement, indicated EI(x), is obtained by taking the expectation of this improvement with respect to the posterior distribution on f (x) given x1:n , y1:n . EI(x) = E n [( f (x) − f n∗ )+ ], (3.15) where E n [ · ] = E[ · |x1:n , y1:n ] indicates the expectation with respect to the posterior distribution. The expectation in (3.15) can be computed more explicitly, in terms of the normal cumulative distribution function (cdf) Φ(·), and the normal probability density function (pdf) ϕ(·). Recalling from Sect. 3.3.3 that f (x) ∼ Normal(μn (x), σn2 (x)), where μn (x) and σn2 (x) are given by (3.5) and (3.6), and integrating with respect to the normal distribution (a derivation may be found in the Derivations and Proofs section), we obtain, EI(x) = (μn (x) − f n∗ )Φ  μn (x) − f n∗ σn (x)   + σn (x)ϕ μn (x) − f n∗ σn (x)  . (3.16) Figure 3.4 plots this expected improvement for a problem with a one-dimensional input space. We can see from this plot that the expected improvement is largest at locations where both the posterior mean μn (x) is large, and also the posterior standard deviation σn (x) is large. This is reasonable because those points that are most likely to provide large gains are those points that have a high predicted value, but that also 3 Bayesian Optimization for Materials Design 63 2 value 1 0 −1 −2 50 100 150 200 250 300 200 250 300 x 0.5 0.4 EI 0.3 0.2 0.1 0 50 100 150 x Fig. 3.4 Upper panel shows the posterior distribution in a problem with no noise and a onedimensional input space, where the circles are previously measured points, the solid line is the posterior mean μn (x), and the dashed lines are at μn (x) ± 2σn (x). Lower panel shows the expected improvement EI(x) computed from this posterior distribution. 
An “x” is marked at the point with the largest expected improvement, which is where we would evaluate next have significant uncertainty. Indeed, at points where we have already observed, and thus have no uncertainty, the expected improvement is 0. This is consistent with the idea that, in a problem without noise, there is no value to repeating an evaluation that has already been performed. This idea of favoring points that, on the one hand, have a large predicted value, but, on the other hand, have a significant amount of uncertainty, is called the exploration versus exploitation tradeoff, and appears in areas beyond Bayesian optimization, especially in reinforcement learning [26, 27] and multi-armed bandit problems [28, 29]. In these problems, we are taking actions repeatedly over time whose payoffs are uncertain, and wish to simultaneously get good immediate rewards, while learning the reward distributions for all actions to allow us to get better rewards in the future. We emphasize, however, that the correct balance between exploration and exploitation is different in Bayesian optimization as compared with multi-armed bandits, and should more favor exploration: in optimization, the advantage of measuring where the predicted value is high is that these areas tend to give more useful information about where the optimum lies; in contrast, in problems where we must “learn while doing” like multi-armed bandits, evaluating an action with high predicted reward is good primarily because it tends to give a high immediate reward. 64 1 0.5 Δ (x) 0 n Fig. 3.5 Contour plot of the expected improvement, as a function of the difference in means Δn (x) := μn (x) − f n∗ and the posterior standard deviation σn (x). The expected improvement is larger when the difference in means is larger, and when the standard deviation is larger P.I. Frazier and J. Wang −0.5 −1 0.2 0.4 0.6 0.8 1 σn(x) We can also see the exploration versus exploitation tradeoff implicit in the expected improvement function in the contour plot, Fig. 3.5. This plot shows the contours of EI(x) as a function of the posterior mean, expressed as a difference from the previous best, Δn (x) := μn (x) − f n∗ , and the posterior standard deviation σn (x). Given the expression (3.16), Bayesian optimization algorithms based on expected improvement, such as the Efficient Global Optimization (EGO) algorithm proposed by [4], and the earlier algorithms of Mockus (see, e.g., the monograph [2]), then recommend sampling at the point with the largest expected improvement. That is, xn+1 ∈ argmax EI(x). (3.17) x Finding the point with largest expected improvement is itself a global optimization problem, like the original problem that we wished to solve (3.1). Unlike (3.1), however, EI(x) can be computed quickly, and its first and second derivatives can also be computed quickly. Thus, we can expect to be able to solve (3.1) relatively well using an off-the-shelf optimization method for continuous global optimization. A common approach is to use a local solver for continuous optimization, such as gradient ascent, in a multistart framework, where we start the local solver from many starting points chosen at random, and then select the best local solution discovered. In Sect. 3.5 we describe several codes that implement expected improvement methods, and each makes its own choice about how to solve (3.17). 
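The multistart strategy just described is straightforward to sketch. In the illustration below, posterior is assumed to be a user-supplied callable returning μn(x) and σn(x) at a given x, f_star is the best value observed so far, and the choice of solver, bounds format, and number of restarts are our own illustrative defaults rather than those of any particular package.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def expected_improvement(x, posterior, f_star):
    # Expected improvement (3.16); posterior(x) returns (mu_n(x), sigma_n(x)).
    mu_n, sigma_n = posterior(x)
    if sigma_n <= 0.0:
        return 0.0          # already-observed (noise-free) points have EI = 0
    z = (mu_n - f_star) / sigma_n
    return (mu_n - f_star) * norm.cdf(z) + sigma_n * norm.pdf(z)

def maximize_ei_multistart(posterior, f_star, bounds, n_starts=20, seed=0):
    # Approximate argmax_x EI(x), per (3.17), with a local solver restarted
    # from n_starts random points inside the box 'bounds'.
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds)[:, 0], np.array(bounds)[:, 1]
    best_x, best_ei = None, -np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)
        res = minimize(lambda x: -expected_improvement(x, posterior, f_star),
                       x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_ei:
            best_x, best_ei = res.x, -res.fun
    return best_x, best_ei

This sketch relies on numerical gradients for simplicity; as noted above, the first and second derivatives of (3.16) can also be computed quickly and supplied to the inner solver.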
The algorithm given by (3.17) is optimal under three assumptions: (1) that we will take only a single sample; (2) there is no noise in our samples; and (3) that the x we will report as our final solution (i.e., the one that we will implement) must be among those previously sampled. In practice, assumption (1) is violated, as Bayesian optimization methods like (3.17) are applied iteratively, and is made simply because it simplifies the analysis. Being able to handle violations of assumption (1) in a more principled way is of great interest to researchers working on Bayesian optimization methodology, and some partial progress in that direction is discussed in Sect. 3.4.3. Assumption 3 Bayesian Optimization for Materials Design 65 (2) is also often violated in a broad class of applications, especially those involving physical experiments or stochastic simulations. In the next section, we present an algorithm, the knowledge-gradient algorithm [24, 25], that relaxes this assumption (2), and also allows relaxing assumption (3) if this is desired. 3.4.2 Knowledge Gradient When we have noise in our samples, the derivation of expected improvement meets with difficulty. In particular, if we have noise, then f n∗ = maxi=1,...,n f (xi ) is not precisely known, preventing us from using the expression (3.16). One may simply take a quantity like maxi=1,...,n yi that is similar in spirit to f n∗ = maxi=1,...,n f (xi ), and replace f n∗ in (3.16) with this quantity, but the resulting algorithm is no longer justified by an optimality analysis. Indeed, for problems with a great deal of noise, maxi=1,...,n yi tends to be significantly larger than the true underlying value of the best point previously sampled, and so the resulting algorithm may be led to make a poor tradeoff between exploration and exploitation, and exhibit poor performance in such situations. Instead, the knowledge-gradient algorithm [24, 25] takes a more principled approach, and starts where the derivation of expected improvement began, but fully accounts for the introduction of noise (assumption 2 in Sect. 3.4.1), and the possibility that we wish to search over a class of solutions broader than just those that have been previously evaluated when recommending the final solution (assumption 3 in Sect. 3.4.1). We first introduce a set An , which is the set of points from which we would choose the final solution, if we were asked to recommend a final solution at time n, based on x1:n , y1:n . For tractability, we suppose An is finite. For example, if A is finite, as it often is in discrete optimization via simulation problems, we could take An = A, allowing the whole space of feasible solutions. This choice was considered in [24]. Alternatively, one could take An = {x1 , . . . , xn }, stating that one is willing to consider only those points that have been previously evaluated. This choice is consistent with the expected improvement algorithm. Indeed, we will see that when one makes this choice, and measurements are free from noise, then the knowledgegradient algorithm is identical to the expected improvement algorithm. Thus, the knowledge-gradient algorithm generalizes the expected improvement algorithm. If we were to stop sampling at time n, then the expected value of a point x ∈ An based on the information available would be E n [ f (x)] = μn (x). In the special case when evaluations are free from noise, this is equal to f (x), but when there is noise, these two quantities may differ. 
If we needed to report a final solution, we would then choose the point in An for which this quantity is the largest, i.e., we would choose argmaxx∈An μn (x). Moreover, the expected value of this solution would be μ∗n = max μn (x). x∈An 66 P.I. Frazier and J. Wang If evaluations are free from noise and An = {x1:n }, then μ∗n is equal to f n∗ , but in general these quantities may differ. If we take one additional sample, then the expected value of the solution we would report based on this additional information is μ∗n+1 = max μn+1 (x), x∈An+1 where as before, An+1 is some finite set of points we would be willing to consider when choosing a final solution. Observe in this expression that μn+1 (x) is not necessarily the same as μn (x), even for points x ∈ {x1:n } that we had previously evaluated, but that μn+1 (x) can be computed from the history of observations x1:n+1 , y1:n+1 . The improvement in our expected solution value is then the difference between these two quantities, μ∗n+1 − μ∗n . This improvement is random at time n, even fixing xn+1 , through its dependence on yn+1 , but we can take its expectation. The resulting quantity is called the knowledge-gradient (KG) factor, and is written,   KGn (x) = E n μ∗n+1 − μ∗n | xn+1 = x . (3.18) Calculating this expectation is more involved than calculating the expected improvement, but nevertheless can also be done analytically in terms of the normal pdf and normal cdf. This is described in more detail in the Derivations and Proofs section. The knowledge-gradient algorithm is then the one that chooses the point to sample next that maximizes the KG factor, argmax KGn (x). x The KG factor for a one-dimensional optimization problem with noise is pictured in Fig. 3.6. We see a similar tradeoff between exploration and exploitation, where the KG factor favors measuring points with a large μn (x) and a large σn (x). We also see local minima of the KG factor at points where we previously evaluated, just as with the expected improvement, but because there is noise in our samples, the value at these points is not 0—indeed, when there is noise, it may be useful to sample repeatedly at a point. Choice of An and An+1 Recall that the KG factor depends on the choice of the sets An and An+1 , through the dependence of μ∗n and μ∗n+1 on these sets. Typically, if we choose these sets to contain more elements, then we allow μ∗n and μ∗n+1 to range over a larger portion of the space, and we allow the KG factor calculation to more accurately approximate the value that would result if we allowed ourself to implement the best option. However, as we increase the size of these sets, computing the KG factor is slower, making implementation of the KG method more computationally intensive. 3 Bayesian Optimization for Materials Design 67 2 value 1 0 −1 −2 50 100 150 200 250 300 200 250 300 x log(KG factor) −2 −4 −6 −8 −10 −12 −14 50 100 150 x Fig. 3.6 Upper panel shows the posterior distribution in a problem with independent normal homoscedastic noise and a one-dimensional input space, where the circles are previously measured points, the solid line is the posterior mean μn (x), and the dashed lines are at μn (x) ± 2σn (x). Lower panel shows the natural logarithm of the knowledge-gradient factor KG(x) computed from this posterior distribution, where An = An+1 are the discrete grid {1, . . . , 300}. 
An “x” is marked at the point with the largest KG factor, which is where the KG algorithm would evaluate next For applications with a finite A, [24] proposed setting An+1 = An = A, which was seen to require fewer function evaluations to find points with large f , in comparison with expected improvement on noise-free problems, and in comparison with another Bayesian optimization method, sequential kriging optimization (SKO) [30] on noisy problems. However, the computation and memory required grows rapidly with the size of A, and is typically not feasible when A contains more than 10,000 points. For large-scale applications, [25] proposed setting An+1 = An = {x1:n+1 } in (3.18), and called the resulting quantity the approximate knowledge gradient (AKG), observing that this choice maintains computational tractability as A grows, but also offers good performance. This algorithm is implemented in the DiceKriging package [31]. Finally, in noise-free problems (but not in problems with noise), setting An+1 = {x1:n+1 } and An = {x1:n } recovers expected improvement. 68 P.I. Frazier and J. Wang 3.4.3 Going Beyond One-Step Analyses, and Other Methods Both expected improvement and the knowledge-gradient method are designed to be optimal, in the special case where we will take just one more function evaluation and then choose a final solution. They are not, however, known to be optimal for the more general case in which we will take multiple measurements, which is the way they are used in practice. The optimal algorithm for this more general setting is understood to be the solution to a partially observable Markov decision process, but actually computing the optimal solution using this understanding is intractable using current methods [32]. Some work has been done toward the goal of developing such an optimal algorithm [33], but computing the optimal algorithm remains out of reach. Optimal strategies have been computed for other closely related problems in optimization of expensive noisy functions, including stochastic root-finding [34], multiple comparisons with a standard [35], and small instances of discrete noisy optimization with normally distributed noise (also called “ranking and selection”) [36]. Expected improvement and the knowledge gradient are both special cases of the more general concept of value of information, or expected value of sample information (EVSI) [37], as they calculate the expected reward of a final implementation decision as a function of the posterior distribution resulting from some information, subtract from this the expected reward that would result from not having the information, and then take the expectation of this difference with respect to the information itself. Many other Bayesian optimization methods have been proposed. A few of these methods optimize the value of information, but are calculated using different assumptions than those used to derive expected improvement or value of information. A larger number of these methods optimize quantities that do not correspond to a value of information, but are derived using analyses that are similar in spirit. These include methods that optimize the probability of improvement [1, 38, 39], the entropy of the posterior distribution on the location of the maximum [40], and other composite measures involving the mean and the standard deviation of the posterior [30]. Other Bayesian optimization methods are designed for problem settings that do not match the assumptions made in this tutorial. 
These include [41–43], which consider multiple objectives; [6, 44–46], which consider multiple simultaneous function evaluations; [47–49], which consider objective functions that can be evaluated with multiple fidelities and costs; [50], which considers Bernoulli outcomes, rather than normally distributed ones; [51], which considers expensive-to-evaluate inequality constraints; and [52], which considers optimization over the space of small molecules. 3 Bayesian Optimization for Materials Design 69 3.5 Software There are a number of excellent software packages, both freely available and commercial, that implement the methods described in this chapter, and other similar methods. • Metrics Optimization Engine (MOE), an open-source code in C++ and Python, developed by the authors and engineers at Yelp. http://yelp.github.io/MOE/, • Spearmint, an open-source code in Python, implementing algorithms described in [6]. https://github.com/JasperSnoek/spearmint • DiceKriging and DiceOptim, an open-source R package that implements expected improvement, the approximate knowledge-gradient method, and a variety of algorithms for parallel evaluations. An overview is provided in [31]. http://cran.r-project.org/web/packages/DiceOptim/index.html, • TOMLAB, a commercial package for MATLAB. http://tomopt.com/tomlab/ • matlabKG, an open-source research code that implements the discrete knowledgegradient method for small-scale problems. http://people.orie.cornell.edu/pfrazier/src.html A list of software packages focused on Gaussian process regression (but not Bayesian optimization) may be found at http://www.gaussianprocess.org/. 3.6 Conclusion We have presented Bayesian optimization, including Gaussian process regression, the expected improvement method, and the knowledge-gradient method. In making this presentation, we wish to emphasize that this approach to materials design acknowledges the inherent uncertainty in statistical prediction and seeks to guide experimentation in a way that is robust to this uncertainty. It is inherently iterative, and does not seek to circumvent the fundamental trial-and-error process. This is in contrast with another approach to informatics in materials design, which holds the hope that predictive methods can short-circuit the iterative loop entirely. In this alternative view of the world, one hopes to create extremely accurate prediction techniques, either through physically-motivated ab initio calculations, or using datadriven machine learning approaches, that are so accurate that one can rely on the predictions alone rather than on physical experiments. If this can be achieved, then we can search over materials designs in silico, find those designs that are predicted to perform best, and test those designs alone in physical experiments. For this approach to be successful, one must have extremely accurate predictions, which limits its applicability to settings where this is possible. We argue that, in contrast, predictive techniques can be extremely powerful even if they are not perfectly accurate, as long as they are used in a way that acknowledges inaccuracy, builds in robustness, and reduces this inaccuracy through an iterative dialog with physical 70 P.I. Frazier and J. Wang reality mediated by physical experiments. 
Moreover, we argue that mathematical techniques like Bayesian optimization, Bayesian experimental design, and optimal learning provide us the mathematical framework for accomplishing this goal in a principled manner, and for using our power to predict as effectively as possible. Acknowledgments Peter I. Frazier was supported by AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, NSF CAREER CMMI-1254298, NSF IIS-1247696, and the ACSF’s AVF. Jialei Wang was supported by AFOSR FA9550-12-1-0200. Derivations and Proofs This section contains derivations and proofs of equations and theoretical results found in the main text. Proof of Proposition 1 Proof Using Bayes’ rule, the conditional probability density of θ[2] at a point u [2] given that θ[1] = u [1] is p(θ[1] = u [1] , θ[2] = u [2] ) ∝ p(θ[1] = u [1] , θ[2] = u [2] ) p(θ[1] = u [1] )        1 u [1] − μ[1] T Σ[1,1] Σ[1,2] −1 u [1] − μ[1] (3.19) ∝ exp − . Σ[2,1] Σ[2,2] u [2] − μ[2] 2 u [2] − μ[2] p(θ[2] = u [2] | θ[1] = u [1] ) = To deal with the inverse matrix in this expression, we use the following identity for  A B inverting a block matrix: the inverse of the block matrix , where both A and C D D are invertible square matrices, is  A B C D −1  −(A − B D −1 C)−1 B D −1 (A − B D −1 C)−1 . = −(D − C A−1 B)−1 C A−1 (D − C A−1 B)−1  (3.20) Applying (3.20) to (3.19), and using a bit of algebraic manipulation to get rid of constants, we have   1 new T new −1 new p(θ[2] = u [2] | θ[1] = u [1] ) ∝ exp − (u [2] − μ ) (Σ ) (u [2] − μ ) , 2 (3.21) −1 −1 where μnew = μ[2] − Σ[2,1] Σ[1,1] (u [1] − μ[1] ) and Σ new = Σ[2,2] − Σ[2,1] Σ[1,1] Σ[1,2] . 3 Bayesian Optimization for Materials Design 71 We see that (3.21) is simply the unnormalized probability density function of a normal distribution. Thus the conditional distribution of θ[2] given θ[1] = u [1] is multivariate normal, with mean μnew and covariance matrix Σ new . Derivation of Equation (3.16) Since f (x) ∼Normal(μn (x), σn2 (x)), the probability density of f (x) is p( f (x) = z) = √12π exp (z − μn (x))2 /2σn (x)2 . We use this to calculate EI(x): EI(x) = E n [( f (x) − f n∗ )+ ] ' ∞ −(z−μn (x))2 1 2 = (z − f n∗ ) √ e 2σn (x) dz 2πσn (x) f n∗   ∗  ' ∞ −(z−μn (x))2 f n − μn (x) 1 2 = e 2σn (x) dz − f n∗ 1 − Φ z√ σn (x) 2πσn (x) f n∗    ∗ ' ∞ −(z−μn (x))2 f n − μn (x) 1 2 e 2σn (x) dz − f n∗ 1 − Φ = (μn (x) + (z − μn (x))) √ σn (x) 2πσn (x) f n∗    ∗ ' ∞ −(z−μn (x))2 1 f n − μn (x) 2 = e 2σn (x) dz + (μn (x) − f n∗ ) 1 − Φ (z − μn (x)) √ σn (x) 2πσn (x) f n∗    ∗ −( f n∗ −μn (x))2 f n − μn (x) 1 = σn (x) √ e 2σn (x)2 + (μn (x) − f n∗ ) 1 − Φ σn (x) 2π   ∗  ∗   f n − μn (x) f n − μn (x) ∗ = (μn (x) − f n ) 1 − Φ + σn (x)ϕ σn (x) σn (x)     μn (x) − f n∗ μn (x) − f n∗ ∗ = (μn (x) − f n )Φ + σn (x)ϕ . σn (x) σn (x) Calculation of the KG factor The KG factor (3.18) is calculated by first considering how the quantity μ∗n+1 − μ∗n depends on the information that we have at time n, and the additional datapoint that we will obtain, yn+1 . First observe that μ∗n+1 − μ∗n is a deterministic function of the vector [μn+1 (x) : x ∈ An+1 ] and other quantities that are known at time n. Then, by applying the analysis in Sect. 3.3.5, but letting the posterior given x1:n , y1:n play the role of the prior, we obtain the following version of (3.10), which applies to any given x, μn+1 (x) = μn (x) + Σn (x, xn+1 ) (yn+1 − μn (xn+1 )) . Σn (xn+1 , xn+1 ) + λ2 (3.22) 72 P.I. Frazier and J. Wang In this expression, μn (·) and Σn (·, ·) are given by (3.13) and (3.14). 
We see from this expression that μn+1 (x) is a linear function of yn+1 , with an intercept and a slope that can be computed based on what we know after the nth measurement. We will calculate the distribution of yn+1 , given what we have observed at time n. First, f (xn+1 )|x1:n , y1:n ∼ Normal (μn (xn+1 ), Σn (xn+1 , xn+1 )). Since yn+1 = f (xn+1 ) + εn+1 , where εn+1 is independent with distribution εn+1 ∼ Normal(0, λ2 ), we have   yn+1 |x1:n , y1:n ∼ Normal μn (xn+1 ), Σn (xn+1 , xn+1 ) + λ2 . Plugging the distribution of yn+1 into (3.22) and doing some algebra, we have   σ 2 (x, xn+1 ) , μn+1 (x)|x1:n , y1:n ∼ Normal μn (x),( where ( σ (x, xn+1 ) = √ Σn (x,xn+1 ) Σn (xn+1 ,xn+1 )+λ2 . Moreover, we can write μn+1 (x) as σ (x, xn+1 )Z , μn+1 (x) = μn (x) + ( ) where Z = (yn+1 − μn (xn+1 ))/ Σn (xn+1 , xn+1 ) + λ2 is a standard normal random variable, given x1:n and y1:n . Now (3.18) becomes  KGn (x) = E n    max μn (x ) + ( σ (x , xn+1 )Z | xn+1 = x − μ∗n . x  ∈An+1 Thus, computing the KG factor comes down to being able to compute the expectation of the maximum of a collection of linear functions of a scalar normal random variable. Algorithm 2 of [24], with software provided as part of the matlabKG library [53], computes the quantity  h(a, b) = E  max (ai + bi Z ) − max ai i=1,...,|a| i=1,...,|a| for arbitrary equal-length vectors a and b. Using this ability, and letting μn (An+1 ) be σ (An+1 , x) be the vector [( σ (x  , x) : x  ∈ An+1 ], the vector [μn (x  ) : x  ∈ An+1 ] and ( we can write the KG factor as   σ (An+1 , x)) + max(μn (An+1 )) − μ∗n . KGn (x) = h(μn (An+1 ),( If An+1 = An , as it is in the versions of the knowledge-gradient method proposed in [24, 25], then the last term max(μn (An+1 )) − μ∗n is equal to 0 and vanishes. 3 Bayesian Optimization for Materials Design 73 References 1. H.J. Kushner, A new method of locating the maximum of an arbitrary multi- peak curve in the presence of noise. J. Basic Eng. 86, 97–106 (1964) 2. J. Mockus, Bayesian Approach to Global Optimization: Theory and Applications (Kluwer Academic, Dordrecht, 1989) 3. J. Mockus, V. Tiesis, A. Zilinskas, The application of Bayesian methods for seeking the extremum, in Towards Global Optimisation, ed. by L.C.W. Dixon, G.P. Szego, vol. 2 (Elsevier Science Ltd., North Holland, Amsterdam, 1978), pp. 117–129 4. D.R. Jones, M. Schonlau, W.J. Welch, Efficient Global Optimization of Expensive Black-Box Functions. J. Global Optim. 13(4), 455–492 (1998) 5. A. Booker, J. Dennis, P. Frank, D. Serafini, V. Torczon, M.W. Trosset, Optimization using surrogate objectives on a helicopter test example. Prog. Syst. Control Theor. 24, 49–58 (1998) 6. J. Snoek, H. Larochelle, R.P. Adams, Practical bayesian optimization of machine learning algorithms. in Advances in Neural Information Processing Systems, pp. 2951–2959 (2012) 7. E. Brochu, M. Cora, N. de Freitas, A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-023, Department of Computer Science, University of British Columbia, November 2009 8. A. Forrester, A. Sobester, A. Keane, Engineering Design Via Surrogate Modelling: A Practical Guide (Wiley, West Sussex, UK, 2008) 9. T.J. Santner, B.W. Willians, W. Notz, The Design and Analysis of Computer Experiments (Springer, New York, 2003) 10. M.J. Sasena, Flexibility and Efficiency Enhancements for Constrained Global Design Optimization with Kriging Approximations. Ph.D. 
thesis, University of Michigan (2002) 11. D.G. Kbiob, A statistical approach to some basic mine valuation problems on the witwatersrand. J. Chem. Metall. Min. Soc. S. Afr. (1951) 12. G. Matheron, The theory of regionalized variables and its applications, vol 5. École national supérieure des mines (1971) 13. N. Cressie, The origins of kriging. Math. Geol. 22(3), 239–252 (1990) 14. C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006) 15. C.E. Rasmussen (2011), http://www.gaussianprocess.org/code, Accessed 15 July 2015 16. A.B. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis (CRC Press, Boca Raton, FL, second edition, 2004) 17. J.O. Berger, Statistical decision theory and Bayesian analysis (Springer-Verlag, New York, second edition) (1985) 18. B. Ankenman, B.L. Nelson, J. Staum, Stochastic kriging for simulation metamodeling. Oper. Res. 58(2), 371–382 (2010) 19. P.W. Goldberg, C.K.I. Williams, C.M. Bishop, Regression with input-dependent noise: a gaussian process treatment. Advances in neural information processing systems, p. 493–499 (1998) 20. K. Kersting, C. Plagemann, P. Pfaff, W. Burgard, Most likely heteroscedastic Gaussian process regression. In Proceedings of the 24th international conference on Machine learning, ACM, pp. 393–400 (2007) 21. C. Wang, Gaussian Process Regression with Heteroscedastic Residuals and Fast MCMC Methods. Ph.D. thesis, University of Toronto (2014) 22. P.I. Frazier, J. Xie, S.E. Chick, Value of information methods for pairwise sampling with correlations, in Proceedings of the 2011 Winter Simulation Conference, ed. by S. Jain, R.R. Creasey, J. Himmelspach, K.P. White, M. Fu (Institute of Electrical and Electronics Engineers Inc, Piscataway, New Jersey, 2011), pp. 3979–3991 23. S. Sankaran, A.L. Marsden, The impact of uncertainty on shape optimization of idealized bypass graft models in unsteady flow. Physics of Fluids (1994-present), 22(12):121–902 (2010) 74 P.I. Frazier and J. Wang 24. P.I. Frazier, W.B. Powell, S. Dayanik, The knowledge gradient policy for correlated normal beliefs. INFORMS J. Comput. 21(4), 599–613 (2009) 25. W. Scott, P.I. Frazier, W.B. Powell, The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM J. Optim. 21(3), 996–1026 (2011) 26. L.P. Kaelbling, Learning in Embedded Systems (MIT Press, Cambridge, MA, 1993) 27. R.S. Sutton, A.G. Barto, Reinforcement Learning (The MIT Press, Cambridge, Massachusetts, 1998) 28. J. Gittins, K. Glazebrook, R. Weber. Multi-armed Bandit Allocation Indices. Wiley, 2nd edition (2011) 29. A. Mahajan, D. Teneketzis, Multi-armed bandit problems. In D. Cochran A. O. Hero III, D. A. Castanon, K. Kastella, (Ed.). Foundations and Applications of Sensor Management. SpringerVerlag (2007) 30. D. Huang, T.T. Allen, W.I. Notz, N. Zeng, Global Optimization of Stochastic Black-Box Systems via Sequential Kriging Meta-Models. J. Global Optim. 34(3), 441–466 (2006) 31. O. Roustant, D. Ginsbourger, Y. Deville, Dicekriging, diceoptim: two R packages for the analysis of computer experiments by kriging-based metamodelling and optimization. J. Stat. Softw. 51(1), p. 54 (2012) 32. P.I. Frazier, Learning with Dynamic Programming. John Wiley and Sons (2011) 33. D. Ginsbourger, R. Riche, Towards gaussian process-based optimization with finite time horizon. mODa 9–Advances in Model-Oriented Design and Analysis, p. 89–96 (2010) 34. R. Waeber, P.I. Frazier, S.G. 
Henderson, Bisection search with noisy responses. SIAM J. Control Optim. 51(3), 2261–2279 (2013) 35. J. Xie, P.I. Frazier, Sequential bayes-optimal policies for multiple comparisons with a known standard. Oper. Res. 61(5), 1174–1189 (2013) 36. P.I. Frazier, Tutorial: Optimization via simulation with bayesian statistics and dynamic programming, in Proceedings of the 2012 Winter Simulation Conference Proceedings, ed. by C. Laroque, J. Himmelspach, R. Pasupathy, O. Rose, A.M. Uhrmacher (Institute of Electrical and Electronics Engineers Inc., Piscataway, New Jersey, 2012), pp. 79–94 37. R.A. Howard, Information Value Theory. Syst. Sci. Cybern. IEEE Trans. 2(1), 22–26 (1966) 38. C.D. Perttunen, A computational geometric approach to feasible region division inconstrained global optimization. in Proceedings of 1991 IEEE International Conference on Systems, Man, and Cybernetics, 1991.’Decision Aiding for Complex Systems, pp. 585–590 (1991) 39. B.E. Stuckman, A global search method for optimizing nonlinear systems. Syst. Man Cybern. IEEE Trans. 18(6), 965–977 (1988) 40. J. Villemonteix, E. Vazquez, E. Walter, An informational approach to the global optimization of expensive-to-evaluate functions. J. Global Optim. 44(4), 509–534 (2009) 41. D.C.T. Bautista, A Sequential Design for Approximating the Pareto Front using the Expected Pareto Improvement Function. Ph.D. thesis, The Ohio State University (2009) 42. P.I. Frazier, A.M. Kazachkov, Guessing preferences: a new approach to multi-attribute ranking and selection, in Proceedings of the 2011 Winter Simulation Conference, ed. by S. Jain, R.R. Creasey, J. Himmelspach, K.P. White, M. Fu (Institute of Electrical and Electronics Engineers Inc, Piscataway, New Jersey, 2011), pp. 4324–4336 43. J. Knowles, ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. Evol. Comput. IEEE Trans. 10(1), 50–66 (2006) 44. S.C. Clark, J. Wang, E. Liu, P.I. Frazier, Parallel global optimization using an improved multipoints expected improvement criterion (working paper, 2014) 45. D. Ginsbourger, R. Le Riche, L. Carraro, A multi-points criterion for deterministic parallel global optimization based on kriging. In International Conference on Nonconvex Programming, NCP07, Rouen, France, December 2007 46. D. Ginsbourger, R. Le Riche, and L. Carraro, Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, Springer, vol. 2, p. 131– 162 (2010) 3 Bayesian Optimization for Materials Design 75 47. A.I.J. Forrester, A. Sóbester, A.J. Keane, Multi-fidelity optimization via surrogate modelling. Proc. R. Soc. A: Math. Phys. Eng. Sci. 463(2088), 3251–3269 (2007) 48. P.I. Frazier, W.B. Powell, H.P. Simão, Simulation model calibration with correlated knowledgegradients, in Proceedings of the 2009 Winter Simulation Conference Proceedings, ed. by M.D. Rossetti, R.R. Hill, B. Johansson, A. Dunkin, R.G. Ingalls (Institute of Electrical and Electronics Engineers Inc, Piscataway, New Jersey, 2009), pp. 339–351 49. D. Huang, T.T. Allen, W.I. Notz, R.A. Miller, Sequential kriging optimization using multiplefidelity evaluations. Struct. Multi. Optim. 32(5), 369–382 (2006) 50. J. Bect, D. Ginsbourger, L. Li, V. Picheny, E. Vazquez, Sequential design of computer experiments for the estimation of a probability of failure. Stat. Comput. 22(3), 773–793 (2012) 51. J.R. Gardner, M.J. Kusner, Z. Xu, K. Weinberger, J.P. Cunningham, Bayesian optimization with inequality constraints. 
In Proceedings of The 31st International Conference on Machine Learning, pp. 937–945 (2014) 52. D.M. Negoescu, P.I. Frazier, W.B. Powell, The knowledge gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23(1) (2011) 53. P.I. Frazier (2009–2010), http://people.orie.cornell.edu/pfrazier/src.html Chapter 4 Small-Sample Classification Lori A. Dalton and Edward R. Dougherty Abstract In a number of application areas, such as materials and genomics, where one wishes to classify objects, sample sizes are often small owing to the expense or unavailability of data points. Many classifier design procedures work well with large samples but are ineffectual or, at best, problematic with small samples. Worse yet, small-samples make it difficult to impossible to guarantee an accurate error estimate without modeling assumptions, and absent a good error estimate a classifier is useless. The present chapter discusses the problem of small-sample error estimation and how modeling assumptions can be used to obtain bounds on error estimation accuracy. Given the necessity of modeling assumptions, we go on to discuss minimum-meansquare-error (MMSE) error estimation and the design of optimal classifiers relative to prior knowledge and data in a Bayesian context. 4.1 Introduction Given several classes of objects, one of the most basic problems of engineering and statistics is making a decision as to which class an object belongs to based on some set of features. The standard approach to the problem is to utilize labeled training data sampled from the class populations as inputs to a design algorithm that yields a decision function, known as a classifier. The designed classifier is then used to make decisions regarding future unlabeled observations. Classifier design alone is insufficient: one must also use sample data to estimate the error of the classifier on the class populations. A classifier whose misclassification rate is not known to some satisfactory degree of approximation is useless. L.A. Dalton (B) The Ohio State University, Columbus, OH 43210, USA e-mail: dalton@ece.osu.edu E.R. Dougherty Texas A&M University, College Station, TX 77843, USA e-mail: edward@ece.tamu.edu © Springer International Publishing Switzerland 2016 T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_4 77 78 L.A. Dalton and E.R. Dougherty In application areas where data are plentiful and cheap, one can obtain a large training sample to design a classifier and a large independent test sample on which to estimate the error by the proportion of errors on the test sample. When data are limited, not only does this impact classifier design, but it forces one to use the same data for training and testing, else one would have little hope of obtaining a good classifier. With a small training set, one might still hope to design a well-performing classifier and let the estimated error decide if it is actually good. Unfortunately, error estimation is problematic with small samples; indeed, this is the most fundamental problem with small samples, which are ubiquitous in certain application areas, for instance genomics and materials where sample sizes less than 100 are commonplace. This chapter considers small-sample classification, demonstrating the issues with purely data-driven methods, and how these can be addressed using Bayesian approaches. 
For simplicity we restrict our attention to binary classification, where there are two classes. 4.2 Classification Classification involves a feature vector X = (X 1 , X 2 , . . . , X D ) on D-dimensional Euclidean space  D composed of random variables (features), a binary random variable Y ∈ {0, 1} (0 and 1 are called labels), and a classifier ψ :  D → {0, 1} to predict Y by ψ(X). Classification is probabilistically characterized via the joint feature-label distribution F for the pair (X, Y ). The space of all classifiers, which consists of the space of all binary functions on  D , will be denoted by F . The error ε[ψ] of ψ ∈ F is the probability of misclassification, ε[ψ] = P(ψ(X) = Y ) = E[|Y − ψ(X)|], (4.1) the probability and expectation being taken relative to F. An optimal classifier, ψBayes , is one having minimal error, εBayes , among all ψ ∈ F . ψBayes and εBayes are called a Bayes classifier and the Bayes error, respectively. A Bayes classifier, which need not be unique, and the Bayes error, depend on F. Define η0 (x) = f X,Y (x, 0)/ f X (x) and η1(x) = f X,Y (x, 1)/ f X (x), where f X,Y (x, y) and f X (x) are the densities for (X, Y ) and X, respectively. The posteriors η0 (x) and η1(x) give the probability that Y = 0 and Y = 1, respectively, given X = x. Classifier error can be expressed as   ε[ψ] = η1(x) f X (x)dx + {x|ψ(x)=0} η0 (x) f X (x)dx. {x|ψ(x)=1} (4.2) 4 Small-Sample Classification 79 The right-hand side of (4.2) is minimized by  ψBayes (x) = 1, if η1(x) ≥ η0 (x) . 0, otherwise (4.3) It follows from (4.2) and (4.3) that the Bayes error is given by  εBayes =  η1(x) f X (x)dx + {x|η1(x)<η0 (x)} η0 (x) f X (x)dx (4.4) {x|η1(x)≥η0 (x)} = E [min{η0 (X), η1(X)}] . By Jensen’s inequality, εBayes ≤ min{E[η0 (X)], E[η1(X)]} = min{P(Y = 0), P(Y = 1)} , where P(Y = y) is the prior probability that a sample point is from class y. Thus, if either prior is small, then the Bayes error is necessarily small. This occurs if one class is much more likely than the other. Each class, y ∈ {0, 1}, is described by its class-conditional distribution f X|Y (x|y). In the Gaussian model, each sample point in a given class is a column vector of D multivariate Gaussian features. In particular, the class-conditional distribution for class y is Gaussian with mean μ y and covariance matrix Σ y . Letting c = P(Y = 0), the optimal classifier is quadratic and given by  1, if gBayes (x) > 0 , (4.5) ψBayes (x) = 0, if gBayes (x) ≤ 0 where T x + bBayes , gBayes (x) = xT ABayes x + aBayes (4.6) with constant matrix ABayes , column vector aBayes and scalar bBayes given by  1  −1 Σ1 − Σ0−1 , 2 = Σ1−1 μ1 − Σ0−1 μ0 , ABayes = − aBayes bBayes     1 − c |Σ0 | 1/2 1  T −1 T −1 = − μ1 Σ1 μ1 − μ0 Σ0 μ0 + ln . 2 c |Σ1 | (4.7) When Σ = Σ0 = Σ1 , this classifier is linear and defined by T x + bBayes , gBayes (x) = aBayes (4.8) 80 L.A. Dalton and E.R. Dougherty where aBayes = Σ −1 (μ1 − μ0 ) , 1 T 1−c . bBayes = − aBayes (μ1 + μ0 ) + ln 2 c (4.9) In practice, the feature-label distribution is unknown and a classifier is designed from sample data. A common assumption, and one we make here, is that a classifier ψn is designed using a random sample Sn = {(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn )} of vector-label pairs drawn from the feature-label distribution. While random sampling is usually assumed, in some applications sampling is often not random and this leads to misapplication of classification theory developed in the framework of random sampling. 
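Under the Gaussian model, the optimal classifier (4.5)-(4.7) can be written down directly. The following minimal sketch (our own function and variable names) returns ψ_Bayes as a function, given known class means, covariances, and prior c = P(Y = 0).

import numpy as np

def gaussian_bayes_classifier(mu0, Sigma0, mu1, Sigma1, c):
    # Optimal classifier (4.5)-(4.7) for two Gaussian classes with known
    # means mu_y, covariances Sigma_y, and prior c = P(Y = 0).
    S0inv, S1inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
    A = -0.5 * (S1inv - S0inv)
    a = S1inv @ mu1 - S0inv @ mu0
    b = (-0.5 * (mu1 @ S1inv @ mu1 - mu0 @ S0inv @ mu0)
         + np.log(((1.0 - c) / c)
                  * np.sqrt(np.linalg.det(Sigma0) / np.linalg.det(Sigma1))))
    def psi(x):
        g = x @ A @ x + a @ x + b    # the quadratic discriminant (4.6)
        return 1 if g > 0 else 0
    return psi

Setting Σ0 = Σ1 = Σ reduces the discriminant to the linear form (4.8)-(4.9). In practice these parameters and c are unknown and must be estimated from the sample, and, as just noted, the resulting rule is only justified when the sampling mechanism matches the assumptions under which it was derived.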
For instance, with separate sampling data are drawn randomly from each class but the number from each class is set outside the sampling procedure. Separate sampling is common in biomedical areas such as genomics and this can lead to serious problems for both classifier design [1, 2] and error estimation [3] if not taken into account. Classifier design requires a procedure that operates on a sample to yield a classifier. A classification rule is a mapping Ψn : [ D × {0, 1}]n → F . Given a sample Sn , Ψn yields a designed classifier ψn = Ψn (Sn ) ∈ F . To be fully formal, one might write ψn (Sn ; X) rather than ψn (X); however, we will use the simpler notation, keeping in mind that ψn derives from a classification rule applied to a feature-label sample. Note that a classification rule is really a sequence of classification rules, each depending on the sample size, n. Under a Gaussian assumption in which the means and covariances are unknown, a natural classification rule when the covariance matrices are unequal is to replace the means and covariances in (4.7) with the sample means and covariances computed from sample data and to replace c by the estimate ĉ = n 0 /n, where n y is the number of sample points in class y. This yields the quadratic discriminant analysis (QDA) classification rule. Assuming equal covariance matrices, the linear discriminant analysis (LDA) classification rule results from replacing the means and covariance in (4.9) by the sample means and a pooled sample covariance matrix, and replacing c by ĉ = n 0 /n. Although constructed under the Gaussian assumption, QDA and LDA can be used without a Gaussian assumption and may perform fairly well so long as the true class-conditional distributions are not too far from Gaussian, and the sample is sufficiently large that the sample estimates are accurate. Since the optimal error is the Bayes error, sample-based design suffers a design cost, Δn = εn − εBayes , where εn = ε[ψn ] and εn and Δn are sample-dependent random variables. The expected design cost is E[Δn ], the expectation here being relative to the random sample drawn from F. The expected error of ψn is decomposed according to E[εn ] = εBayes + E[Δn ]. A classification rule is said to be consistent for a feature-label distribution F if Δn → 0 in the mean, meaning E[Δn ] → 0 as n → ∞. For a consistent rule, the expected design cost can be made arbitrarily small 4 Small-Sample Classification 81 for a sufficiently large amount of data. A classification rule is universally consistent if Δn → 0 in the mean for any feature-label distribution of (X, Y ). Consistency is useful for large samples, but has negligible value for small samples. A classification rule can yield a classifier that makes few errors, or even no errors, on the training data but performs poorly on the distribution as a whole, a situation called overfitting. This situation is exacerbated by the use of complex classifiers with small samples. The essential idea is that a classifier should not cut up the space too finely for the amount of training data. Overfitting can be mitigated by constraining classifier design, which means restricting classifiers to a subfamily C ⊆ F . The aim is to find an optimal constrained classifier ψC ∈ C having error εC . Since optimization in C is over a subfamily of classifiers, εC ≥ εBayes . The cost of constraint is ΔC = εC − εBayes ≥ 0. When only data is available, a classification rule yields a classifier ψn,C ∈ C , with error εn,C such that εn,C ≥ εC ≥ εBayes . 
The design cost for constrained classification is Δn,C = εn,C − εC . For small samples, this can be substantially less than Δn , depending on C and the classification rule. For instance, although LDA is constructed under the assumption of equal covariance matrices, with small samples it can outperform QDA when the covariance matrices are unequal because it only requires estimation of a single covariance matrix rather than two. The error of a designed constrained classifier is decomposed as εn,C = εBayes + ΔC + Δn,C . Hence, the expected error of a constrained designed classifier can be decomposed as E[εn,C ] = εBayes + ΔC + E[Δn,C ]. (4.10) The constraint is beneficial if and only if E[εn,C ] < E[εn ], that is, if ΔC < E[Δn ]− E[Δn,C ]. If the cost of constraint is less than the decrease in expected design error, then E[εn,C ] < E[εn ]. The dilemma is that strong constraint reduces E[Δn,C ] at the cost of increasing εC . A fundamental theorem provides bounds for E[Δn,C ] [4]. The idea of choosing a classifier in C that minimizes the number of errors on the sample data is known as empirical risk minimization. A distribution-free bound on the design error for any classification rule that employs empirical risk minimization is given by E[Δn,C ] ≤ 8 VC log n + 4 , 2n (4.11) where VC is a constant known as the VC (Vapnik-Chervonenkis) dimension of C (see [5] for a detailed discussion of the VC dimension). It is obvious that n must greatly exceed VC for the bound to be small. 82 L.A. Dalton and E.R. Dougherty 4.3 Error Estimation With the feature-label distribution unknown, the classifier error must be estimated by an estimation rule, Ξn , which given the random sample Sn yields an error estimate ε̂[ψn ] = Ξn (Sn ). The key issue is accuracy. Given a feature-label distribution, error estimation accuracy is commonly measured by the mean-square error (MSE), MSE(ε̂) = E[(ε̂ − ε)2 ], where for notational ease we denote ε[ψn ] and ε̂[ψn ] by ε and ε̂, respectively. The square root of the MSE is known as the root-mean-square (RMS). The expectation is relative to the sampling distribution. The MSE is decomposed into the bias, Bias(ε̂) = E[ε̂ − ε], of the error estimator relative to the true error, and the deviation variance, Var dev (ε̂) = Var(ε̂ − ε), according to MSE(ε̂) = Var dev (ε̂) + Bias(ε̂)2 . (4.12) When a large amount of data is available, the sample can be split into independent training and test sets, the error being estimated by the proportion of errors on the test data. √ For this holdout estimate, we have the distribution-free bound RMS(ε̂holdout ) ≤ 1/ 4m, where m is the size of the test sample [6]. For m = 100, and any feature-label distribution, F, we have that RMS(ε̂holdout ) ≤ 0.05. With small samples, training and error estimation must take place on the same data set. The consequences of training-set error estimation are seen in the following formula for the deviation variance: Var dev (ε̂) = σε̂2 + σε2 − 2ρσε̂ σε , (4.13) where σε̂2 , σε2 , and ρ are the variance of the error estimate, the variance of the error, and the correlation between the estimated and true errors, respectively. The deviation variance is driven down by small variances or a correlation coefficient near 1. Unfortunately, for small samples, precisely the situation when one wishes to use training-set error estimation, neither condition typically holds. Consider the popular cross-validation error estimator. 
For it, the error is estimated on the training data by randomly splitting the training data into k folds (subsets), Sni , for i = 1, 2, . . . , k, training k classifiers on Sn − Sni , for i = 1, 2, . . . , k, calculating the proportion of errors of each designed classifier on the appropriate left-out fold, and then averaging these proportions to obtain the cross-validation estimate of the originally designed classifier. Various enhancements are made, such as by repeating the process some number of times and averaging. Letting k = n yields the leaveone-out estimator. The problem with cross-validation is that, for small samples, it typically has large variance and little correlation with the true error. Hence, although with large number of folds cross-validation does not suffer too much from bias, it typically has large deviation variance. To illustrate with a materials dataset, consider predicting the formability of ABO3 cubic perovskites. A dataset of 223 binary oxide systems, 34 of which can form cubic perovskites, is available in [7]. From this dataset we use two features that have been 4 Small-Sample Classification 83 shown to be predictive of formability: the octahedral factor and tolerance factor. We emulate the classification and error estimation procedure by drawing a small subset of examples from the full dataset for training, while using the left out points to estimate the ground truth true error. In particular, suppose that only 50 of the 223 compounds in the full dataset are available for classifier training, 8 of which can form a cubic structure and 42 cannot (the proportion is kept close to that of the full dataset). We train a radial-basis-function support vector machine (RBF-SVM) classifier on the 50 training points, use the same 50 points to estimate the error of this classifier using cross-validation with 5 folds and 10 repetitions, and approximate the true error rate of this classifier by evaluating the proportion of misclassified points among the 173 points left out of training (note√the distribution free bound on the RMS of holdout applies here with RMS ≤ 1/ 4 × 173 ≈ 0.038). We repeat this process 10,000 times to emulate the sampling procedure, each time drawing a different training set of 50 points. A scatter plot of the cross-validation error estimates and true errors is shown in Fig. 4.1, along with the least-squares regression line. The mean of the true errors and cross-validation estimates is indicated by a solid triangle, which shows that the cross-validation estimate is approximately unbiased (in fact, slightly highbiased). Because the class sizes are so unbalanced, the classifier error should be small, in particular, if we assume that P(Y = 0) ≈ 34/223 ≈ 0.15 then εBayes is upper bounded by min{0.15, 0.85} = 0.15. Relative to the small true error, the dispersion of the scatter plot is very large. Moreover, the regression line has a slightly negative slope, certainly not a desirable property if one is going to estimate the true error by the cross-validation estimate. What we observe in Fig. 4.1 is typical for small samples: large variance [8] and negligible regression between the true and estimated errors [9]. As seen, negatively sloping regression lines are possible; indeed, for cross-validation, negative correlation between the true and cross-validation estimated errors has been mathematically demonstrated in some basic models [10]. Such error estimates are worthless and 0.2 0.18 0.16 0.14 true error Fig. 
4.1 Scatter plot and linear regression between cross-validation (horizontal axis) and the true error (vertical axis) with sample size 50 for RBF-SVM classification of the formability of ABO3 cubic perovskites 0.12 0.1 0.08 0.06 0.04 0.02 0 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 cross−validation error estimate 0.18 0.2 84 L.A. Dalton and E.R. Dougherty can result in various problems that may not be immediately recognized: lack of reproducibility [11], optimistic bias when evaluating performance over several data sets [12], optimistic bias when considering several classification rules for a single classification problem [13], and inaccurate ROC curves [14]. Optimistic bias occurs because the high variance of the estimator gives a wide array of optimistic and pessimistic estimates when using different data sets or different classification rules, so that when one chooses the apparent best, he merely selects the one most optimistically biased—and bias can be severe. 4.4 Validity A pattern recognition model (ψ, εψ ) consists of a classifier ψ and an error rate εψ , where εψ is simply a real number between 0 and 1. Intuitively, one might wish to say that (ψ, εψ ) is valid for the feature-label distribution F to the extent that εψ approximates the classifier error, ε[ψ], on F, where the degree of approximation is measured by some distance between εψ and ε[ψ]. For a classifier ψn designed from a specific sample, this would mean that we want to measure some distance between the true error ε = ε[ψn ] and the estimated error ε̂ = ε̂[ψn ], say |ε − ε̂|. To know the true error we would need to know F, but if we knew F then we would use the Bayes classifier and not design a classifier from sample data. Since it is the precision of the error estimate that is of consequence, a natural way to proceed would be to characterize validity in terms of the precision of the error estimator ε̂[ψn ] = Ξn (Sn ) as an estimator of ε[ψn ], say by RMS(ε̂). This makes sense because the RMS measures the closeness of ε̂ and ε across the sampling distribution. However, to compute the RMS again we need to know F, which we do not know. One way to proceed is to find a distribution-free bound on the RMS. For instance, for the leave-one-out error estimator with the discrete histogram rule and tie-breaking in the direction of class 0 [6], RMS(ε̂loo ) ≤ 6 1 + 6/e . +√ n π (n − 1) (4.14) The discrete histogram rule applies to a finite sample space {1, 2, . . . , b}, and defines ψn (i) = 0 if training samples with value i are labeled as class 0 at least as often as they are labeled class 1, and ψn (i) = 1 otherwise. Although this bound is distributionfree, it is useless for small samples: for n = 200 this bound is 0.506. In general, there are very few cases in which distribution-free bounds are known and, when they are known, they are useless for small samples. Distribution-based bounds on the RMS are needed, which requires knowledge concerning the second-order moments of the joint distribution between the true and estimated errors. More generally, to fully understand an error estimator we need to know its joint distribution with the true error. Given that a classifier is epistemologically vacuous absent an accurate estimate of its error, one might think that over the 4 Small-Sample Classification 85 years much effort would have gone into studying the moments of the joint distribution between the true and estimated errors, especially the mixed second moment; however, this has not been the case. 
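Although the moments of this joint distribution are difficult to obtain analytically, its qualitative behavior is easy to examine empirically in a synthetic model. The following sketch, in the spirit of the perovskite experiment above but not a reproduction of it, repeatedly draws a small training sample from a two-class Gaussian model (with illustrative parameters of our own choosing), trains LDA, estimates the error by 5-fold cross-validation using scikit-learn, and approximates the corresponding true error on a large held-out sample.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
D, n, n_test, n_rep = 2, 50, 5000, 500
mu0, mu1 = np.zeros(D), np.full(D, 1.0)      # illustrative class means

def sample(m):
    # Draw m labeled points from the two-class Gaussian model (equal priors).
    y = rng.integers(0, 2, size=m)
    X = rng.normal(size=(m, D)) + np.where(y[:, None] == 0, mu0, mu1)
    return X, y

X_test, y_test = sample(n_test)              # large sample stands in for the true error
true_err, cv_est = [], []
for _ in range(n_rep):
    X, y = sample(n)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    true_err.append(np.mean(clf.predict(X_test) != y_test))
    cv_est.append(1.0 - cross_val_score(clf, X, y, cv=5).mean())

true_err, cv_est = np.array(true_err), np.array(cv_est)
print("bias:", cv_est.mean() - true_err.mean())
print("correlation:", np.corrcoef(cv_est, true_err)[0, 1])
print("RMS:", np.sqrt(np.mean((cv_est - true_err) ** 2)))

Typical output shows small bias but weak correlation and a sizable RMS, the small-sample pattern described above; characterizing such behavior analytically requires the moment results reviewed next.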
Going back half a century, there were some results on the mean and variance of some error estimators for the Gaussian model using LDA. In 1966, Hills obtained the expected value of the resubstitution and plug-in estimators in the univariate model with known common variance [15]. The resubstitution error estimate is simply a count of the classification errors on the training data. The plug-in estimate is found by using the data to estimate the feature-label distribution and then finding the error of the designed classifier on the estimated distribution. In 1972, Foley obtained the expected value of resubstitution in the multivariate model with known common covariance matrix [16]. In 1973, Sorum derived results for the expected value and variance for both resubstitution and leave-one-out in the univariate model with known common variance [17]. In 1973, McLachlan derived an asymptotic representation for the expected value of resubstitution in the multivariate model with unknown common covariance matrix [18]. In 1975, Moran obtained new results for the expected value of resubstitution and plug-in in the multivariate model with known covariance matrix [19]. In 1977, Goldstein and Wolf obtained the expected value of resubstitution for multinomial discrimination [20]. In 1992, Davison and Hall derived asymptotic representations for the expected value and variance of bootstrap and leave-one-out in the univariate Gaussian model with unknown and possibly different covariances [21]. Prior to 2005, we know of no other paper providing analytic results for moments of common error estimators. In total, none of these papers provide representation of the joint distribution or representation of second-order mixed moments, which are needed for the RMS.

Motivated by small samples ubiquitous in genomics, efforts commenced to obtain these types of representations, in particular, for the resubstitution and leave-one-out estimators. For the multinomial model, complete enumeration was used to obtain marginal distributions for both error estimators [10], followed by the full joint distributions [22]. Subsequently, exact closed-form representations for second-order moments, including the mixed moments, were obtained, thereby providing exact RMS representations for both estimators [10]. For the Gaussian model using LDA, in 2009 exact marginal distributions for both estimators in the univariate model (with known but not necessarily equal class variances) and approximations in the multivariate model (with known and equal class covariance matrices) were obtained [23]. Subsequently, these were extended to joint distributions of the true and estimated errors in a Gaussian model [24]. Recently, exact closed-form representations for the second-order moments in the univariate model without assuming equal covariances were discovered, thereby providing exact expressions of the RMS for both estimators [25]. Moreover, double asymptotic representations for the second-order moments in the multivariate model, sample size and dimension approaching infinity at a fixed rate between the two, were found, thereby providing double asymptotic expressions for the RMS [26]. Finite-sample approximations from the double asymptotic method have been shown to possess better accuracy than various simple asymptotic representations, although much more work is needed on this issue [27, 28].
Fig. 4.2 a RMS (y-axis) as a function of the Bayes error (x-axis) for leave-one-out with dimension D = 10 and sample sizes n = 20, 40, 60; b maxBayes(λ) (y-axis) as a function of RMS (x-axis) corresponding to the RMS curves in part (a)

To utilize mixed-moment theory, prior knowledge is required, in the sense that the actual (unknown) feature-label distribution belongs to some uncertainty class, U, of feature-label distributions. Once RMS representations have been obtained for feature-label distributions in U, distribution-based RMS bounds follow: RMS(ε̂) ≤ max_{G∈U} RMS(ε̂|G), where RMS(ε̂|G) is the RMS of the error estimator under the assumption that the feature-label distribution is G. We do not know the actual feature-label distribution precisely, but prior knowledge allows us to bound the RMS. For instance, consider using LDA with a feature-label distribution having two equally probable Gaussian class-conditional densities sharing a known covariance matrix. For this model the Bayes error is a one-to-one decreasing function of the distance, m, between the means. Figure 4.2a shows the RMS to be a one-to-one increasing function of the Bayes error for leave-one-out in dimension D = 10 and sample sizes n = 20, 40, 60, the RMS and Bayes errors being on the y and x axes, respectively.

Assuming a parameterized model in which the RMS is an increasing function of the Bayes error, εBayes, we can pose the following question: Given sample size n and λ > 0, what is the maximum value, maxBayes(λ), of the Bayes error such that RMS(ε̂) ≤ λ? If RMS is the measure of validity and λ represents the largest acceptable RMS for the classifier model to be considered meaningful, then the epistemological requirement is characterized by maxBayes(λ). Given the relationship between model parameters and the Bayes error, the inequality εBayes ≤ maxBayes(λ) can be solved in terms of the parameters to arrive at a necessary modeling assumption. In the preceding Gaussian example, since εBayes is a decreasing function of m, we obtain an inequality m ≥ m(λ). Figure 4.2b shows the maxBayes(λ) curves corresponding to the RMS curves in Fig. 4.2a [29]. These curves show that, assuming Gaussian class-conditional densities and a known common covariance matrix, further assumptions must be made to ensure that the RMS is sufficiently small to make the classifier model meaningful.

To have scientific content, small-sample classification requires prior knowledge. Regarding the feature-label distribution there are two extremes: (1) the feature-label distribution is known, in which case the entire classification problem collapses to finding a Bayes classifier and Bayes error, so there is no classifier design or error estimation issue; and (2) the uncertainty class consists of all feature-label distributions, the distribution-free case, and we typically have no bound on performance, or one that is too loose for practice. In the middle ground, there is a trade-off between the size of the uncertainty class and the size of the sample. The uncertainty class must be sufficiently constrained (equivalently, the prior knowledge must be sufficiently great) that an acceptable bound can be achieved with an acceptable sample size.

We have focused on cross-validation for two reasons: (1) it is probably the most commonly used training-data-based error estimator and (2) its moments, along with resubstitution, are the most studied. Another often employed re-sampling-based error estimator is the bootstrap [30].
It generally has smaller variance than cross-validation; however, it can suffer from significant bias, depending on the feature-label distribution and classification rule. Analytic representation of bootstrap expectation in the Gaussian model with LDA classification has recently been found and, since the bootstrap has a weighting parameter, under these conditions it can be weighted to be unbiased [31]. In general, using its free weight together with the fact that the bootstrap is formed by a convex combination, given the model and the classification rule, the Lagrangian multiplier technique can be used to determine a weight that minimizes the RMS between this optimized bootstrap and the true error [32].

Given that one needs a distributional model to assure satisfactory performance for classifier error estimation, a natural way to proceed is to define a prior distribution over the uncertainty class of feature-label distributions and then find an optimal minimum-mean-square-error (MMSE) error estimator relative to the prior [33]. This results in a Bayesian approach with the uncertainty class governed by the prior distribution and the data being used to construct a posterior distribution that quantifies everything we know about the feature-label distribution. In this way we can incorporate prior knowledge in the whole classification procedure, both classifier design and error estimation.

4.5 MMSE Error Estimation

If the class-conditional distribution for class y, denoted f_{θ_y}(x|y), is parameterized by θ_y, then the feature-label distribution is completely specified by the modeling parameters θ = [c, θ_0, θ_1], where c = P(Y = 0). Writing the parameter space of θ_y as Θ_y, the parameter space of θ is Θ = [0, 1] × Θ_0 × Θ_1. We denote the prior distribution on θ by π(θ) and the posterior, derived from a random sample of size n with n_y points from class y, by π*(θ). Here we assume that c is independent from θ_0 and θ_1 prior to observing the data and denote its prior by π(c). Under a given sampling method, the posterior π*(c) for c may be obtained from the number of sample points in each class using Bayes' rule. For instance, under random sampling and assuming a beta(α⁰, α¹) prior for c, the posterior of c is also beta with hyperparameters α⁰ + n_0 and α¹ + n_1. In particular, letting B be the beta function,

π*(c) = c^(α⁰+n_0−1) (1 − c)^(α¹+n_1−1) / B(α⁰ + n_0, α¹ + n_1),   (4.15)

E_π*[c] = (n_0 + α⁰) / (n + α⁰ + α¹),   (4.16)

where E_π* represents expectation relative to the posterior (conditioned on the sample). A uniform prior on c is achieved with α⁰ = α¹ = 1.

The Bayesian framework in [33, 34] not only assumes that c is independent, but that c, θ_0 and θ_1 are all independent prior to observing the data. Writing the prior for θ_y as π(θ_y), this means that π(θ) = π(c) π(θ_0) π(θ_1). We also write the posterior as π*(θ_y), where it has been shown that independence is preserved after observing the data, that is, π*(θ) = π*(c) π*(θ_0) π*(θ_1). π*(θ_y) is proportional to the product of the prior and a likelihood function for sample points observed from the corresponding class:

π*(θ_y) ∝ π(θ_y) ∏_{i=1}^{n_y} f_{θ_y}(x_i^y | y),   (4.17)

where x_i^y is the ith sample point in class y and the constant of proportionality is found by normalizing the integral of π*(θ_y) to 1.
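For the class-0 prior probability c itself, the update in (4.15) and (4.16) is a one-line computation. The sketch below is my own illustration; the prior hyperparameters and class counts are arbitrary (the counts echo the perovskite example above).

import numpy as np
from scipy.stats import beta

alpha0, alpha1 = 1.0, 1.0            # uniform prior on c
n0, n1 = 8, 42                       # observed class counts (illustrative)
n = n0 + n1

posterior_c = beta(alpha0 + n0, alpha1 + n1)        # the beta density of (4.15)
E_c = (n0 + alpha0) / (n + alpha0 + alpha1)         # posterior mean, (4.16)

print("E[c | sample] =", E_c)                       # 9/52, about 0.173
print("agrees with scipy:", np.isclose(posterior_c.mean(), E_c))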
When the prior is a proper density, this follows from Bayes' rule; if π(θ_y) is improper (i.e., if the integral of π(θ_y) cannot be normalized to 1), then this is taken as a definition, but in all cases it is mandatory that π*(θ_y) be a proper density. Priors quantify the information we have about the distribution before observing the data. We have the option of using flat, or non-informative, priors, as long as the posterior is a valid density function. Alternatively, informative priors can supplement the classification problem with additional information.

The Bayesian model characterizes our initial uncertainty in the actual distribution through the prior. As we observe sample points, this uncertainty should converge to a certainty on the true distribution. More precisely, it has been proven in [35] that under mild regularity conditions, the posteriors converge to a point mass at the true parameters for an independent covariance Gaussian model, which we will discuss shortly. More informative priors may help the posteriors converge faster, but, essentially, as long as the prior does not assign zero probability to any neighborhood around the true distribution, convergence is assured.

The Bayesian model defines priors on the feature-label distribution itself; nevertheless, the posteriors of the distribution parameters imply a (sample-conditioned) distribution on the true classifier error. This randomness in the true error comes from our uncertainty in the underlying feature-label distribution (given the sample), which is in contrast to the classical analysis discussed in previous sections, where randomness in the true error for a fixed distribution comes only from randomness in the trained classifier through the sampling distribution. In addition, we may speak of moments of the true error for a fixed sample and classifier.

The true error of a designed classifier ψn may be decomposed as

ε(θ, ψn) = c ε_0(θ_0, ψn) + (1 − c) ε_1(θ_1, ψn),   (4.18)

where ε_y(θ_y, ψn) is the probability that ψn mislabels a class-y point under true parameter θ_y. Since the Bayesian framework quantifies uncertainty in the feature-label distribution parameters, we may find the MMSE estimate of the true error, ε̂(ψn, Sn), which is equal to the first moment of the true error conditioned on the observed sample [33]. We call this the Bayesian error estimate. As long as c is independent from θ_0 and θ_1 a posteriori,

ε̂(ψn, Sn) = E_π*[ε(θ, ψn)] = E_π*[c] ε̂_0(ψn, Sn) + (1 − E_π*[c]) ε̂_1(ψn, Sn),   (4.19)

where ε̂_y(ψn, Sn) = E_π*[ε_y(θ_y, ψn)] is the posterior expected error contributed by class y. Both ε̂ and ε̂_y are functions of the classifier ψn and of the sample via π*. The expectation of c depends on our prior model for c, but is straightforward to find analytically. For example, if c is fixed, then the expectation can be replaced with the fixed value of c, and if c has a beta(α⁰, α¹) prior, then E_π*[c] is available in (4.16). Representation for ε̂_y(ψn, Sn) is known for the discrete and independent covariance Gaussian models [33, 34]. Owing to convergence of the posteriors, classical frequentist consistency holds for Bayesian error estimators in both models for any fixed distribution in the parameterized family [35].

We next present an example illustrating the optimal performance of MMSE error estimation. Consider a D = 5 dimensional Gaussian model with a uniform prior on c and independent arbitrary covariance matrices.
In particular, we assume normal-inverse-Wishart priors with hyperparameters ν_y = κ_y = 25, m_0 = [0, 0, 0, 0, 0], m_1 = [1, 0, 0, 0, 0], and S_y = 13.19 I_5, where I_D is a D × D identity matrix. This is a moderately informative prior where the expected mean of class y is m_y and the expected covariance for both classes is 0.74132 I_5. We generate 100,000 feature-label distributions from the prior, each including a random realization for c and random μ_y and Σ_y pairs for each class y ∈ {0, 1}. For each fixed feature-label distribution, we generate 10 samples of a given size n ranging from 30 to 200, first determining the number of points in each class by drawing n_0 from a binomial(n, c) distribution, and then, for each class, drawing the appropriate number of i.i.d. points from a Gaussian distribution with the corresponding mean and covariance pair.

From each sample we train an LDA classifier, we evaluate the true error of the trained classifier under the corresponding true feature-label distribution, and we estimate the error of this classifier using four training-data-based methods: the MMSE error estimator (Bayes), resubstitution (resub), cross-validation (cv), and bolstered resubstitution (bol). Bolstered resubstitution is similar to resubstitution except that each point of the training set is replaced with a density kernel and the error is estimated by integrating each kernel over the classifier decision region disagreeing with the label at the point, thereby "spreading" the incorrect mass and giving more error weight to incorrectly labeled points near the decision boundary (see [36] for details). We then approximate an RMS for each error estimator, that is, we evaluate the square root of the average square difference between each error estimator and the true error, where the average is taken over all 100,000 feature-label distributions and 10 samples.

A graph of the RMS with respect to sample size is provided in Fig. 4.3.

Fig. 4.3 RMS deviation from true error for linear classification of Gaussian distributions, averaged over all distributions and samples using a proper prior with D = 5

The performance of the MMSE error estimator here, averaged over all distributions and samples under the assumed prior, is optimal, outperforming all other error estimators, as it must. This does not mean that performance is optimal for any fixed feature-label distribution, only that it is optimal on average.
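When closed-form expressions are not being used, the Bayesian error estimate in (4.19) can also be approximated by Monte Carlo: draw c from its posterior, draw (μ_y, Σ_y) from the normal-inverse-Wishart posteriors, evaluate the class-conditional errors of the fixed classifier for each draw, and average. The sketch below is my own illustration of that idea only; the linear classifier, the beta posterior for c, and the posterior hyperparameters (loosely borrowed from the two-dimensional example of Sect. 4.8) are assumptions, and the estimators studied in the chapter are analytic, not Monte Carlo.

import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
D = 2

# Hypothetical posterior hyperparameters (nu*, m*, kappa*, S*) for each class.
post = {0: dict(nu=40.0, m=np.zeros(D), kappa=40.0, S=37.0 * np.eye(D)),
        1: dict(nu=4.0,  m=np.ones(D),  kappa=4.0,  S=1.0 * np.eye(D))}
a0, a1 = 1 + 40, 1 + 4          # beta posterior for c (uniform prior, n0 = 40, n1 = 4)

# A fixed linear classifier psi(x) = 0 iff w.x + b >= 0 (an arbitrary choice).
w, b = np.array([-1.0, -1.0]), 1.0
def psi(x):
    return (x @ w + b < 0).astype(int)

def class_error_draw(y, n_mc=2000):
    # Draw theta_y from its posterior, then estimate eps_y(theta_y, psi) by sampling X.
    # scipy's inverse-Wishart uses the same (df, scale) parameterization as (4.26): df = kappa*, scale = S*.
    p = post[y]
    Sigma = invwishart.rvs(df=p["kappa"], scale=p["S"], random_state=rng)
    mu = rng.multivariate_normal(p["m"], Sigma / p["nu"])
    X = rng.multivariate_normal(mu, Sigma, size=n_mc)
    return np.mean(psi(X) != y)

n_post = 500
c = rng.beta(a0, a1, size=n_post)
eps0 = np.array([class_error_draw(0) for _ in range(n_post)])
eps1 = np.array([class_error_draw(1) for _ in range(n_post)])
print("Monte Carlo Bayesian error estimate:", np.mean(c * eps0 + (1 - c) * eps1))

Because c is independent of (θ_0, θ_1) a posteriori, averaging c·ε_0 + (1 − c)·ε_1 over independent draws converges to the right-hand side of (4.19).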
4.6 Optimal Bayesian Classification

An optimal Bayesian classifier (OBC) ψ_OBC is any classifier satisfying

E_π*[ε(θ, ψ_OBC)] ≤ E_π*[ε(θ, ψ)]   (4.20)

for all ψ ∈ C, where C is a family of classifiers. Under the Bayesian framework, P(ψ(X) ≠ Y | Sn) = E_π*[P(ψ(X) ≠ Y | θ, Sn)] = E_π*[ε(θ, ψ)] = ε̂(ψ, Sn). Thus, optimal Bayesian classifiers minimize the misclassification probability relative to the assumed model or, equivalently, minimize the Bayesian error estimate.

The following representation of the Bayesian error estimator facilitates a straightforward approach for finding an OBC [33, 34]: If ψ is a fixed classifier defined by ψ(x) = 0 if x ∈ R_0 and ψ(x) = 1 if x ∈ R_1, where R_0 and R_1 are measurable sets partitioning the sample space, then the Bayesian error estimator is given by

ε̂(ψ, Sn) = E_π*[c] ∫_{R_1} f(x|0) dx + (1 − E_π*[c]) ∫_{R_0} f(x|1) dx   (4.21)
         = ∫_{ℝ^D} [E_π*[c] f(x|0) I_{x∈R_1} + (1 − E_π*[c]) f(x|1) I_{x∈R_0}] dx,   (4.22)

where I_E is an indicator function equal to one if E is true and zero otherwise, and

f(x|y) = ∫_{Θ_y} f_{θ_y}(x|y) π*(θ_y) dθ_y,   (4.23)

which is called the effective class-conditional density with respect to the posterior.

An OBC can be found by brute force using the closed-form solutions for the expected true error (the Bayesian error estimator), when available; however, if C is the set of all classifiers (with measurable decision regions), then an OBC, in the presence of model uncertainty, can be found analogously to a Bayes classifier under a known feature-label distribution. To wit, an OBC relative to the set of all classifiers with measurable decision regions exists and is given pointwise by [37]

ψ_OBC(x) = 0 if E_π*[c] f(x|0) ≥ (1 − E_π*[c]) f(x|1), and ψ_OBC(x) = 1 otherwise.   (4.24)

To find an OBC we can average the class-conditional densities f_{θ_y}(x|y) relative to the posterior distribution to obtain the effective class-conditional density, f(x|y), whereby an OBC is found via (4.24). Essentially, the OBC is the Bayes classifier using f(x|0) and f(x|1) as the true class-conditional distributions.

In regard to both optimal Bayesian classification and MMSE error estimation, f(x|y) contains all of the necessary information in the model about the class-conditional distributions and we do not have to deal with the priors directly. Upon defining a model, we find f(x|y), which depends on the sample because it depends on π*, and then several problems are solved by treating f(x|y) as the true distribution: optimal (unconstrained) classification, the optimal error estimate for the optimal classifier, and the optimal error estimate for arbitrary classifiers. Henceforth, we will only consider optimal Bayesian classifiers over the space of all classifiers. Moreover, note that if E_π*[c] = 0 then the OBC is a constant given by ψ_OBC = 1, and if E_π*[c] = 1 then ψ_OBC = 0.

4.7 The Gaussian Model

In the Gaussian model, the uncertainty class is determined by the parameters θ_y = [μ_y, Λ_y], where μ_y is the mean of the class-conditional distribution and Λ_y is a collection of parameters that determine the covariance matrix, Σ_y, of the class. By defining Σ_y as a function of Λ_y, we may impose a structure on the covariance. Three types of models are considered in [37]: a fixed covariance model (Σ_y = Λ_y is known perfectly), a scaled identity covariance model having uncorrelated features with equal variances (Λ_y = σ_y² is a scalar and Σ_y = σ_y² I_D), and an arbitrary (valid) covariance model (Σ_y = Λ_y may be any invertible covariance matrix). Here we consider the known and arbitrary-covariance models in detail. If the arbitrary covariance model is used in both classes, then we assume that the covariance matrices in each class are independent. The parameter space of μ_y is ℝ^D, and the parameter space of Λ_y must be carefully defined to permit only valid covariance matrices.
As Σ_y and Λ_y are equivalent in the cases we will consider, we will write Σ_y in place of Λ_y without explicitly showing its dependence on Λ_y, i.e., we write Σ_y rather than Σ_y(Λ_y). We also denote a multivariate Gaussian distribution with mean μ and covariance Σ by f_{μ,Σ}(x), so that the parameterized class-conditional distributions can be written as f_{θ_y}(x|y) = f_{μ_y,Σ_y}(x). Under the independence assumption, c, θ_0 = [μ_0, Σ_0] and θ_1 = [μ_1, Σ_1] are all independent prior to observing the data, so that π(θ) = π(c) π(θ_0) π(θ_1). Assuming π(c) and π*(c) have been established, we must define priors π(θ_y) and find posteriors π*(θ_y) for both classes.

We begin by specifying conjugate priors for θ_0 and θ_1. Define

f_m(μ; ν, m, Σ) = |Σ|^(−1/2) exp(−(ν/2) (μ − m)^T Σ^(−1) (μ − m)),   (4.25)

f_c(Σ; κ, S) = |Σ|^(−(κ+D+1)/2) exp(−(1/2) trace(S Σ^(−1))),   (4.26)

which involve several constants: ν, m, κ and S. If ν > 0, then f_m is a (scaled) Gaussian distribution with mean m and covariance Σ/ν. If κ > D − 1 and S is symmetric and positive definite, then f_c is a (scaled) inverse-Wishart(κ, S) distribution. However, to allow for improper priors we do not necessarily require f_m and f_c to be normalizable.

Consider class y ∈ {0, 1}. In the arbitrary covariance model, we assume Σ_y is invertible with probability 1 and that for invertible Σ_y the prior for θ_y is of the form

π(θ_y) = π(μ_y|Σ_y) π(Σ_y),   (4.27)
π(μ_y|Σ_y) ∝ f_m(μ_y; ν_y, m_y, Σ_y),   (4.28)
π(Σ_y) ∝ f_c(Σ_y; κ_y, S_y),   (4.29)

where ν_y is a real number, m_y is a length-D real vector, κ_y is a real number, and S_y is a symmetric non-negative definite D × D matrix. If ν_y > 0, then the prior for the mean conditioned on the covariance, π(μ_y|Σ_y), is proper and Gaussian with mean m_y and covariance Σ_y/ν_y. The hyperparameter m_y is the prior expected mean of class y, where the larger ν_y is the more confident we are that μ_y is close to m_y.

In the arbitrary covariance model, π(Σ_y) is a proper inverse-Wishart distribution if κ_y > D − 1 and S_y is symmetric and positive definite. If in addition ν_y > 0, then π(θ_y) is a normal-inverse-Wishart distribution, which is the conjugate prior for the mean and covariance when sampling from normal distributions [38, 39]. As long as κ_y > D + 1, the prior mean of Σ_y exists and is given by E_π[Σ_y] = S_y/(κ_y − D − 1). Thus, S_y determines the expected shape of the covariance, where the actual expected covariance is scaled. If S_y is scaled appropriately, then the larger κ_y is the more certainty we have about the covariance Σ_y.

In this model, the posterior has the same form as the prior [34],

π*(θ_y) ∝ f_m(μ_y; ν_y*, m_y*, Σ_y) f_c(Σ_y; κ_y*, S_y*),   (4.30)

with updated hyperparameters

ν_y* = ν_y + n_y,   (4.31)
m_y* = (ν_y m_y + n_y μ̂_y) / (ν_y + n_y),   (4.32)
κ_y* = κ_y + n_y,   (4.33)
S_y* = S_y + (n_y − 1) Σ̂_y + (ν_y n_y / (ν_y + n_y)) (μ̂_y − m_y)(μ̂_y − m_y)^T,   (4.34)

where μ̂_y and Σ̂_y are the sample mean and sample covariance of the n_y training points in class y. Improper priors can still be used so long as the posterior is proper: for a proper posterior in the arbitrary covariance model we require ν_y* > 0, κ_y* > D − 1, and that S_y* is symmetric and positive definite. The previous discussion on properties of a proper prior again applies to the posterior, namely that π*(μ_y|Σ_y) must be a valid Gaussian distribution and π*(Σ_y) must be a valid inverse-Wishart distribution.
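A direct implementation of the updates (4.31)-(4.34) is short; the following sketch is my own, with an improper prior chosen purely for illustration.

import numpy as np

def niw_posterior(nu, m, kappa, S, X):
    # Update the normal-inverse-Wishart hyperparameters with the class-y sample X (n_y x D).
    n_y, D = X.shape
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, ddof=1) if n_y > 1 else np.zeros((D, D))
    nu_star = nu + n_y                                                        # (4.31)
    m_star = (nu * m + n_y * mu_hat) / (nu + n_y)                             # (4.32)
    kappa_star = kappa + n_y                                                  # (4.33)
    d = (mu_hat - m).reshape(-1, 1)
    S_star = S + (n_y - 1) * Sigma_hat + (nu * n_y / (nu + n_y)) * (d @ d.T)  # (4.34)
    return nu_star, m_star, kappa_star, S_star

# With the improper prior nu = kappa = 0, S = 0, the posterior reduces to sample statistics:
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=40)
nu_s, m_s, kappa_s, S_s = niw_posterior(0.0, np.zeros(2), 0.0, np.zeros((2, 2)), X0)
print(nu_s, kappa_s)          # 40, 40
print(np.round(m_s, 2))       # the sample mean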
Continuing with the arbitrary covariance model, the parameter space of θ_y is the product of the space of all valid mean vectors, ℝ^D, and the space of all positive-definite matrices, which we denote by Σ_y > 0. By definition,

f(x|y) = ∫_{Σ_y>0} ∫_{ℝ^D} f_{μ_y,Σ_y}(x) π*(μ_y|Σ_y) π*(Σ_y) dμ_y dΣ_y.   (4.35)

Given that π*(μ_y|Σ_y) is Gaussian and π*(Σ_y) is inverse-Wishart, one can show that evaluation of the double integral yields a multivariate student's t-distribution [34]:

f(x|y) = [1 / (k_y^(D/2) π^(D/2) |Ψ_y|^(1/2))] × [Γ((k_y + D)/2) / Γ(k_y/2)] × [1 + (1/k_y)(x − m_y*)^T Ψ_y^(−1) (x − m_y*)]^(−(k_y+D)/2),   (4.36)

with location vector m_y*, scale matrix Ψ_y = ((ν_y* + 1)/((κ_y* − D + 1) ν_y*)) S_y*, and k_y = κ_y* − D + 1 degrees of freedom. This distribution is proper because (ν_y* + 1)/((κ_y* − D + 1) ν_y*) > 0 and S_y* is symmetric and positive definite (so the scale matrix is symmetric and positive definite) and κ_y* − D + 1 > 0. As long as κ_y* > D the mean of this distribution is m_y*, and as long as κ_y* > D + 1 the variance is ((ν_y* + 1)/((κ_y* − D − 1) ν_y*)) S_y*.

Switching gears to the known covariance model, now assume that the prior for θ_y is of the form

π(θ_y) = π(μ_y|Σ_y) π(Σ_y),   (4.37)

where π(μ_y|Σ_y) is given in (4.28) and π(Σ_y) is simply a point mass at the known value of Σ_y. Again, we require that ν_y be a real number and m_y be a length-D real vector, where if ν_y > 0 then the prior for the mean is proper and Gaussian with mean m_y and covariance Σ_y/ν_y. Also as before, the posterior has the same form as the prior with the same hyperparameter update equations, (4.31) and (4.32). For a proper posterior, we require ν_y* > 0. In the known covariance model, we may simplify the effective density in (4.35) as

f(x|y) = ∫_{ℝ^D} f_{μ_y,Σ_y}(x) π*(μ_y|Σ_y) dμ_y,   (4.38)

where Σ_y is the known covariance matrix. One can show that this integral yields a proper Gaussian distribution with mean m_y* and covariance ((ν_y* + 1)/ν_y*) Σ_y [34]:

f(x|y) = [(ν_y*)^(D/2) / ((ν_y* + 1)^(D/2) (2π)^(D/2) |Σ_y|^(1/2))] exp(−(ν_y*/(2(ν_y* + 1))) (x − m_y*)^T Σ_y^(−1) (x − m_y*)).   (4.39)

4.8 Optimal Bayesian Classifier in the Gaussian Model

There are three cases to consider when finding the OBC: the covariances are known in both classes, a covariance is known in only one class, and the covariances are unknown in both classes (for derivation details see [37]). It is interesting to consider the shape of the decision boundary for the OBC as compared to the shapes of the decision boundaries for each feature-label distribution in the uncertainty class; in particular, note how the effective class-conditional distributions become multivariate student's t-distributions.

When both covariances are known, in the previous section we showed that the effective class-conditional distributions are Gaussian with mean m_y* and covariance ((ν_y* + 1)/ν_y*) Σ_y for y ∈ {0, 1}. The OBC, ψ_OBC(x), is the optimal classifier between the effective Gaussians with class-0 probability E_π*[c], and is of the same form as the Bayes classifier in (4.5) and (4.6) with discriminant g_OBC(x) given by

A_OBC = −(1/2) [(ν_1*/(ν_1* + 1)) Σ_1^(−1) − (ν_0*/(ν_0* + 1)) Σ_0^(−1)],
a_OBC = (ν_1*/(ν_1* + 1)) Σ_1^(−1) m_1* − (ν_0*/(ν_0* + 1)) Σ_0^(−1) m_0*,
b_OBC = −(1/2) [(ν_1*/(ν_1* + 1)) m_1*^T Σ_1^(−1) m_1* − (ν_0*/(ν_0* + 1)) m_0*^T Σ_0^(−1) m_0*] + ln[((1 − E_π*[c])/E_π*[c]) ((ν_1*(ν_0* + 1))/(ν_0*(ν_1* + 1)))^(D/2) (|Σ_0|/|Σ_1|)^(1/2)].   (4.40)

The expected true error for the OBC is simply the true error for this quadratic classifier under the effective Gaussian distributions.
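Both effective densities are available in SciPy, which makes them easy to explore numerically. The sketch below is my own illustration (it needs SciPy 1.6 or later for multivariate_t); the hyperparameter values are hypothetical.

import numpy as np
from scipy.stats import multivariate_t, multivariate_normal

def effective_density_arbitrary(x, nu_star, m_star, kappa_star, S_star):
    # Multivariate Student's t effective density of (4.36).
    D = len(m_star)
    k = kappa_star - D + 1                          # degrees of freedom
    Psi = (nu_star + 1) / (k * nu_star) * S_star    # scale matrix
    return multivariate_t(loc=m_star, shape=Psi, df=k).pdf(x)

def effective_density_known(x, nu_star, m_star, Sigma):
    # Gaussian effective density of (4.39) when the covariance is known.
    return multivariate_normal(mean=m_star, cov=(nu_star + 1) / nu_star * Sigma).pdf(x)

x = np.array([0.5, 0.5])
print(effective_density_arbitrary(x, 40.0, np.zeros(2), 40.0, 37.0 * np.eye(2)))
print(effective_density_known(x, 40.0, np.zeros(2), np.eye(2)))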
If the covariance is known in only one class and modeled as arbitrary in the other, then the effective class-conditional distribution for the known class, say class 0, is Gaussian and the other class is a multivariate student's t-distribution; hence,

f(x|0) = [(ν_0*)^(D/2) / ((ν_0* + 1)^(D/2) (2π)^(D/2) |Σ_0|^(1/2))] exp(−(ν_0*/(2(ν_0* + 1))) (x − m_0*)^T Σ_0^(−1) (x − m_0*)),   (4.41)

f(x|1) = [1 / (k_1^(D/2) π^(D/2) |Ψ_1|^(1/2))] × [Γ((k_1 + D)/2) / Γ(k_1/2)] × [1 + (1/k_1)(x − m_1*)^T Ψ_1^(−1) (x − m_1*)]^(−(k_1+D)/2),   (4.42)

and, from (4.36), Ψ_1 = ((ν_1* + 1)/((κ_1* − D + 1) ν_1*)) S_1* and k_1 = κ_1* − D + 1. The discriminant of the OBC can be simplified to

g_OBC(x) = (ν_0*/(ν_0* + 1)) (x − m_0*)^T Σ_0^(−1) (x − m_0*) − (k_1 + D) ln[1 + (1/k_1)(x − m_1*)^T Ψ_1^(−1) (x − m_1*)] + K,   (4.43)

where

K = 2 ln[((1 − E_π*[c])/E_π*[c]) (2(ν_0* + 1)/(ν_0* k_1))^(D/2) (|Σ_0|/|Ψ_1|)^(1/2) Γ((k_1 + D)/2)/Γ(k_1/2)].

The form of the OBC is not necessarily linear or quadratic.

When the covariances of both classes are unknown and arbitrary, the effective class-conditional distribution for each class is multivariate student's t with location vector m_y*, scale matrix Ψ_y and k_y degrees of freedom, as given in (4.36). The discriminant of the OBC can be simplified to

g_OBC(x) = K [1 + (1/k_0)(x − m_0*)^T Ψ_0^(−1) (x − m_0*)]^(k_0+D) − [1 + (1/k_1)(x − m_1*)^T Ψ_1^(−1) (x − m_1*)]^(k_1+D),   (4.44)

where

K = ((1 − E_π*[c])/E_π*[c])² (k_0/k_1)^D (|Ψ_0|/|Ψ_1|) [Γ(k_0/2) Γ((k_1 + D)/2) / (Γ((k_0 + D)/2) Γ(k_1/2))]².   (4.45)

This classifier has a polynomial decision boundary that is not necessarily linear or quadratic as long as k_0 and k_1 are integers, which is satisfied for arbitrary covariance models with independent covariances if κ_0 and κ_1 are integers.

Consider an example with D = 2 features, where each class is equally likely (c = 0.5) and the class-conditional distributions are known to be Gaussian with unequal and arbitrary invertible covariances. We assume that the mean and covariance pairs associated with each class are independent and given by a normal-inverse-Wishart prior with hyperparameters ν_0 = ν_1 = 0, κ_0 = κ_1 = 0, m_0 = m_1 = [0, 0] and S_0 = S_1 = −2 I_2. Further suppose that we observe 40 sample points from class 0 and 4 sample points from class 1, where the sample mean of class 0 is [0, 0], the sample mean of class 1 is [1, 1], and the sample covariance of both classes is I_2. Then the posteriors are proper normal-inverse-Wishart distributions given by hyperparameters ν_0* = κ_0* = 40, m_0* = [0, 0], S_0* = 37 I_2, ν_1* = κ_1* = 4, m_1* = [1, 1], and S_1* = I_2.

We will consider three classifiers. The first is a plug-in classifier, which substitutes the posterior expected means and covariances into the Bayes classifier, that is, we assume that μ_0 is E_π*[μ_0] = m_0* = [0, 0], μ_1 is m_1* = [1, 1], and Σ_y is E_π*[Σ_y] = S_y*/(κ_y* − D − 1) = I_2. Since the expected covariances are equal, this classifier is linear. Note for this prior the posterior expected parameters coincide with the sample means and covariances, so that the plug-in classifier is also equivalent to an LDA classifier. The second classifier that we consider is a state-constrained optimal Bayesian classifier (SCOBC), which is found by searching across mean and covariance pairs in the uncertainty class for a Bayes classifier having minimal expected error [40]. Since the Bayes classifier for any Gaussian distribution is quadratic, the SCOBC is quadratic. Finally, we have the optimal Bayesian classifier, which is available in closed form in (4.44).
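To make the closed form concrete, the sketch below evaluates the discriminant (4.44)-(4.45) with the posterior hyperparameters just stated. It is my own illustration; in particular, taking g_OBC(x) ≤ 0 as the class-0 region is my reading of the decision convention, and the test points are arbitrary.

import numpy as np
from scipy.special import gammaln

D = 2
E_c = 0.5                                   # c = 0.5 is fixed in this example
post = {0: dict(nu=40.0, m=np.array([0.0, 0.0]), kappa=40.0, S=37.0 * np.eye(D)),
        1: dict(nu=4.0,  m=np.array([1.0, 1.0]), kappa=4.0,  S=1.0 * np.eye(D))}

def t_params(p):
    # Location, scale matrix and degrees of freedom of the effective density (4.36).
    k = p["kappa"] - D + 1
    Psi = (p["nu"] + 1) / (k * p["nu"]) * p["S"]
    return k, p["m"], Psi

k0, m0, Psi0 = t_params(post[0])
k1, m1, Psi1 = t_params(post[1])

def quad(x, m, Psi):
    d = x - m
    return float(d @ np.linalg.solve(Psi, d))

# The constant K of (4.45), assembled in log space for numerical stability.
logK = (2.0 * np.log((1.0 - E_c) / E_c) + D * np.log(k0 / k1)
        + np.log(np.linalg.det(Psi0) / np.linalg.det(Psi1))
        + 2.0 * (gammaln(k0 / 2) + gammaln((k1 + D) / 2)
                 - gammaln((k0 + D) / 2) - gammaln(k1 / 2)))

def g_obc(x):
    # Discriminant (4.44); the class-0 region is taken here to be g_obc(x) <= 0.
    return (np.exp(logK) * (1 + quad(x, m0, Psi0) / k0) ** (k0 + D)
            - (1 + quad(x, m1, Psi1) / k1) ** (k1 + D))

for x in ([0.0, 0.0], [1.0, 1.0], [2.0, 2.0]):
    x = np.array(x)
    print(x, "-> class", 0 if g_obc(x) <= 0 else 1)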
Since the effective densities are not Gaussian but multivariate student's t-distributions, the OBC has a polynomial decision boundary of greater than quadratic order. Figure 4.4 shows the plug-in classifier (light gray), SCOBC (dark gray) and OBC (black). Level curves for the class-conditional distributions corresponding to the expected parameters in the posteriors used in the plug-in rule are shown in light gray dashed lines, and level curves for the distributions corresponding to the optimal parameters found in the SCOBC are shown in dark gray dashed lines. Each classifier is quite distinct, and in particular, the optimal Bayesian classifier is non-quadratic even though all class-conditional distributions in the uncertainty class are Gaussian.

Fig. 4.4 Classifiers for an independent arbitrary covariance Gaussian model with D = 2 features and proper posteriors. The optimal Bayesian classifier is polynomial with expected true error 0.2007 (averaged over the posterior on the uncertainty class of states), the state-constrained optimal Bayesian classifier is quadratic with expected true error 0.2061 and the plug-in classifier is linear with expected true error 0.2078

4.9 Concluding Remarks

This chapter follows a natural progression: with small samples distributional knowledge has to be applied to obtain performance bounds, without which a classifier is epistemologically meaningless, and once distributional knowledge is assumed the obvious path to take is to engage in optimal error estimation and optimal classifier design. There is nothing surprising about these developments. As far back as 1925, R.A. Fisher wrote, "Little experience is sufficient to show that the traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data." [41]

Before closing, let us mention some technical issues relating to the Bayesian theory. For the Gaussian model, the effective class-conditional distributions, the MMSE error estimate for linear classifiers, and the OBC can be found analytically; in particular, the posterior distribution has the same form as the prior. Although not covered herein, similar comments apply to the discrete multinomial model with Dirichlet priors. However, closed-form analytic solutions are not generally possible. For instance, in the Gaussian model with nonlinear classifiers there is no analytic expression for the MMSE error estimator and Monte Carlo methods must be employed [42]. Leaving the Gaussian model, one typically needs to employ numerical methods; for instance, Markov chain Monte Carlo (MCMC) methods have been used to find the OBC with a hierarchical Poisson model [43].

A fundamental problem for any Bayesian approach is prior construction. Historically, various methods have been proposed to construct prior probabilities in different contexts [44–48]; however, these are general methodologies in that they do not target any specific type of prior information. If one tailors a prior to a specific problem in hand, then one can do better.
For instance, in genomics biological knowledge in the form of regulatory pathways can be translated into feature-label knowledge for classification. This has been achieved for Gaussian network models [49], thereby significantly improving classification accuracy. The basic idea is that regulatory control constrains the feature-label distribution, in particular, the correlation between certain features in the Gaussian model. Priors are built according to the heuristic that there should be maximum uncertainty in the prior, given the regulatory constraints. Under very general conditions, the posterior π ∗ (θ ) converges to the true value of θ as the sample size goes to infinity, but this is of little interest when samples are small. What is of interest is the degree of uncertainty as it relates to classification accuracy. An obvious measure of uncertainty is the entropy of the posterior; however, what really matters is the uncertainty relating to our objective, not simply uncertainty in general. To this end, one can define the objective cost of uncertainty, which relates to the loss of classification performance obtained by the OBC relative to the performance should one know the true feature-label distribution [50]. In closing, we point out a critical advantage of Bayesian MMSE error estimation over classical non-Bayesian estimators. For standard data-driven error estimators, nothing can be said about the MSE of an error estimator given the sample. One can only compute the MSE as an expectation over all samples. However, in a Bayesian framework, one can compute the sample-conditioned MSE for a Bayesian error estimate, ε̂, on a fixed classifier, ψn . This is equivalent to the variance of the true error conditioned on the observed sample [51], MSE(ε̂|Sn ) = Var π ∗ (ε(θ, ψn )) , (4.46) where the variance is taken with respect to π ∗ (θ ). The sample-conditioned MSE converges to zero with probability 1 in both the discrete multinomial and independent covariance Gaussian models and closed-form expressions for the MSE are available [35]. 4 Small-Sample Classification 99 References 1. T.W. Anderson, Classification by multivariate analysis. Psychometrika 16(1), 31–50 (1951) 2. M.S. Esfahani, E.R. Dougherty, Effect of separate sampling on classification accuracy. Bioinformatics 30(2), 242–250 (2014) 3. U.M. Braga-Neto, A. Zollanvari, E.R. Dougherty, Cross-validation under separate sampling: optimistic bias and how to correct it. Bioinformatics 30(23), 3349–3355 (2014) 4. V.N. Vapnik, A. Chervonenkis, Theory of Pattern Recognition (Nauka, Moscow, 1974) 5. I. Shmulevich, E.R. Dougherty, Genomic Signal Processing (Princeton University Press, Princeton, 2007) 6. L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Stochastic Modelling and Applied Probability (Springer, New York, 1996) 7. C. Li, K.C.K. Soh, P. Wu, Formability of ABO3 Perovskites. J. Alloys Compd. 372(1), 40–48 (2004) 8. U.M. Braga-Neto, E.R. Dougherty, Is cross-validation valid for small-sample microarray classification? Bioinformatics 20(3), 374–380 (2004) 9. B. Hanczar, J. Hua, E.R. Dougherty, Decorrelation of the true and estimated classifier errors in high-dimensional settings. EURASIP J. Bioinform. Syst. Biol. Article ID 38473, 12 pp (2007) 10. U. Braga-Neto, E.R. Dougherty, Exact performance of error estimators for discrete classifiers. Pattern Recognit. 38(11), 1799–1814 (2005) 11. M.R. Yousefi, E.R. Dougherty, Performance reproducibility index for classification. 
Bioinformatics 28(21), 2824–2833 (2012) 12. M.R. Yousefi, J. Hua, C. Sima, E.R. Dougherty, Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26(1), 68–76 (2010) 13. M.R. Yousefi, J. Hua, E.R. Dougherty, Multiple-rule bias in the comparison of classification rules. Bioinformatics 27(12), 1675–1683 (2011) 14. B. Hanczar, J. Hua, C. Sima, J. Weinstein, M. Bittner, E.R. Dougherty, Small-sample precision of ROC-related estimates. Bioinformatics 26, 822–830 (2010) 15. M. Hills, Allocation rules and their error rates. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 28(1), 1–31 (1966) 16. D. Foley, Considerations of sample and feature size. IEEE Trans. Inf. Theory 18(5), 618–626 (1972) 17. M.J. Sorum, Estimating the conditional probability of misclassification. Technometrics 13, 333–343 (1971) 18. G.J. McLachlan, An asymptotic expansion of the expectation of the estimated error rate in discriminant analysis. Aust. J. Stat. 15(3), 210–214 (1973) 19. M. Moran, On the expectation of errors of allocation associated with a linear discriminant function. Biometrika 62(1), 141–148 (1975) 20. M. Goldstein, E. Wolf, On the problem of bias in multinomial classification. Biometrics 33, 325–331 (1977) 21. A. Davison, P. Hall, On the bias and variability of bootstrap and cross-validation estimates of error rates in discrimination problems. Biometrica 79, 274–284 (1992) 22. Q. Xu, J. Hua, U.M. Braga-Neto, Z. Xiong, E. Suh, E.R. Dougherty, Confidence intervals for the true classification error conditioned on the estimated error. Technol. Cancer Res. Treat. 5, 579–590 (2006) 23. A. Zollanvari, U.M. Braga-Neto, E.R. Dougherty, On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recognit. 42(11), 2705–2723 (2009) 24. A. Zollanvari, U.M. Braga-Neto, E.R. Dougherty, On the joint sampling distribution between the actual classification error and the resubstitution and leave-one-out error estimators for linear classifiers. IEEE Trans. Inf. Theory 56(2), 784–804 (2010) 25. A. Zollanvari, U.M. Braga-Neto, E.R. Dougherty, Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic Gaussian model. Pattern Recognit. 45(2), 908–917 (2012) 100 L.A. Dalton and E.R. Dougherty 26. A. Zollanvari, U.M. Braga-Neto, E.R. Dougherty, Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans. Signal Process. 59(9), 4238–4255 (2011) 27. F. Wyman, D. Young, D. Turner, A comparison of asymptotic error rate expansions for the sample linear discriminant function. Pattern Recognit. 23, 775–783 (1990) 28. V. Pikelis, Comparison of methods of computing the expected classification errors. Autom. Remote Control 5, 59–63 (1976) 29. E.R. Dougherty, A. Zollanvari, U.M. Braga-Neto, The illusion of distribution-free small-sample classification in genomics. Curr. Genomics 12(5), 333–341 (2011) 30. B. Efron, Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78(382), 316–331 (1983) 31. T. Vu, C. Sima, U.M. Braga-Neto, E.R. Dougherty, Unbiased bootstrap error estimation for linear discriminant analysis. EURASIP J. Bioinform. Syst. Biol. 2014(1), 15 (2014) 32. C. Sima, E.R. Dougherty, Optimal convex error estimators for classification. Pattern Recognit. 39, 1763–1780 (2006) 33. L.A. Dalton, E.R. 
Dougherty, Bayesian minimum mean-square error estimation for classification error-Part I: Definition and the Bayesian MMSE error estimator for discrete classification. IEEE Trans. Signal Process. 59(1), 115–129 (2011) 34. L.A. Dalton, E.R. Dougherty, Bayesian minimum mean-square error estimation for classification error-Part II: The Bayesian MMSE error estimator for linear classification of Gaussian distributions. IEEE Trans. Signal Process. 59(1), 130–144 (2011) 35. L.A. Dalton, E.R. Dougherty, Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error-Part II: Consistency and performance analysis. IEEE Trans. Signal Process. 60(5), 2588–2603 (2012) 36. U. Braga-Neto, E. Dougherty, Bolstered error estimation. Pattern Recognit. 37(6), 1267–1281 (2004) 37. L.A. Dalton, E.R. Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework-Part I: Discrete and Gaussian models. Pattern Recognit. 46(5), 1301– 1314 (2013) 38. M.H. DeGroot, Optimal Statistical Decisions (McGraw-Hill, New York, 1970) 39. H. Raiffa, R. Schlaifer, Appl. Stat. Decis. Theory (MIT Press, Cambridge, 1961) 40. E.R. Dougherty, J. Hua, Z. Xiong, Y. Chen, Optimal robust classifiers. Pattern Recognit. 38(10), 1520–1532 (2005) 41. R.A. Fisher, Statistical Methods for Research Workers (Oliver and Boyd, Edinburgh, 1925) 42. L.A. Dalton, E.R. Dougherty, Application of the Bayesian MMSE estimator for classification error to gene expression microarray data. Bioinformatics 27(13), 1822–1831 (2011) 43. J.M. Knight, I. Ivanov, E.R. Dougherty, MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: Model-based RNA-Seq classification. BMC Bioinform. 15(1), 401 (2014) 44. J.M. Bernardo, Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. Ser. B (Methodol.), 113-147 (1979) 45. J. Rissanen, A universal prior for integers and estimation by minimum description length. Ann. Stat. 416-431 (1983) 46. J.C. Spall, S.D. Hill, Least-informative Bayesian prior distributions for finite samples based on information theory. IEEE Trans. Autom. Control 35(5), 580–583 (1990) 47. J.O. Berger, J.M. Bernardo, On the development of reference priors. Bayesian Stat. 4(4), 35–60 (1992) 48. R.E. Kass, L. Wasserman, The selection of prior distributions by formal rules. J. Am. Stat. Assoc. 91(435), 1343–1370 (1996) 49. M.S. Esfahani, E. Dougherty, Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(1), 202–218 (2014) 50. B.-J. Yoon, X. Qian, E.R. Dougherty, Quantifying the objective cost of uncertainty in complex dynamical systems. Signal Process., IEEE Trans. 61(9), 2256–2266 (2013) 4 Small-Sample Classification 101 51. L.A. Dalton, E.R. Dougherty, Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error-Part I: Representation. IEEE Trans. Signal Process. 60(5), 2575–2587 (2012) Chapter 5 Data Visualization and Structure Identification J.E. Gubernatis Abstract For three datasets, all dealing with materials with ABO3 chemistries, the two data visualizations algorithms of Tsafrir et al. [Bioinformatics 21, 2301 (2005)] were studied and applied. These algorithms permute the distance matrix associated with the data in a way to unveil structure in one case by keeping large-distanced information afar or in the other case by keeping small-distanced information near. 
Modifications to their proposed numerical implementations were made to enhance effectiveness. The two algorithms were used both in the space of the materials and in the space of the features, looking for groupings of features and materials. In general, for the datasets considered, when visualized, the features tended to show more distinctive structure (clustering) than the materials. For enhanced grouping of materials, the initial studies point to the importance of feature selection.

5.1 Introduction

The pre-emptive focus of Materials Informatics is gathering materials data and extracting from them sign-posts for candidate materials with enhanced properties. We studied three datasets, previously used in materials informatics studies [1] that had similar objectives, literally asking: What does the data look like? To assist in visualizing the data, we used the recent work by Tsafrir et al. [2] in bioinformatics that presented two seemingly simple algorithms to visualize the data in a way that also revealed structure in them, that is, correlations among the materials and features being visualized. Their algorithms reorder the data by permuting the rows and columns of a distance matrix constructed from the data matrix. The permutations minimize a cost function that favors placing data close together when the distances between them are small. With addenda, the algorithms become clustering methods which do not a priori assume the number of clusters [3, 4].

The distance matrix D is usually formed after normalizing the data. Normalization is necessary because the different features have different units and the values of different features vary by orders of magnitude. The rows and columns of the data set tables are viewed as an M × F matrix of materials and features. Each column of features is regarded as an M-vector which is normalized by first computing the mean value of its components and subtracting the mean from each component. Next, the standard deviation for each mean-centered column is computed and each component is divided by it. After normalization, the units have disappeared and the numerical values in each column of data have the same center (mean of zero) and the same range (variance of unity). The result is a new M × F-dimensional Data matrix, Data = (f_1, f_2, . . . , f_F).

Various definitions of a distance matrix exist. A Euclidean distance is the only type considered in this report. We computed a Euclidean distance matrix from the normalized data matrix in two ways. One way is what we call computing distances in Materials Space. In this space, the components of the F × F distance matrix are

D_ij = √((f_i − f_j) · (f_i − f_j)).   (5.1)

Here, each feature i is regarded as an M-dimensional vector f_i of materials. The second way is what we call computing distances in Features Space. Here, we work with the rows of the normalized Data matrix instead of the columns: Data = (m_1, m_2, . . . , m_M)^T, where each material i is an F-dimensional vector m_i of features. In this space, the components of the M × M distance matrix are

D_ij = √((m_i − m_j) · (m_i − m_j)).   (5.2)

We illustrate these vectors schematically in Fig. 5.1.
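Setting up the normalization and the two distance matrices takes only a few lines; in this sketch (my own, with a random stand-in for a real M × F materials-by-features table) the sizes match the piezo.dat example discussed below.

import numpy as np

rng = np.random.default_rng(0)
M, F = 22, 31                               # materials x features, as in piezo.dat
data = rng.normal(size=(M, F))              # placeholder for the real table

# Column-wise normalization: zero mean and unit variance for every feature.
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

def euclidean_distance_matrix(V):
    # Pairwise Euclidean distances between the rows of V.
    diff = V[:, None, :] - V[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D_materials_space = euclidean_distance_matrix(Z.T)   # F x F, distances between feature vectors, (5.1)
D_features_space = euclidean_distance_matrix(Z)      # M x M, distances between material vectors, (5.2)
print(D_materials_space.shape, D_features_space.shape)   # (31, 31) (22, 22)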
We applied the Tsafrir et al. algorithms to the data in both spaces. These algorithms reorder the data so the points in the respective space are closer together than they are in the tables. In Materials Space, they group together features; in Features Space, they group together materials. Is there anything to be gained by viewing the data in these two different ways?

5.2 Theory

The Tsafrir et al. algorithms find a permutation matrix P that minimizes

F(P) = Tr(P D P^T W).   (5.3)

A permutation matrix is a matrix whose elements are all zero except for one element in each row and column that is unity. The permutation matrix P is usually represented as an integer array IP, where IP(i) = j gives the non-zero column j for the ith row.

Fig. 5.1 Different spaces in which to represent the data. a Space of the materials: the materials point to a feature. b The space of the features: the features point to a material

Different W matrices define the two different algorithms [2]. The first algorithm is called "Side-to-Side (STS)". Here the matrix elements of W are

w_ij = x_i x_j,   (5.4)

where the x_i are components of any vector X that satisfy x_i < x_j for i < j. The W created from this vector pushes apart data separated by large distances. For the results reported here, we used the choice of Tsafrir et al., X = (−N/2, −N/2 + 1, · · · , N/2), although we found that multiplying this vector by a factor of two, three, or four often gave more satisfying results. Shifting the vector so all its components are positive typically degraded the results. Their second algorithm is called "Neighborhood (NBRHD)". Here the matrix elements of W are

w_ij = exp(−|i − j|/σ²).   (5.5)

This choice pulls together data separated by small distances.

Both minimizations are NP-hard problems, meaning that obtaining a good (non-unique) local minimum is the best that one can expect. As the order of the matrix increases, many good minima can exist. Side-to-Side belongs to a class of problems called quadratic assignment problems; Neighborhood, to linear assignment problems. Tsafrir et al. give a numerical procedure to do the minimization for each choice of W. The overall advice was to restart each procedure multiple times, each from a different point, and keep the lowest value. For Neighborhood, they note that the parameter σ could be used as an "annealing" parameter: get a solution for a small value, use that solution for a larger value, and then repeat these two steps ten times or so. Because the W of Side-to-Side factorizes, the algorithm they proposed for it scales as the square of the order of the matrices; for Neighborhood, their proposed procedure scales as the cube.

We found the suggested numerical procedures of Tsafrir et al. gave mixed performance for the datasets under study. For each W, we instead used Algorithm 1.

Algorithm 1: Minimization Procedure
  Initialize t = 0, P^(t−1) = 0, P^t = I, and W^t = W.
  while P^t ≠ P^(t−1) do
    t ← t + 1
    Solve for P^t = arg min_P Tr(P D W^(t−1))
    W^(t+1) = [P^t]^T W
  end while
  D ← P^t D [P^t]^T

This algorithm modestly differs from Tsafrir et al.'s Neighborhood algorithm in the following respects: First, we are using it for both the Side-to-Side and Neighborhood W. In other words, we are treating each problem as if it were a linear assignment problem. The most important difference is our step "Solve . . .": Instead of using their suggested procedures to find the permutation matrix, we are using the Hungarian algorithm [5], a standard method for assignment problems.
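As a concrete illustration of the procedure, the sketch below implements the iteration as I read it (in particular, the update W ← [P]^T W applied to the fixed weight matrix is my reconstruction), with SciPy's linear_sum_assignment playing the role of the Hungarian algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def sts_weight(n):
    # Side-to-Side weights w_ij = x_i x_j with x = (-n/2, ..., n/2), as in (5.4).
    x = np.linspace(-n / 2.0, n / 2.0, n)
    return np.outer(x, x)

def nbrhd_weight(n, sigma):
    # Neighborhood weights w_ij = exp(-|i - j| / sigma^2), as in (5.5).
    i = np.arange(n)
    return np.exp(-np.abs(i[:, None] - i[None, :]) / sigma ** 2)

def reorder(D, W, max_iter=100):
    # Iteratively reduce Tr(P D P^T W) by solving a sequence of linear assignments.
    n = D.shape[0]
    perm_prev, perm = None, np.arange(n)
    Wt = W.copy()
    for _ in range(max_iter):
        # Linearized subproblem: choose perm minimizing sum_i (D Wt)[perm(i), i].
        cost = (D @ Wt).T
        _, perm = linear_sum_assignment(cost)
        if perm_prev is not None and np.array_equal(perm, perm_prev):
            break
        perm_prev = perm
        P = np.zeros((n, n))
        P[np.arange(n), perm] = 1.0
        Wt = P.T @ W
    return D[np.ix_(perm, perm)], perm        # P D P^T and the ordering itself

# Toy usage: a 20 x 20 distance matrix with two scrambled blocks.
rng = np.random.default_rng(0)
labels = rng.permutation([0] * 10 + [1] * 10)
D = np.abs(labels[:, None] - labels[None, :]) + 0.1 * rng.random((20, 20))
D = (D + D.T) / 2.0
D_new, order = reorder(D, nbrhd_weight(20, sigma=3.0))
print(labels[order])   # the two blocks should come out contiguous, or nearly so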
Our stopping criterion also differs: Instead of using |F(P^(t+1)) − F(P^t)| < ε, or something similar, we are iterating until the permutation matrix ceases to change. This criterion in general forced more iteration steps and produced a lower value for the minimum. Restarts for the new algorithm for the cases considered seemed to gain little. We also tried minimizing by using a greedy Monte Carlo optimization procedure (making many random permutations many times and keeping the best) and a rudimentary simulated annealing optimization. In general, the results from the Hungarian algorithm had the tightest structure in the visualization.

5.3 Results

We now report select results. All were obtained with Algorithm 1. The three datasets studied were used by Balachandran et al. [1]. We call the datasets piezo.dat, pls.dat, and tree.dat. All the materials in the dataset piezo.dat are known ferroelectrics. Balachandran et al. used this data in a feature-reduction principal component analysis. The pls.dat data was used for a partial least-squares (PLS) analysis of the piezoelectric data, after a further somewhat ad hoc feature reduction, to generate an analytic expression for the Curie temperature, which they then used to predict possible Curie temperatures for perovskite chemistries not yet known to exist. All the materials in tree.dat had the ABO3 chemistry but not all had a perovskite crystal structure. This data was used by Balachandran et al. to construct a binary decision tree giving rules for when a perovskite crystal structure should exist.

5.3.1 The Piezo Data

This dataset has 22 materials and 31 features. In Fig. 5.2 are the distance matrices before the re-ordering of the data, and in Figs. 5.3 and 5.4 are these matrices after the re-ordering. Viewing Figs. 5.2 and 5.3 or Figs. 5.2 and 5.4 together, one can see small groupings of materials and features. We can relate these groupings to materials or features having the same A or B atoms. Otherwise, larger clumping of materials or features is not prevalent. The initial distance matrix in Fig. 5.2 shows only minor block structure near the diagonal (zero distance), with a bit more in Features Space (left) than in Materials Space (right). The Side-to-Side ordering produced more distinct clumping in Features Space than in Materials Space. The Neighborhood ordering produced very distinct clumping in Materials Space.

Fig. 5.2 Distance matrices for piezo.dat before reordering
Fig. 5.3 Distance matrices after re-ordering with Side-to-Side
Fig. 5.4 Distance matrices after re-ordering with Neighborhood. σ = 10
Fig. 5.5 Distance matrices for pls.dat before reordering

5.3.2 The Pls Data

This dataset has 21 materials and 7 features. Figure 5.5 shows the distance matrices before re-ordering, and Figs. 5.6 and 5.7 are these matrices after re-ordering. Viewing Figs. 5.5 and 5.6 or Figs. 5.5 and 5.7 together, one sees less clumping than seen in Figs. 5.3 or 5.4. Presumably this is caused by simply having fewer features. The initial distance matrices in Fig. 5.5 show little block structure along the diagonal. The Side-to-Side ordering produced distinct clumping in the Features and Materials Spaces. Neighborhood ordering produced virtually identical clumpings.

5.3.3 The Tree Data

Here, there are 355 materials and 13 features. Figure 5.8 shows the distance matrices before re-ordering, and Figs. 5.9 and 5.10 are these matrices after re-ordering.
Fig. 5.6 Distance matrices after re-ordering with Side-to-Side
Fig. 5.7 Distance matrices after re-ordering with Neighborhood. σ = 10

As for the pls.dat, the number of features is smaller than the number of materials. Here, their number is much smaller. In Materials Space, Fig. 5.8 shows some clear block structure along the diagonal. The Side-to-Side ordering tightened the clumping a bit in Materials Space, but it is Neighborhood ordering that produced the most distinct clumping in both spaces.

Fig. 5.8 Distance matrices for tree.dat before reordering
Fig. 5.9 Distance matrices after re-ordering with Side-to-Side
Fig. 5.10 Distance matrices after re-ordering with Neighborhood. σ = 10

5.4 Concluding Remarks

This initial study suggests several recommendations and items for future study. First, our findings are consistent with those of Tsafrir et al. that the Neighborhood method is generally the most revealing algorithm. Understudied to date is the potential for using σ to enhance the results. Figures 5.11, 5.12 and 5.13 show a brief study of what happens if the results in Fig. 5.10 were extended from σ = 10 to σ = 100, 200, and 300. The changes are mainly exposing more structure in Features Space.¹

Fig. 5.11 Distance matrices after re-ordering with Neighborhood. σ = 100
Fig. 5.12 Distance matrices after re-ordering with Neighborhood. σ = 200
Fig. 5.13 Distance matrices after re-ordering with Neighborhood. σ = 300

In general, finding materials clumping in Features Space for the tree.dat was the reason various modifications of the Tsafrir et al. algorithms were attempted and several Monte Carlo optimization methods were explored. Instead of using σ as an annealing parameter, as suggested by them, one could consider using it as a tempering parameter: Parallel tempering is generally a more effective Monte Carlo minimization scheme than simulated or quantum annealing. More effective still are the recently proposed partial and infinite swapping methods [6, 7]. The upfront question first needing an answer is, How good of a solution is needed for the intended applications? At this writing, the answer to this question is unestablished.

The parameter σ likely has a more immediate use in setting length scales. The differences in Features Space between Figs. 5.10 and 5.11 illustrate this. Years ago, the connection between a data clustering algorithm and a first-order phase transition was noted [8]. Several physics-based algorithms have exploited this fact to develop successful data clustering algorithms [9, 10]. The algorithms of Tsafrir et al., in a sense, are part of this alternative perspective. In a first-order phase transition, clustering (strong correlations) among interacting particles occurs at various length scales that are the consequences of the distances over which the interaction between particles is attractive or repulsive. The correlations become stronger as the temperature is lowered towards the transition temperature. σ is an analog to the temperature: Varying it here varies a length scale in the matrix W. Distinguishing the physics-based algorithms from standard machine learning algorithms is the presence of several length scales as opposed to none.
Curiously, a seminal paper [11] on the k-means clustering algorithm, one of the most popular machine learning clustering algorithms, proposed a “grouping” algorithm that had two length scales, one for refinement and one for coarsening. Refinement increases the “attraction” of data to a particular mean, and coarsening provides a “repulsion” from it. This suggestion captures the “physics” of a clustering method and is an algorithm awaiting implementation. We remark that the classification and clustering problems are connected. For classification problems, using algorithms that have length scales in them is likely to be highly desirable. Finding the effective scales for either type of problem for the given data is likely more important than trying a suite of machine learning algorithms to find the one that is most effective or a few that are consistent.

A variety of choices for the distance matrix exist. It appears that, whichever one is used, using it with a large number of features, at least with the current choices, has the potential of “washing out” the few features that are most important. For example, the datasets studied all had the tolerance factor as one of the features. By itself, the tolerance factor is traditionally used to separate perovskites from non-perovskites and ferroelectric perovskites from non-ferroelectric perovskites. For the analyses performed here, this feature seemed to have no assertive role.

As part of the feature selection issue, we suggest the following: Clustering and classification methods start with data normalized relative to some fictitious material that has the average features of the given dataset. The majority, not necessarily the optimal, determines the average, even though we are seeking materials that lie outside the range of the average. Clustering takes the additional step of scaling the data to homogenize the range. The ideal perovskite is SrTiO3 in the sense that it has nearly the ideal cubic crystal structure, but it is not a ferroelectric. PbTiO3, whose crystal structure is less than ideal, is in another sense the ideal (except for containing Pb), because it is an excellent ferroelectric. It seems that, in contrast to current machine learning clustering or classification schemes that define things relative to some average, we would want schemes that define things close to PbTiO3 but in a direction that points away from SrTiO3. It is unclear whether such schemes exist. On the other hand, within the existing visualization/clustering scheme, one can at least start with the data centered relative to PbTiO3 and then query the results for those cases that are also far from SrTiO3. Generally, it is Features Space in which we want to work, as we want to associate new materials with experimentally accessible features. Materials Space reveals features that are close. In some cases working in this space might reveal redundant features, that is, it might provide a means for feature reduction.

This work was supported by the Department of Energy's Laboratory Directed Research and Development Program.

References

1. P.V. Balachandran, S.R. Broderick, K. Rajan, Proc. R. Soc. A (2010). doi:10.1098/rspa.2010.0543
2. D. Tsafrir et al., Bioinformatics 21, 2301 (2005)
3. D. Filippova, A. Gagni, C. Kingsford, BMC Bioinformatics 13, 276 (2012)
4. M.
Neuditschko, M.S. Khatkar, H.W. Raadsma, PLOS ONE 7, e48375 (2012) 5. http://en.wikipedia.org/wiki/Hungarian_algorithm 6. N. Plattner et al., J. Chem. Phys. 135, 134111 (2011) 7. P. Dupuis et al., Multiscale Model Simul. 10, 986 (2012) 8. K. Rose et al., Phys. Rev. Lett. 65, 945 (1990) 9. M. Blatt et al., Phys. Rev. Lett. 76, 3251 (1996) 10. P. Ronhovede, Z. Nussinov, Phys. Rev. E 81, 046114 (2010) 11. J. B. McQueen, Some methods for classification and analysis of multivariate data. in Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, (University of California Press, Berkeley, 1967). p. 281 Chapter 6 Inference of Hidden Structures in Complex Physical Systems by Multi-scale Clustering Z. Nussinov, P. Ronhovde, Dandan Hu, S. Chakrabarty, Bo Sun, Nicholas A. Mauro and Kisor K. Sahu Abstract We survey the application of a relatively new branch of statistical physics— “community detection”—to data mining. In particular, we focus on the diagnosis of materials and automated image segmentation. Community detection describes the quest of partitioning a complex system involving many elements into optimally decoupled subsets or communities of such elements. We review a multiresolution variant which is used to ascertain structures at different spatial and temporal scales. Significant patterns are obtained by examining the correlations between different independent solvers. Similar to other combinatorial optimization problems in the NP Z. Nussinov (B) · B. Sun · D. Hu Washington University in St. Louis, St. Louis, MO 63130, USA e-mail: zohar@wuphys.wustl.edu B. Sun e-mail: bosun@wustl.edu D. Hu e-mail: dan1226@gmail.com Z. Nussinov Department of Condensed Matter Physics, Weizmann Institute of Science, 76100 Rehovot, Israel P. Ronhovde Findlay University, Findlay, OH 45840, USA e-mail: ronhovde@findlay.edu S. Chakrabarty Department of Physics, Indian Institute of Science, Bangalore 560012, India e-mail: schakrab@go.wustl.edu N.A. Mauro North Central College, Naperville, IL 60540, USA e-mail: Nicholas.mauro@gmail.com K.K. Sahu School of Minerals, Metallurgical and Materials Engineering, Indian Institute of Technology, Bhubaneswar 751007, India e-mail: kis.sahu@gmail.com © Springer International Publishing Switzerland 2016 T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_6 115 116 Z. Nussinov et al. complexity class, community detection exhibits several phases. Typically, illuminating orders are revealed by choosing parameters that lead to extremal information theory correlations. 6.1 The General Problem A basic question that we wish to discuss in this work is whether machine learning and data mining tools may be applied to the analysis of material properties. Specifically, we will review initial efforts to detect, via statistical mechanics and the tools of information science and network analysis, pertinent structures on all scales in general complex systems. We will describe mapping atomic and other configurations onto graphs. As we will explain, patterns found in these graphs via statistical physics methods may inform us about the structure of the investigated materials. These structures can appear on multiple spatial and temporal scales. In comparison to standard procedures, the advantage of such an approach may be significant. There are numerous classes of complex systems. One prototypical variety is that of glass forming liquids. 
“Glasses” have been analyzed with disparate tools [1–16]. Although they have been known for millennia, structural glasses remain ill understood. It is just over eighty years since the publication of one of the most famous papers concerning the structure of glasses [2]. Much has been learned since the early days of hand-built plastic models and drawings, yet basic questions persist. Amorphous systems such as glasses strongly contrast with idealized simple solids. In simple crystals, the structure of an atomic unit cell is replicated to span the entire system. Long before scattering and tunneling technologies, prominent figures such as Robert Hooke, Christiaan Huygens, and their contemporaries in the 17th century proposed that the existence of sharp facets in single crystals results from recurrent fundamental unit cell configurations. The many years since have seen numerous breakthroughs (including the advent of quantum mechanics and atomic physics) and witnessed a remarkable understanding of how the quintessential simple periodic structure of crystals accounts for many of their properties. However, while simple solids form a fundamental pillar of current technology (e.g., the transistor, whose invention was made possible by an understanding of the electronic properties of nicely ordered periodic crystals and chemical substitution therein), there are many other complex systems whose understanding is extremely important yet still lacking. The discovery of salient features of these materials across all scales is important for both applied and basic science. The recognized significance of this problem engendered the Materials Genome Initiative [17]—a broad effort to develop infrastructure for accelerating materials innovation. This work discusses a path towards solving this problem in complex amorphous materials. The framework that we will principally suggest is that of multi-scale community detection. This approach does not invoke assumptions as to which system properties are important, nor does it construct minimal toy models based on such assumptions. The insightful guess-work that is typically required to describe complex materials is, in the work that we review, replaced by a computerized variant of the “wisdom of the crowds” phenomenon [18]. The key concepts underlying this approach may be applied to general hard problems beyond those concerning the structure of materials or even general data mining. In the next section, we review an “information theoretic ensemble minimization” method that may be suited for such tasks.

6.2 Ensemble Minimization

Before delving into complex material and network analysis, we first discuss a general strategy for solving hard problems. The concept underlying this approach is perhaps best conveyed by a simple cartoon such as that sketched in Fig. 6.1a. In this illustration, each sphere corresponds to an individual solver (or “replica”) that explores an energy landscape. On its own, each such sphere might get stuck in a local energy minimum. The collective ensemble of solvers may, however, thwart such situations more readily than the same single-solver algorithm [21]. In Fig. 6.1b, the individual solvers not only roam the energy landscape but also interact amongst themselves, as schematically denoted by springs. If a single solver gets stuck in a false minimum, the other solvers may “pull it out” and explore broader regions of the energy landscape.

Fig. 6.1 The spheres in panel (a) of the figure depict solvers (or “replicas”) independently navigating the energy landscape defined by (6.2). Strong correlations among the replicas indicate a stable, well-defined partition. We evaluate agreement among all replica pairs using the information correlations (Sect. 6.4). In panel (b), interactions between the replicas assist the ensemble in finding optimal low energy states

This collective evolution of individual solvers is quite natural and has appeared in different guises across many fields. In anthropological contexts, this basic principle is known as the “wisdom of the crowds” [18]. That is, the crowd or ensemble of individuals might do far better than a single solver. Unlike ensemble-related approaches such as swarm intelligence [22] or genetic [23] algorithms, relevant problems in our context do not focus exclusively on minimizing a given energy function. Rather, we will try to maximize information theory correlations [the effect of the springs in Fig. 6.1b] while simultaneously minimizing a cost function [20]. If all (or many) solvers agree on a particular candidate solution, then that solution may naturally arise in many instances and may be of high importance regardless of whether or not it is the absolute minimum of the energy. In the physical problems that we will consider—that of finding natural structures in materials—these considerations are pertinent. The above discussion is admittedly abstract and may, in principle, pertain to any general problem. We next briefly explain the basic mathematical framework—the community detection problem—in which we will later couch the material structure detection endeavor.

6.3 Community Detection and Data Mining

Community detection pertains to the quest of partitioning a given graph or network into its optimally decoupled subgraphs (or so-called communities), e.g., [24–37]. As the reader may anticipate, given the omnipresence of networks and the generality of this task, this problem appears in disparate arenas including biological systems, computer science, homeland security, and countless others. In what follows, we introduce some of the key elements of community detection. The graphs of interest will be composed of nodes, where a node is a fundamental element of an abstracted graph. An edge in the graph is a defined relationship between two nodes. Edges may be weighted or unweighted, the unweighted case being the one most commonly examined. In our applications, we will need to assign weights to the edges in the graph, as we will describe. Similarly, in general applications, edges may be either symmetric or directed. Now we come to a basic ingredient of community detection. A community corresponds to a subset of nodes that are more cohesively linked (or densely connected, for unweighted edges) within their own community than they are to other communities. The above definition might seem a bit loose. Indeed, there are numerous formulations of community detection in the literature. As one may intuitively expect, most of these do, more or less, the same thing. When clear community detection solutions exist, all algorithms quantify the structure of large complex networks in terms of the smaller number of their natural cohesive components. Rather general data structures may be cast in terms of abstract networks. Thus, the community detection problem and other network analysis methods can have direct implications across multiple fields.
Indeed, we will elaborate how this occurs for image segmentation and material analysis.

Fig. 6.2 A small network partition where individual communities are represented by different node shapes and colors. “Friendly” or “cooperative” relations are depicted by solid, black lines. These are modeled as ferromagnetic interactions in (6.2). “Missing” or “undefined” relations work to break up well-defined communities, so they are modeled with anti-ferromagnetic interactions, meaning they are repulsive in terms of their energy contributions. The physical energy model trivially extends to more general relations, including weighted and adversarial relations (not depicted here)

In what follows we will briefly review the rudiments of an “Absolute Potts Model” method for community detection [19] that avoids a “resolution limit” exhibited by an insightful earlier Potts model [38]. To cast things generally, we make a simple observation underlying the “Potts” characterization. Any partition of the numbered nodes i = 1, 2, 3, ..., N into q different communities (the ultimate objective of any community detection algorithm) is an assignment i → σ_i, where the integer 1 ≤ σ_i ≤ q denotes the community number to which node i belongs. With a characterization {σ_i} in hand, we next construct an energy functional. To illustrate the basic premise, we first consider an unweighted graph—one in which the link strength A_ij between two nodes i and j is A_ij = 1 if an edge is present between the two nodes and A_ij = 0 if there is no link. As Fig. 6.2 demonstrates, for each pair of nodes there are four principal cases to consider. That is, either (i) the two nodes belong to the same community and have an “attraction” between them (i.e., A_ij = 1), (ii) two nodes in the same community have a missing link between them (A_ij = 0), (iii) the two nodes belong to different communities yet nevertheless exhibit cohesion between themselves (A_ij = 1), or (iv) nodes i and j belong to different communities and have no edge connecting them (A_ij = 0). Situations (i) and (iv) agree with the intuitive expectation that nodes in the same community should be connected to one another while those in different communities ought to be disjoint. We may take these four possibilities as the foundation of an energy function. That is, any given pair of nodes may be examined to see which of these categories it belongs to. Thus, a contending cost function is given by the Potts model Hamiltonian

H = -\frac{1}{2} \sum_{i \neq j} \left[ A_{ij}\, \delta(\sigma_i, \sigma_j) + \gamma\, (1 - A_{ij}) \bigl(1 - \delta(\sigma_i, \sigma_j)\bigr) \right].    (6.1)

In (6.1), δ(σ_i, σ_j) is a Kronecker delta (i.e., δ(σ_i, σ_j) = 1 if σ_i = σ_j and 0 otherwise) and γ is a “resolution parameter” that will play a notable role in our analysis. Before turning to the origin of the name of this parameter, we observe that, subtracting an innocuous additive constant, (6.1) is trivially

H = -\frac{1}{2} \sum_{i \neq j} \left[ A_{ij} - \gamma\, (1 - A_{ij}) \right] \delta(\sigma_i, \sigma_j).    (6.2)

As (6.2) makes clear, by virtue of the Kronecker delta δ(σ_i, σ_j), the sum is local—i.e., it includes only intra-community node pairs. The Hamiltonian of (6.2) may be minimized by a host of methods. In practice, when the solution of the problem is easy to find, nearly all viable approaches will yield the same answer.
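To make (6.2) and its minimization concrete, the following is a minimal numpy sketch—an illustration under stated assumptions, not the authors' implementation—of evaluating the Potts energy for a candidate assignment {σ_i} and of lowering it by single-node moves in the spirit of the greedy procedure of [19, 20] described next. The toy graph, parameter values, and function names are hypothetical.

```python
import numpy as np

def potts_energy(A, sigma, gamma):
    """Energy of (6.2): H = -1/2 * sum_{i != j} [A_ij - gamma*(1 - A_ij)] * delta(sigma_i, sigma_j)."""
    same = (sigma[:, None] == sigma[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)                  # exclude i == j terms
    return -0.5 * np.sum((A - gamma * (1.0 - A)) * same)

def greedy_minimize(A, gamma, n_sweeps=20, seed=0):
    """Start from one community per node (cf. step (a) below) and repeatedly move
    a node into a neighbor's community whenever the move lowers the energy."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    sigma = np.arange(n)                         # every node starts in its own community
    for _ in range(n_sweeps):
        moved = False
        for i in rng.permutation(n):             # visit nodes in random order
            neighbors = np.flatnonzero(A[i])
            if neighbors.size == 0:
                continue
            best_label, best_e = sigma[i], potts_energy(A, sigma, gamma)
            for lab in np.unique(sigma[neighbors]):
                trial = sigma.copy()
                trial[i] = lab
                e = potts_energy(A, trial, gamma)
                if e < best_e:                   # accept only energy-lowering moves
                    best_label, best_e = lab, e
            if best_label != sigma[i]:
                sigma[i] = best_label
                moved = True
        if not moved:                            # stop when no move lowers the energy
            break
    return sigma

# Toy graph: two 5-node cliques joined by a single edge.
A = np.zeros((10, 10))
A[:5, :5] = 1; A[5:, 5:] = 1; np.fill_diagonal(A, 0); A[4, 5] = A[5, 4] = 1
labels = greedy_minimize(A, gamma=1.0)
print(labels, potts_energy(A, labels, 1.0))
```

For this toy graph one typically expects the two cliques to emerge as separate communities for γ of order unity; the point of the sketch is only to show how cheaply candidate partitions and their energies can be generated.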
Amongst many others, two approaches are afforded by spectral methods [in which the discrete Potts model spins are effectively replaced by continuous spherical model (or large n) spins] and a conceptually more primitive steepest descent type approach. A simple incarnation of the relatively successful greedy algorithm [19, 20] that extends certain ideas introduced in [29] is given by the following steps: (a) Initially, each node forms its own community [i.e., if there are N (numbered) nodes then there will be q = N communities]. (b) A node (whose number is i 1 ) is chosen stochastically and then another edge sharing node i  is picked at random. (c) If it is energetically profitable to move the node i  together into the group formed by i 1 then this is done (otherwise community assignments are unchanged). (d) Yet another node i 2 is next chosen and once again it is asked whether moving yet another node into the community of i 2 lowers the energy. As earlier mentioned, if this change lowers the energy of (6.2), the nodes will be merged. Otherwise no change will be made. (e) In this manner, we cycle through each of the N nodes and repeat as necessary. (f) The process stops and a candidate partition is found once all further possible mergers do not lower the energy further. As the reader can appreciate, such a simple algorithm lowers the energy until the system becomes trapped in a local minimum. To improve the accuracy (i.e., further lower the energy of candidate solutions), one may repeat the above steps a finite number of times for a finite number of trials—i.e., repeat the above when vertices i 1 , i 2 , . . . , i N are chosen in a different random order to see if a lower energy solution may result. For the wide range of examined problems, the number of trials for each replica of the system is typically on the order of ten or smaller. When approaching the “hard phase” (to be discussed in Sect. 6.6) with multiple false minima, an increase in the number of trials may likely further increase the accuracy (this rise in the accuracy was termed the “computational susceptibility” in [20, 61]). Typically, elsewhere the improvement in the precision due to a further increase in the number of trials is nearly nonexistent (see, e.g., Fig. 13 in [20]). Further embellishments of the bare algorithm outlined above, include the acceptance of zero energy moves and other refinements [19]. Other illuminating greedy type approaches for the inference of community structure have been advanced, e.g., [39]. 6 Inference of Hidden Structures in Complex … 121 6.4 Multi-scale Community Detection We now turn to “multi-scale” community detection, e.g., [20, 40–45]. In certain notable approaches, e.g., [45], detection of scale is performed without the resolution parameter but rather by examining the effects of thermal fluctuations in a pure ferromagnetic system (one sans the antiferromagnetic interaction present in the second term of (6.2)), and other considerations elsewhere. In what follows, we will build on the ideas introduced in Sect. 6.3 that lead to an accurate determination of structure on diverse pertinent scales. To understand the physical content of the resolution parameter (and the origin of its name) in (6.2), we consider several trivial limits. First, we focus on the case of γ = 0. In such a situation, the energy of (6.2) is minimized when all nodes belong to a single community. 
This is the lowest energy solution since each intra-community link lowers the energy [the first term of (6.2)], but there is no energy penalty from any missing links between nodes in the same community since the second term in (6.2) is trivially zero. Thus, in order to maximize the number of internal links it is profitable to assign all nodes to the same community. In the diametrically opposite limit—that of γ → ∞, the energy penalty diverges unless every pair of nodes belonging to the same community share a link. Thus, in this limit, the lowest energy states are those in which the system fragments into (typically) a large number of communities where each node is connected to all other nodes in its community. That is, the communities are “perfect cliques.” As γ is monotonically increased from zero, the ground states of (6.2) lead to communities that veer from the extreme global case (γ = 0) to the limit of many disparate densely internally connected local communities (γ → ∞). Putting all of the pieces together, the reader can see why γ is inherently related to the intra-community edge density and thus is indeed a “resolution parameter”. At this stage, it is not yet clear which values γ should be assigned in order to lead to the most physically pertinent solutions. The non-uniqueness of γ is, actually, a virtue of the Potts model based approach of (6.2). That is, in general, there may be several relevant resolution scales that lead to different insightful candidate low energy partitions of this Hamiltonian. This is the situation which is schematically depicted in Fig. 6.3 for a synthetic system that exhibits a hierarchical structure. In such cases as γ is increased, the minima of (6.2) unveil different resolutions in the hierarchy. In practice, the multi-resolution community-detection method [20] systematically infers the pertinent scale(s) by information-theory-based correlations [46–49] between different independent solvers (or “replicas”, as discussed in Sect. 6.2) of the same community detection problem. In most studied systems, the number of replicas used is s ≤ 12. As alluded to in Sect. 6.3, the lowest energy solution amongst a fixed number of trials is taken for each of the individual replicas. If these solvers (i.e., the replicas) strongly concur with each other about local or global features of the solution [20], then these aspects are likely to be correct. Such an agreement between solvers is manifest in the information correlations. Information theory extrema [50–52] then provide all relevant system scales. 122 Z. Nussinov et al. Fig. 6.3 A partition of a synthetic network with 256 nodes having three resolution levels [19]. The random edge density (fraction of edges connecting pairs of points in different communities) is 10 % on the global scale. At increasing resolution there are five groups with an inter-community edge density of 30 %. At the highest resolution, these five groups are further split into small sub clusters (16 in total) each having an internal edge density of 90 %. As described in Sect. 6.4, a multi-resolution algorithm may identify different categories of partitions in hierarchical systems. See Fig. 6.4 for a demonstration of how the multiresolution algorithm accurately isolates both levels of the hierarchy Figure 6.4 shows the results of our analysis as the resolution parameter γ is varied for the synthetic system of Fig. 6.3. 
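Before describing the figure in detail, it may help to see how such inter-replica measures are computed. The following is a minimal sketch under stated assumptions (natural logarithms and normalization of the mutual information by the mean entropy—conventions vary, and [46–48] give the definitions actually used); the function name and toy partitions are hypothetical.

```python
import numpy as np

def partition_measures(sig_a, sig_b):
    """Mutual information I, normalized mutual information NMI, and
    variation of information VI between two community assignments."""
    _, inv_a = np.unique(sig_a, return_inverse=True)
    _, inv_b = np.unique(sig_b, return_inverse=True)
    n = len(sig_a)
    joint = np.zeros((inv_a.max() + 1, inv_b.max() + 1))
    np.add.at(joint, (inv_a, inv_b), 1.0)        # contingency table of co-assignments
    p_ab = joint / n
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    nz = p_ab > 0
    I = np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a[:, None] * p_b[None, :])[nz]))
    H_a = -np.sum(p_a[p_a > 0] * np.log(p_a[p_a > 0]))
    H_b = -np.sum(p_b[p_b > 0] * np.log(p_b[p_b > 0]))
    NMI = I / (0.5 * (H_a + H_b)) if (H_a + H_b) > 0 else 1.0
    VI = H_a + H_b - 2.0 * I
    return I, NMI, VI

# Two replicas that agree on the coarse split but differ in the finer detail:
rep1 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
rep2 = np.array([0, 0, 2, 2, 1, 1, 1, 1])
print(partition_measures(rep1, rep2))
```

Averaging I, NMI, and VI over all replica pairs at each value of γ yields curves of the kind plotted in Fig. 6.4.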
Plotted in Fig. 6.4 are three information theory correlations between replicas—the average inter-replica variation of information (VI), the mutual information (I), and the normalized mutual information (NMI)—together with the total number of communities (q) found for different values of γ and the Shannon entropy (H) averaged over the replicas. Transitions between viable solutions are evident as jumps in the number of communities q and, most notably, as transitions between crisp plateaux in the information theory measures. As shown, each of the plateaux in Fig. 6.4 corresponds to a different level of the hierarchy of the synthetic network in Fig. 6.3. Similar to our discussion in Sect. 6.3, in practice the replicas differ from one another in the order in which consecutive vertices are picked and moved so as to minimize the energy of (6.2). Thus, any given problem has an ensemble of very similar (or nearly identical) viable solutions associated with it. A detailed summary of this approach appears in [20]. In accord with the above explanation, as γ is increased, the associated candidate energy minima partition the system into more local, smaller communities (deeper levels of the hierarchy). The inter-replica information theory correlations further afford a measure of the quality of the viable partitions. High NMI values (i.e., close to unity) indicate solutions that are likely to be pertinent. In the spirit of Sect. 6.2, if the different replicas all agree with one another on a putative partition, then that partition is likely to be physically meaningful. The variation of information measures the disparity between candidate solutions; thus the VI values are high between different NMI plateaux and low within them.

Fig. 6.4 Information theoretic and other metrics of the multiresolution algorithm in Sect. 6.4 as applied to the synthetic partition depicted in Fig. 6.3 [20]. In the top panel, the average inter-replica normalized mutual information (I_N), the (un-normalized) mutual information (I), and the number of clusters (or communities) q are plotted as a function of the resolution parameter γ. In the bottom panel, the Shannon entropy (H) and the average inter-replica variation of information (V) are further provided. As described in the text, stable partitions lead to plateaux (or more general local extrema) in the inter-replica information theory and other correlations as a function of the resolution parameter. Two such candidate resolutions (marked (i) and (ii)) are seen in both panels (a) and (b). These plateaux show how the multiresolution algorithm may isolate both level 2 (superclusters) and level 3 (smallest clusters) of the hierarchy of Fig. 6.3

6.5 Image Segmentation

Our goal is to identify structure in materials, but before turning to this endeavor, we first illustrate how patterns may, literally, be revealed by community detection. The ideas underlying this objective will elucidate our approach to material genomics. The aim of image segmentation [52–58] is to divide a given digital image into separate objects (or segments) based on visual characteristics. Two somewhat challenging examples are provided in Fig. 6.5 [59, 60]. To transform the problem into that of community detection, we map a digital image into a network as follows.
(1) Each pixel in an image is regarded as a node in a graph. (2) The edge weights between nodes in the graph are determined by the degree of similarity between the additive color RGB (i.e., the Red, Green, and Blue) strengths of individual pixels or, more generally, of finite-size boxes geometrically centered about a given pixel. The bare edge strengths may be embellished and replaced by weights set by the Fourier weights associated with finite-size blocks about a given node. Alternatively, we can use exponential weighting of the inter-node edge strength based on the geometric distance between them (the distance between the centers of the finite-size blocks about them) [52]. The edge value assignment is such that if two pixels i and j (or boxes centered about them) have similar RGB values (or absolute Fourier magnitudes), then a function V_ij set by these differences will be small. Analogously, if nodes i and j (or boxes centered around them) are dissimilar, then V_ij will become large.

Fig. 6.5 Examples of the image segmentation challenges [59, 60]. Left: a zebra against a similar “stripe” background. Right: a dalmatian dog. Most people do not initially recognize the dog before being given clues as to its presence. Once the dog is seen, it is nearly impossible to perceive the image in a meaningless way

With such functions V_ij at hand, a simple generalization of (6.2) is given by

H = \frac{1}{2} \sum_{s=1}^{q} \sum_{i, j \in C_s} (V_{ij} - V) \left[ \Theta(V - V_{ij}) + \gamma\, \Theta(V_{ij} - V) \right].    (6.3)

Here, Θ(x) is the Heaviside function (Θ(x) = 1 for x > 0 and Θ(x) = 0 for x < 0) and V is an adjustable background value. As the astute reader undoubtedly noticed, the locality constraint imposed by the Kronecker delta in (6.2) has been made explicit in (6.3) by having only intra-community sums for each of the q communities {C_s}. Details of the construction of the weights V_ij are provided in [52]. Following our more colloquial description here, there are four or five adjustable parameters in (6.3): the resolution parameter γ, the background value V, the block size L centered about each pixel (or, more generally, rectangular blocks of size L_x × L_y), and the pixel distance over which the pixel interconnection function V_ij decays. Once these are set, the earlier community detection algorithm of Sect. 6.3 may be applied. The determination of the optimal value(s) of these parameters may be performed using the same procedure outlined in Sect. 6.4.

Fig. 6.6 The application of the multiresolution algorithm to the segmentation of the zebra and dalmatian dog images of Fig. 6.5. The results correspond to typical partitions found with the optimal parameter set. The first and the second rows contain “camouflages” of a similar style. We are able to detect the boundary of the zebra and discern the body and hind legs of the dog, albeit with some “bleeding” [52]

While systems such as the synthetic hierarchical network of Fig. 6.4 exhibit well-defined plateaux in the information theory and other measures, we found more generally that the optimal values of the parameters z correspond to local extrema whereby variations in the parameters do not alter the outcome. That is, if Q is a measured quantity of interest (e.g., information theory correlations, the Shannon entropy, or the energy associated with the given Hamiltonian), then optimal parameters z are found by the requirement that ∇_z Q = 0.
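As an illustration of how such extrema can be located in practice, the sketch below scans one component of z (here γ, with the others held fixed) and flags runs over which a measured quantity Q is nearly constant. The tolerance, the synthetic NMI curve, and the function name are all hypothetical.

```python
import numpy as np

def flag_plateaux(gammas, Q, rel_tol=0.02):
    """Flag contiguous runs of gamma over which a measured quantity Q
    (e.g. the average inter-replica NMI) is nearly constant.  Such flat
    stretches are the discrete analog of grad_z Q = 0."""
    Q = np.asarray(Q, dtype=float)
    scale = np.max(np.abs(Q)) + 1e-12
    flat = np.abs(np.diff(Q)) < rel_tol * scale
    runs, start = [], None
    for k, is_flat in enumerate(flat):
        if is_flat and start is None:
            start = k
        if start is not None and (not is_flat or k == len(flat) - 1):
            end = k + 1 if is_flat else k
            runs.append((gammas[start], gammas[end]))
            start = None
    return runs

# Hypothetical scan: a high-NMI plateau at small gamma, a second plateau,
# and a noisy tail where the replicas no longer agree.
gammas = np.logspace(-1, 1, 41)
rng = np.random.default_rng(0)
nmi = np.where(gammas < 0.5, 0.99,
               np.where(gammas < 3.0, 0.95, 0.6 + 0.05 * rng.random(41)))
print(flag_plateaux(gammas, nmi))
```

The flat runs returned by such a scan play the role of the extrema ∇_z Q = 0 above; in practice each component of z can be scanned in turn.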
These may lead to multiple viable solutions corresponding to very different meaningful partitions. In practice, we found that in all but the hardest cases, meaningful solutions are found when arbitrarily setting all parameters to a fixed value and that, similar to Sect. 6.4, the multi-scale solutions may be found by only varying the resolution parameter γ . The results of our method are given in Fig. 6.6; these correspond to typical partitions found with the optimal parameter set. The above image analysis ideas may be applied for the detection of the primitive cells in simple Bravais lattices, the inference of domain walls in spin systems, and hierarchical structures in quasicrystals [52]. For a complete classification of contending partitions and, most notably, a deeper understanding of whether the found solutions are meaningful or not, it is useful to survey the canonical finite temperature phase diagram associated with (6.3) when all of the above parameters, including temperature, are varied. In the current context, by “temperature”, we allude to the finite temperature study of the 126 Z. Nussinov et al. Hamiltonian of (6.2) either analytically or via a thermal bath associated with, e.g., the acceptance of the moves in the algorithm outlined at the end of Sect. 6.3 [50, 52, 61–63]. 6.6 Community Detection Phase Diagram As the bare edge weights and additional parameters setting the values of Vij in the Hamiltonian of (6.3) and temperature are modified, quantities such as the system energy, Shannon entropy, the number of communities, and information theory correlations amongst the found ground states generally attest to the presence of multiple phases. Additional metrics including the “computational susceptibility” (the change in the average inter-replica NMI as the number of trials, see Sect. 6.3, is increased [20, 61, 62]), the time required for convergence (when attainable), and the ergodic/nonergodic character (“chaotic” type feature) of the dynamics all delineate the very same phase diagram boundaries inferred from each of the examined quantities. Information theory measures have been used to study other specific interesting systems, e.g., [64]. The observed phases in the community detection problem naturally extend to finite temperatures (T ) when the analysis of the system defined by the Hamiltonian of (6.3) is broadened to include positive temperatures. Finite size systems such as the real networks and images that we discuss cannot exhibit thermodynamic phase transitions and all finite temperature functions are analytic. Nevertheless, practically, sharp changes appear as temperature and other parameters are varied. Similar to other NP hard [65] combinatorial optimization problems [66–68], three prototypical phases were established in general community detection problems with a distribution of varying community sizes [61]. Subsequently, these have been beautifully explored in depth in several specific graph types—most notably the so-called “stochastic block models”, in which a graph has equal size communities e.g., [69– 72] and in other penetrating works, e.g., [73–75]. Earlier signatures of a bona fide transition in stochastic block and power law distributed models [19, 20] and limits on detectability in the stochastic block model via the cavity approximation were suggested [76]. 
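The passage toward undetectability can be caricatured in a few lines. In the sketch below—an illustration only, in which off-the-shelf spectral clustering stands in for the Potts/replica machinery and all parameter values are hypothetical—a two-community stochastic block model is generated with increasing inter-community edge probability and the recovered partition is compared against the planted one.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

def planted_two_block_graph(n_per_block=50, p_in=0.5, p_out=0.05, seed=0):
    """Two-community stochastic block model: dense within blocks, sparse between."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_block
    labels = np.repeat([0, 1], n_per_block)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                   # undirected, no self-loops
    return A, labels

# Sweep the inter-block edge probability toward the intra-block value:
for p_out in [0.02, 0.1, 0.2, 0.3, 0.4, 0.5]:
    A, truth = planted_two_block_graph(p_in=0.5, p_out=p_out)
    found = SpectralClustering(n_clusters=2, affinity="precomputed",
                               random_state=0).fit_predict(A)
    print(p_out, normalized_mutual_info_score(truth, found))
```

As p_out approaches p_in, the NMI of the recovered partition falls from near unity toward zero, a toy version of the progression from the solvable to the unsolvable regime discussed below.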
To intuitively highlight the essential character of the prototypical phases with a minimum of jargon, we will colloquially term these the “easily solvable”, the “solvable hard”, and the “unsolvable” phases. In realistic finite yet very large scale systems [62, 63] various results can be established, and these may be further examined in various limits. Of course, bona fide transitions formally occur only in the thermodynamic limit. A trivial behavior results in infinite-size graphs when the average number of nodes per community is of finite size [62, 63]. As one would expect, typically all community detection problems are either solvable or unsolvable. In NP-hard problems, the solvable phase splinters into an “easy” and a “hard” phase. When the edge weights set by V_ij are associated with sharp community detection partitions, then finding a natural solution is rather trivial (and nearly all algorithms, not only the Potts model described here, will readily unearth such an answer). On the other hand, if the couplings V_ij are sufficiently “noisy” so as to be of effectively the same strength for edges between nodes in the same putative community as for edges linking nodes belonging to different supposed communities, then no well-defined community detection solutions exist. Similarly, at sufficiently high temperatures, in most cases, all traces of the structures found in the ground state(s) are lost. The most common variant of the community detection problem has been proven to be NP complete [33]. As in disparate NP problems [68], it was found that in broad classes of the community detection problem (and in its image segmentation variant) [52, 61–63, 69, 71, 73, 75], lying between the extremities of the “easy” and “unsolvable” phases there often exists a “hard” phase; in this phase, solutions exist, but due to the plethora of competing states, they may be extremely hard to find. Information theory measures may be used to delineate phase boundaries [52, 61–63]. Using information theory correlations and the global Shannon entropy, we show, in Figs. 6.7 and 6.8 respectively, the phase diagram associated with the image shown in the upper lefthand side of Fig. 6.9.

Fig. 6.7 The normalized mutual information I_N as a function of the resolution log(γ) and temperature T for the “bird” image in the upper lefthand panel of Fig. 6.9. We mark the “easy” phase (where I_N is almost 1) as “A”, the “hard” phase (where I_N decreases) as “B”, and the “unsolvable” phase (where I_N forms a plateau whose value is less than 1) as “C”. The “easy-hard-unsolvable” phases will be further confirmed by the corresponding image segmentation results in Fig. 6.9, as these appear, respectively, in panels A, B, and C therein

In the solvable phase(s), typically, all partitions produced by parameters that lie in the same basin lead to qualitatively similar results. Moderate temperature and/or disorder can lead to order-by-disorder or annealing effects (similar to those found in other systems, e.g., [77–81]). However, at sufficiently high temperatures and/or upon the introduction of noise about the initial V_ij values, the system will be in the unsolvable phase. By carefully studying the system phase diagram and the character

Fig. 6.8 The Shannon entropy H as a function of the resolution log(γ) and the temperature T for the “bird” image in the upper lefthand panel of Fig. 6.9.
The signatures of the three phases “easy”, “hard” and “unsolvable” are easily detected in this phase diagram and agree with those ascertained via the normalized mutual information of Fig. 6.7 and magnitude of the information theory overlaps or thermodynamic functions such as the internal energy and entropy as well as the dynamics, one may assess whether the perceived community detection solutions may be meaningful. When applied to image segmentation, the consistency of this procedure may be inspected visually and intuitively judged sans complicated analysis. 6.7 Casting Complex Materials and Physical Systems as Networks With all of the above preliminaries, we now finally turn to the ultimate data mining objective of this work: that of the important detection of spatial and temporal structure in complex materials and other systems [50, 51, 82–87]. This problem shares a common conceptual goal with image segmentation yet is, generally, far more daunting for human examination. Similar to the analysis presented thus far, the approach that we wish to discuss casts physical systems as graphs in space or space-time and then employs the above discussed multi-scale community detection to determine meaningful partitions. 6 Inference of Hidden Structures in Complex … 129 Fig. 6.9 The image segmentation results of the “bird” image. The original image is on the upper left. The other images denoted as “A”, “B”, and “C” correspond to the image segmentation results with different parameter pairs (log(γ ), T ) marked in *. Both result A and B are able to distinguish the bird from the “background”. However in panel B, the bird is composed of lots of small clusters. Result C is unable to detect the bird. Thus, the results shown here demonstrate the corresponding “easy-hard-unsolvable” phases in the phase diagram in Figs. 6.7 and 6.8. From [52] In this case, nodes in the graph code basic physical units of interest (e.g., atoms, electrons, etc.). Multi-particle interactions or experimentally measured correlations in the physical system are then ascribed to edge weights Vij between the nodes (for two-particle interactions or experimentally measured pair correlations [50, 51]), or to three-node triangular weights (for three-particle interactions or correlations) Vijk , and so on. Given these static or time-dependent weights, the graph is then (similar to the discussion in earlier sections) partitioned into “communities” of nodes (e.g., clusters of atoms) that are more tightly linked to or correlated with each other than with nodes in other clusters [19]. As in the earlier examples explored in this work, information theory based multi-scale community detection provides both local structural scales (e.g., primitive lattice cell, nearest neighbor distance, etc.) as well as global scales (such as correlation lengths) and any other additional intermediate scales if and when these are present. The results of this approach for a two-dimensional Lennard-Jones system with vacancies are shown in Fig. 6.10. When the edge weights between nodes are set equal to the Lennard-Jones strength associated with the distance between them, the multiscale community detection algorithm recognizes both the typical triangular unit cells as well as larger scale domains (communities) in which the vacancy defects tend, on average, to lie on their boundaries. Partitions in which defects tend to aggregate 130 Z. Nussinov et al. Fig. 6.10 A diluted two-dimensional Lennard Jones system with edge weight set equal to the pair interaction energies. 
The ground state of a two-dimensional Lennard-Jones model is a triangular lattice in which the lattice spacing is equal to the distance at which the Lennard-Jones potential attains its minimum. In this figure, the triangular lattice is diluted by introducing defects in the form of static vacancies (denoted by white holes). The found community boundaries intuitively relegate the defects to the periphery of these domains [50]

at the domain boundaries are consistent with general expectations for stable domains and are intuitively appealing. As the reader may envisage, the community detection method may be extended to general many-body systems with different types of species (e.g., disparate ion types in metallic glass formers [50, 51]). One example, depicted in Figs. 6.11, 6.12, and 6.13, corresponds to a ternary Al88Y7Fe5 system based on a molecular dynamics simulation of 1600 atoms in which the edge weights were set by pair potentials. As seen in the partition of Fig. 6.13, for which the inter-replica information theory correlations were extremal and which lies in the solvable phase, large clusters were detected below the liquidus temperature (the temperature at which the system is an equilibrium liquid). Along similar lines, clusters may be identified across many problems. In Fig. 6.16 we show typical clusters found in a Kob-Andersen binary system. While for human analysis the complexity of identifying pertinent clusters may grow dramatically with the number of atom types, for the multi-resolution analysis there is no such increase (Figs. 6.14 and 6.15).

Fig. 6.11 From [50, 51]. In order to apply the algorithm in Sect. 6.4 to complex physical systems, we may generally define two types of replica sets. Panel (a) depicts a few nodes as they appear for a static system—i.e., one with no time separation between simulation replicas (time-independent replicas). Panel (b) depicts a similar set of replicas, each separated by a successive amount of simulation time t (time-dependent replicas). In either case, we then generate the replica networks using the potential energy between the atoms as the respective edge weights in the network. Consequently, we minimize (6.1) using a range of γ values in the algorithm described in Sect. 6.4

Fig. 6.12 From [50]. A static snapshot from a molecular dynamics simulation of an Al88Y7Fe5 system of 1600 atoms that has been quenched from an initial temperature of 1500 K to 300 K and then allowed to partially equilibrate. The atoms are Y, Al, and Fe, respectively, in order of increasing diameters. In this figure, the atoms are color coded—Fe atoms are red and Y atoms are marked green

Fig. 6.13 The figure shows a static partition of Fig. 6.12. Here, different clusters are identified by individual colors. It is also possible to incorporate overlapping nodes in neighboring clusters to account for the possibility of multiple cluster memberships per node, yielding an interlocking system of clusters [50]

In a similar manner, the edge weights can be set by experimentally measured pair correlations. In [50], atomic configurations consistent with the experimentally determined scattering data for quenched Zr80Pt20 [3–6] were generated [50, 51] using Reverse Monte Carlo methods [7, 8]. At low temperatures, the structures found in all of these cases are typically far larger than the local patterns probed for and detected by current methods [88–92].
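To show how a particle configuration is cast as a weighted graph of this kind, the following is a minimal two-dimensional sketch: edge weights are set to the Lennard-Jones pair energies, with a cutoff beyond which no edge is assigned. The lattice patch, cutoff, and function names are hypothetical, and the actual studies [50, 51] work with the full simulated or experimentally constrained configurations.

```python
import numpy as np

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones pair potential."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6**2 - sr6)

def configuration_to_weights(positions, r_cut=2.5):
    """Edge weights V_ij set by the pair interaction energy between atoms i and j,
    with a cutoff beyond which nodes are left unconnected (weight 0)."""
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    V = np.zeros_like(r)
    mask = (r > 0) & (r < r_cut)
    V[mask] = lennard_jones(r[mask])
    return V                                      # this V_ij feeds the weighted Potts model

# Toy configuration: a small patch of a triangular lattice with one vacancy.
a = 2.0 ** (1.0 / 6.0)                            # distance at the LJ minimum
pts = [np.array([i + 0.5 * (j % 2), j * np.sqrt(3) / 2]) * a
       for i in range(6) for j in range(6)]
pts.pop(14)                                       # remove one site -> a static vacancy
V = configuration_to_weights(np.array(pts))
print(V.shape, V.min())
```

Feeding such a weight matrix V_ij into the multi-scale community detection of Sect. 6.4 is what yields partitions of the kind shown in Figs. 6.10 and 6.13.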
Four-point correlations have long been employed to ascertain spatio-temporal scales and to quantify “dynamical heterogeneities”, e.g., [91, 93]. A long-standing challenge is the identification of structures of general character and scale in amorphous systems. There is, in fact, a proof that as a supercooled liquid falls out of equilibrium to become amorphous, there must be an accompanying divergent length scale [94]. Methods of characterizing local structures [9–12] center on a given atom or link; as such, they are restricted from detecting general structures. Because of the lack of a simple crystalline reference, the structure of glasses is notoriously difficult to quantify beyond the very local scales. In [50–52], graph weights were determined empirically (potentials in a model system, experimentally measured partial pair density correlations in supercooled fluids, or pixels in a given image)—no theoretical input was invoked as to what the important scales should be or whether an exotic order parameter might be concocted. Similarly, in a time-dependent analysis of dynamically evolving systems, by employing replicas at different time slices as well as regarding the system as a higher-dimensional “image” in space-time, and by using the inter-replica information theory correlations, spatio-temporal patterns were found and time-dependent structures were quantified. In this approach, the data speak for themselves. We remark that, notwithstanding the aforementioned difficulties, extremely large growth of static structure was recently observed by far simpler network analysis in certain binary metallic glasses that exhibit crisp icosahedral motifs [96]. Similar to the description above, one may likely find other motifs in other systems. The problem is that guessing and then hopefully finding pertinent patterns can be extremely challenging to do by conventional analysis.

Fig. 6.14 The result of the multiscale community detection applied to a ternary glass former at a simulation temperature of T = 300 K [50, 51]. Both panels (a) and (b) on the left depict the information theory correlations between the replicas (as described in Sect. 6.4). In panel (c), each of the communities found is assigned a different color. These structures correspond to the Normalized Mutual Information (NMI) or Variation of Information (VI) extrema. These well-defined structures contrast sharply with the lack of cohesive features in Fig. 6.15

Fig. 6.15 The structure of the same ternary glass former as in Fig. 6.14 at a simulation temperature of T = 1500 K. Inter-replica information theory correlations are provided in panels (a) and (b). As is evident in panel (c), and in the significantly higher VI and lower NMI values as compared to those of Fig. 6.14, the corresponding structure is largely absent

Fig. 6.16 From [50]. A set of optimal clusters found in a low temperature Kob-Andersen system [95] in which two types of atoms (color coded red and silver) appear

6.8 Summary

In this work, we reviewed key features of a statistical-mechanics-based “community detection” approach to find pertinent features and structures (both spatial and temporal) in complex systems. In particular, we illustrated how this method may be applied to image segmentation and the analysis of amorphous materials. The demand for automated data mining approaches may become more acute with the ever-increasing availability of data on numerous complex systems. The study of complex materials may be extremely challenging to carry out by current conventional means that rely on guessed patterns, simplified models, or brute force human examination.

Acknowledgments We have benefited from interactions with numerous colleagues. In particular, we would like to thank S. Achilefu, S. Bloch, R. Darst, S. Fortunato, V. Gudkov, K.F. Kelton, T. Lookman, M.E.J. Newman, S. Nussinov, D.R. Reichman, and P. Sarder for numerous discussions and collaboration on some of the problems reviewed in this work and their outgrowths. We are further grateful for support by the NSF under Grants No. DMR-1106293 and DMR-1411229. ZN is indebted to the hospitality and support of the Feinberg foundation for visiting faculty program at the Weizmann Institute.

References

1. C.A. Angell, Formation of glasses from liquids and biopolymers. Science 267(5206), 1924–1935 (1995)
2. W.H. Zachariasen, The atomic arrangement in glass. J. Am. Chem. Soc. 54, 3841 (1932)
3. T. Nakamura, E. Matsubara, M. Sakurai, M. Kasai, A. Inoue, Y. Waseda, Structural study in amorphous Zr-noble metal (Pd, Pt and Au) alloys. J. Non-Cryst. Solids 312–314, 517 (2002)
4. J. Saida, K. Itoh, S. Sato, M. Imafuku, T. Sanada, A. Inoue, Evaluation of the local environment for nanoscale quasicrystal formation in Zr80Pt20 glassy alloy using Voronoi analysis. J. Phys. Condens. Matter 21, 375104 (2009)
5. D.J. Sordelet, R.T. Ott, M.Z. Li, S.Y. Wang, C.Z. Wang, M.F. Besser, A.C.Y. Liu, M.J. Kramer, Structure of ZrxPt100−x (73 ≤ x ≤ 77) metallic glasses. Metall. Mater. Trans. A 39A, 1908–1916 (2008)
6. S.Y. Wang, C.Z. Wang, M.Z. Li, L. Huang, R.T. Ott, M.J. Kramer, D.J. Sordelet, K.M. Ho, Short- and medium-range order in a Zr73Pt27 glass: experimental and simulation studies. Phys. Rev. B 78, 184204 (2008)
7. R.L. McGreevy, Understanding liquid structures. J. Phys. Condens. Matter 3, F9 (1991)
8. D.A. Keen, R.L. McGreevy, Structural modelling of glasses using reverse Monte Carlo simulation. Nature 344, 423–425 (1990)
9. H.W. Sheng, W.K. Luo, F.M. Alamgir, J.M. Bai, E. Ma, Atomic packing and short-to-medium-range order in metallic glasses. Nature 439, 419–425 (2006)
10. J.L. Finney, Random packings and the structure of simple liquids. I. The geometry of random close packing. Proc. R. Soc. Lond. Ser. A 319(1539), 479–493 (1970)
11. J. Dana Honeycutt, H.C. Andersen, Molecular dynamics study of melting and freezing of small Lennard-Jones clusters. J. Phys. Chem. 91, 4950–4963 (1987)
12. P.J. Steinhardt, D.R. Nelson, M. Ronchetti, Bond-orientational order in liquids and glasses. Phys. Rev. B 28, 784–805 (1983)
13. T.R. Kirkpatrick, D. Thirumalai, P.G. Wolynes, Scaling concepts for the dynamics of viscous liquids near an ideal glassy state. Phys. Rev. A 40, 1045–1054 (1989)
14. V. Lubchenko, P.G. Wolynes, Theory of structural glasses and supercooled liquids. Annu. Rev. Phys. Chem. 58, 235–266 (2007)
15. G. Tarjus, S.A. Kivelson, Z. Nussinov, P. Viot, The frustration-based approach of supercooled liquids and the glass transition: a review and critical assessment. J. Phys. Condens. Matter 17, R1143–R1182 (2005)
16. Z.
Part II Materials Prediction with Data, Simulations and High-throughput Calculations

Chapter 7 On the Use of Data Mining Techniques to Build High-Density, Additively-Manufactured Parts

Chandrika Kamath
Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, CA 94551, USA; e-mail: kamath2@llnl.gov

Abstract The determination of process parameters to build additively-manufactured parts with desired properties remains a challenge, especially as we move from machine to machine or process new materials. In this chapter, we show how we can combine simple simulations and experiments to iteratively constrain the design space of parameters, and quickly and efficiently identify parameters to create parts with >99 % density. Our approach is based on techniques from statistics and data mining, including design of physical and computational experiments, feature selection to identify important variables, and data-driven predictive models that can act as surrogates for the simulations.

7.1 Introduction

Additive manufacturing (AM), a process for fabricating parts layer by layer directly from a three-dimensional digital model, presents an opportunity for producing complex, individually-customized parts not possible with traditional manufacturing processes. While AM can reduce both the time to market and material waste, a number of technical issues must still be addressed before widespread use of AM technology becomes a reality. These include gaps in measurement methods, accuracy of AM parts, process optimization to quickly build parts with desired properties, and increased confidence in properties of parts fabricated using this process [16]. In this chapter, we focus on metal AM using selective laser melting (SLM), which is a powder-based AM process where a three-dimensional part is produced layer by layer by using a high-energy laser beam to fuse metallic powder particles. We are interested in developing an approach that can be used to identify process parameters that would result in high-density (>99 %) parts. We start by describing the process of laser powder-bed fusion and discuss the current approaches to optimizing AM parts
for high density. We then describe our approach that combines simple simulations and experiments using techniques from data mining and statistics. We illustrate our approach using 316L stainless steel as an example, and show that it is indeed possible to efficiently arrive at process parameters that result in high-density AM parts.

Fig. 7.1 Schematic illustrating the SLM process and some of the process parameters that influence the properties of a part

7.1.1 Additive Manufacturing Using Laser Powder-Bed Fusion

In SLM using metal powder-bed fusion, a three-dimensional digital model of the part is first sliced into two-dimensional layers, each of a specified thickness, usually in the range of 30–100 µm. Metal powder is then spread on a base plate and the first layer is created by selectively melting the powder in the locations indicated in the first slice of the part. The next layer of powder is then spread over the first layer and the powder melted in the regions corresponding to the second slice of the part. Thus, the part is built layer by layer, with the power and speed of the laser selected so that the energy density is sufficient to melt the powder and the layer below it, integrating the new layer into the rest of the part.

The design freedom afforded by AM comes with associated complexity. There are a large number of parameters, more than 130 by some estimates [19], that influence the process and thus the final quality of the part. Some of these parameters pertaining to the laser and the powder bed are shown in Fig. 7.1. The large number of parameters and the complex interactions among them make it challenging to determine the values that should be used to create parts with desired properties.

7.2 Optimizing AM Parts for Density: The Current Approach

There has been much work done in finding optimal parameters that result in additively-manufactured parts with >99 % density (see, for example, the summary in [10] for the work done in 316L stainless steel). Initially, the approach taken was an experimental one, where small cubes were built to understand how various process parameters, such as powder quality, layer thickness, laser power, laser speed, and scanning strategies, would influence the density and surface roughness of a part [13, 18]. Other efforts performed a very systematic study, carefully identifying the factors that influenced the density, surface roughness, and mechanical properties of a part, and using micrographs and various measurements to understand the effects of these factors [21]. Since much of this work was done using systems with relatively low laser powers of 50–100 W, the design space spanned by laser power and speed was not very large, making optimization through experimentation a practical option. A slightly different approach was taken by Kempen et al. in their study of process optimization for AlSi10Mg. They started with single-track experiments [20], where single tracks are made on a layer of powder using a range of laser power and speed values. The resulting melt-pool characteristics were then analyzed to identify a process window for use in optimization.
Tracks considered for inclusion in the window were those that met certain constraints, such as track continuity, a large height of the track to build up the part, and a connection angle of near 90° with the previous layer, so that the part would be of high density and dimensionally accurate. A similar approach was also taken by Laohaprapanon et al. [14], who used single-track experiments to narrow the space of power and speed values to use in building cubes for density optimization.

More recently, with higher-powered lasers and new scan strategies expanding the design space, techniques from statistics, including the design and analysis of experiments [4, 17], have started playing a role in systematic studies to understand the influence of the parameters on various properties of the parts. For example, Delgado et al. [2] used a full factorial experimental design with three factors (layer thickness, scan speed, and build direction) and two levels per factor in their study on part quality for a fixed laser power. The outputs of interest were dimensional accuracy, mechanical properties, and surface roughness. The results of the experiments were analyzed using an ANOVA (ANalysis Of VAriance) approach to understand the effects of various factors on the outputs.

To complement the insight gained into SLM using experiments, scientists are also using computer simulations to understand the relationship between processing parameters and the thermal behavior of the material as it is melted by the laser [5, 7, 11, 15]. When these three-dimensional simulations include various aspects of the physics underlying SLM, they can be quite expensive to run, even on high-performance computer systems. Our approach builds on these ideas and uses both simulations and experiments, combining the insight from each using statistics and data mining techniques. Our goal is to reduce the time it takes to determine the process parameters required to build high-density parts.

7.3 A Data Mining Approach Combining Experiments and Simulations

Despite the wealth of literature on parameters used to create high-density parts with commonly-used materials, such as 316L stainless steel, it is still a challenge to determine the appropriate parameters to use as we move from one machine to another with different power ranges or beam sizes, change powder sizes, or work with new materials. Our work was motivated by the fact that our AM machine, a Concept Laser M2 system, had a relatively narrow beam, with D4σ = 54 µm, and a maximum power of 400 W. As a result, we could not use the parameters for optimal density that were available in the literature, as these were for machines with lower powers of <225 W and larger beam sizes of D4σ ≈ 120 µm. Given the large range of laser power (0–400 W) for our machine, we realized that a design of experiments approach would require a large number of samples to fully explore the design space, making such an approach prohibitively expensive. We therefore needed an alternative that would help us to determine the optimal parameters for our machine efficiently.

Figure 7.2 illustrates the systematic approach we devised that combines computer simulations and experiments. The approach is an iterative one. Starting with a densely-sampled design space of parameters, we run simple, and relatively inexpensive, simulations and experiments to progressively narrow the space of parameters as we move towards more expensive and accurate simulations and experiments.
In each cycle, we have a set of samples that span the space of interest, which is the space of input SLM parameters. We run the experiments and/or simulations at the sample points, extract the characteristics of interest (such as the melt-pool characteristics or the density), and analyze the data that relate the sample points to the characteristics of interest. This analysis could include visualization using scatter plots or parallel-coordinate plots [8], feature selection to identify important parameters, building surrogate models for prediction, and uncertainty quantification to find regions that are less sensitive to minor changes in the parameters. As a result of this analysis, we identify a subset of samples that meet our requirements. We then perform more complex simulations and experiments at these sample points, and iterate until we have obtained the desired results.

Fig. 7.2 Schematic illustrating the iterative process that combines simulations and experiments to reduce the time and costs to determine optimal density parameters

This iterative approach has several benefits. First, by starting with simple simulations and experiments, we can quickly and efficiently identify which regions of the design space are viable and which are unlikely to result in melt pools that are deep enough so that a part can be built. This is particularly relevant when we are working with materials that may not have been additively manufactured before, or with machines with different process parameters, or with powders with different size distributions. Second, the large number of parameters that have to be set in laser powder-bed fusion implies that we need to identify sample points in a high-dimensional space, where the dimension of the space is the number of parameters. To span a space adequately, the number of samples we need is exponential in the dimension. This makes it prohibitively expensive to start exploring the entire space by building complex parts. Starting with simpler experiments and simulations allows us to lower the cost of exploring the space of parameters more fully, thus increasing the chance of finding all sets of parameters that yield desired properties. Third, the iterative approach enables us to progressively build larger samples and perform more complex simulations, while building on what we have learned from simpler experiments and simulations. Finally, by using data mining techniques to analyze the data from the simulations and experiments at each step, we can fully exploit the data we do collect and better guide the next set of experiments and simulations.

We next describe how we used this approach to identify process parameters for high-density 316L stainless steel. We have also successfully applied this approach to create parts with >99 % density for other materials, and the ideas can be extended to other properties of a part as well.
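To make the structure of this loop concrete, a minimal sketch in Python is given below. The helper names (run_model, is_viable) and the list of fidelity levels are hypothetical stand-ins for the simulations, experiments, and analysis steps described above; this is only a schematic of the control flow, not the implementation used in this work.

```python
# Schematic of the iterative narrowing loop of Fig. 7.2.
# run_model and is_viable are hypothetical placeholders for the simulations/experiments
# and the analysis (visualization, feature selection, surrogates) described in the text.
def iterative_down_selection(samples, fidelity_levels, is_viable):
    """Progressively narrow a set of SLM parameter samples across fidelity levels,
    starting with cheap models and ending with expensive builds."""
    for run_model in fidelity_levels:
        results = [run_model(s) for s in samples]        # e.g., melt-pool depth or density
        samples = [s for s, r in zip(samples, results) if is_viable(r)]
        if not samples:
            raise RuntimeError("No viable parameters remain; revisit the design space.")
    return samples                                       # candidates for the final, expensive step
```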
7.3.1 Using Simple Simulations to Identify Viable Parameters

To identify the viable range of process parameters, we started with the very simple Eagar-Tsai (E-T) model [3] to determine under what conditions we would obtain melt pools that were deep enough to melt a layer of powder and the substrate below. E-T considers a Gaussian beam on a flat plate to describe conduction-mode laser melting. The resulting temperature distribution is then used to compute the melt-pool width, depth, and length as a function of four parameters—laser power, laser speed, beam size, and laser absorptivity of the powder.

Note that the E-T model does not directly relate the process parameters to the density of a part. Further, it does not consider powder other than the effect of powder on absorptivity, so its results provide only an estimate of the melt-pool characteristics. However, we found that this estimate was sufficient to guide the next steps in our work. In addition, the simplicity of the model made it computationally inexpensive, taking ≈1 min to run on a laptop. This allowed us to use the E-T model to sample the input parameter space rather densely, ensuring that we considered all possible viable cases.

7.3.1.1 Sampling the Design Space

We used a full factorial design of computer experiments [4, 17] to explore the four-parameter input space. This method divides the range of each parameter into several levels, and then randomly selects a point in each cell. We varied the speed from 50 to 2250 mm/s with 10 levels, the power from 50 to 400 W using 7 levels, the beam size (D4σ) from 50 to 68 µm using 3 levels, and the laser absorptivity from 0.3 to 0.5 using 2 levels. This resulted in 462 parameter combinations that were input to our simulation.

The range of values for each variable was selected as follows. Our CL20 machine had a peak power of 400 W, which determined the upper bound on the power. The lower limit on the speed was set to ensure sufficient melting at the low power values such that the melt-pool depth would be at least 30 µm (the layer thickness selected for our experiments). The upper limit on the speed was estimated at a value that would likely result in a relatively shallow melt pool at the high power value. The lower and upper limits on the beam size were obtained from measurements of the beam size on our machine at focus offsets of 0 and 1 mm. By varying the beam size and the absorptivity, we were able to account for possible variations in these parameters over time or build conditions as we built the parts.

7.3.1.2 Selecting Important Input Parameters

Having identified the sample points in the four-dimensional space of laser power, laser speed, beam size, and laser absorptivity, we then ran the E-T simulations at these sample points and obtained the melt-pool width, depth, and length. This output from the simulations was analyzed in several different ways. In earlier work [10], we showed how we can use parallel-coordinate plots [8] and feature selection methods from data mining [9] to identify input variables that are more relevant to the melt-pool characteristics. We use the term "feature" to refer to variables, such as the input parameters, that describe a simulation. The feature selection methods we used were designed for problems with discrete data, so we first discretized the continuous input and output variables before applying the method. Since the results could potentially depend on the discretization used, in this chapter, we consider two methods that work directly with the continuous variables.

The Correlation-based Feature Selection (CFS) method [6] is a simple approach that calculates a figure of merit for a feature subset of k features as

Merit = \frac{k \, \bar{r}_{cf}}{\sqrt{k + k(k-1)\, \bar{r}_{ff}}}    (7.1)

where \bar{r}_{cf} is the average feature-output correlation and \bar{r}_{ff} is the average feature-feature correlation. We use the Pearson correlation coefficient between two vectors, X and Y, defined as

\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}    (7.2)

where Cov(X, Y) is the covariance between the two vectors and σ_X is the standard deviation of X.
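A minimal sketch of (7.1) and (7.2) on synthetic data follows; the mock inputs and response, and the use of absolute correlations in the averages, are illustrative assumptions rather than the actual implementation used in this chapter.

```python
import numpy as np
from itertools import combinations

def pearson(x, y):
    # Eq. (7.2): Cov(X, Y) / (sigma_X * sigma_Y)
    return np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))

def cfs_merit(X, y, subset):
    """Eq. (7.1) for a subset of feature columns; absolute values are used so that
    strongly negative correlations also count as relevant."""
    k = len(subset)
    r_cf = np.mean([abs(pearson(X[:, j], y)) for j in subset])
    r_ff = 0.0 if k == 1 else np.mean(
        [abs(pearson(X[:, i], X[:, j])) for i, j in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Mock stand-in for the 462 Eagar-Tsai runs: 4 inputs and a synthetic melt-pool depth.
rng = np.random.default_rng(0)
X = rng.uniform(size=(462, 4))                  # speed, power, beam size, absorptivity
y = -2.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=462)
for k in range(1, 5):                           # best subset of each size
    best = max(combinations(range(4), k), key=lambda s: cfs_merit(X, y, s))
    print(k, best, round(cfs_merit(X, y, best), 3))
```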
A higher value of Merit results when the subset of features is such that they have a high correlation (\bar{r}_{cf}) with the output and a low correlation (\bar{r}_{ff}) among themselves.

In the second feature selection method, the features are ranked using the mean squared error (MSE) as a measure of the quality of a feature [1]. This metric is used in regression trees (see Sect. 7.3.1.3) to determine which feature to use to split the samples at a node of the tree. Given a numeric feature x, the feature values are first sorted (x_1 < x_2 < ... < x_n). Then, each intermediate value, (x_i + x_{i+1})/2, is proposed as a splitting point, and the samples split into two depending on whether the feature value of a sample is less than the splitting point or not. The MSE for a split A is defined as

MSE(A) = p_L \, s(t_L) + p_R \, s(t_R)    (7.3)

where t_L and t_R are the subsets of samples that go to the left and right, respectively, by the split based on A, p_L and p_R are the proportions of samples that go to the left and right, and s(t) is the standard deviation of the N(t) output values, c_i, of the samples in the subset t:

s(t) = \sqrt{\frac{1}{N(t)} \sum_{i=1}^{N(t)} \big(c_i - \bar{c}(t)\big)^2}    (7.4)

For each feature, the minimum MSE across the values of the feature is obtained and the features are rank ordered by increasing values of their minimum. This method considers a feature to be important if it can split the data set into two, such that the standard deviation of the samples on either side of the split is minimized, that is, the output values are relatively similar on each side. Note that unlike CFS, which considers subsets of features, this method considers each feature individually.

Table 7.1 presents the ordering of subsets of input features by importance for the melt-pool width, length, and depth obtained using the CFS method. A noise feature was added as another input; this is consistently ranked as the least important variable, as might be expected. The table indicates that for the melt-pool depth and width, the single most important input is the speed, while the top two most important inputs are the speed and power. In contrast, for the length of the melt pool, the most important single input is the power, while the top two most important inputs are power and absorptivity.

Table 7.1 Rank order of subsets of the input parameters to the Eagar-Tsai model using the CFS filter

                    Speed   Power   Beam size   Absorptivity   Noise
Melt-pool width       5       4         2             3          1
Melt-pool length      3       5         2             4          1
Melt-pool depth       5       4         2             3          1

A higher rank indicates a more relevant input; to select the best subset of k features, select the k features with the highest ranks.

Table 7.2 Rank order of subsets of the input parameters to the Eagar-Tsai model using the MSE filter

                    Speed   Power   Beam size   Absorptivity   Noise
Melt-pool width       5       4         2             3          1
Melt-pool length      3       5         2             4          1
Melt-pool depth       5       4         1             3          2

A higher rank indicates a more relevant input.

Table 7.2 presents the results for the MSE filter. These are very similar to the CFS filter, with the exception that the beam size is ranked lower than the noise variable for the depth of the melt pool. For all three melt-pool characteristics, the three lowest ranked variables have roughly the same MSE value, so the corresponding three variables have roughly the same order of importance. Given these results, since the depth and width are the most important melt-pool characteristics, we decided to investigate the effects of the two most important inputs—laser power and speed—on these characteristics.
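A minimal sketch of the MSE filter of (7.3) and (7.4) is given below, again on synthetic data; the variable names and the mock response are assumptions for illustration only.

```python
import numpy as np

def best_split_score(x, y):
    """Minimum of Eq. (7.3) over all candidate split points of a single feature,
    with s(t) the standard deviation of the outputs in each subset (Eq. 7.4)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = np.inf
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue                                   # no split between identical values
        left, right = ys[:i + 1], ys[i + 1:]
        p_l, p_r = len(left) / len(ys), len(right) / len(ys)
        best = min(best, p_l * np.std(left) + p_r * np.std(right))
    return best

def rank_features(X, y, names):
    scores = {n: best_split_score(X[:, j], y) for j, n in enumerate(names)}
    return sorted(scores, key=scores.get)              # lowest score = most important

rng = np.random.default_rng(1)
X = rng.uniform(size=(462, 5))                         # speed, power, beam size, absorptivity, noise
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=462)
print(rank_features(X, y, ["speed", "power", "beam size", "absorptivity", "noise"]))
```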
While our simple simulations relate just four inputs to the melt-pool characteristics, we expect that as we move to more complex simulations, feature selection and other dimension reduction techniques will become more useful in helping us to focus on the important variables, potentially limiting the number of experiments or simulations required to create parts with desired properties.

7.3.1.3 Data-Driven Predictive Modeling

The simulation inputs and outputs can also be used to build a data-driven predictive model that can be used to predict the output values for a given set of inputs. A simple predictive model is a regression tree [1], which is similar to a decision tree, but with a continuous instead of a discrete output. A regression tree is a structure that is either a leaf, indicating a continuous value, or a decision node that specifies some test to be carried out on a feature, with a branch and sub-tree for each possible outcome of the test. If the feature is continuous, there are two branches, depending on whether the condition being tested is satisfied or not. The decision at each node of the tree is made to reveal the structure in the data. Regression trees tend to be relatively simple to implement, yield results that can be interpreted, and have built-in dimension reduction.

Regression algorithms typically have two phases. In the training phase, the algorithm is "trained" by presenting it with a set of examples with known output values. In the test phase, the model created in the training phase is tested to determine how accurately it performs in predicting the output for known examples. If the results meet expected accuracy, the model can be put into operation to predict the output for a sample point, given its inputs.

The test at each node of a regression tree is determined by examining each feature and finding the split that optimizes an impurity measure. We use the mean-squared error, MSE, as defined in Sect. 7.3.1.2, as the impurity measure. The split at each node of the tree is chosen as the one that minimizes the MSE across all features for the samples at that node. To avoid splitting the tree too finely, we stop the splitting if the number of samples at a node is less than 10 or the standard deviation of the values of the output variable at a node has dropped below 10 % of the standard deviation of the output variable of the original data set.

The regression tree acts as a surrogate for the data from the E-T simulations and can be used to predict the width, depth and length of the melt pool for a given set of inputs. The inputs for a sample point are used to traverse the tree, following the decision at each node, until a leaf node is reached; the predicted value assigned to the sample is the mean of the output values of the training data that end up at that leaf node.

Figure 7.3 shows the melt-pool depth for the E-T simulations predicted by the regression tree vs. the actual depth from the simulations. The predicted value for each sample point was obtained by creating a regression tree with all other sample points and using it to predict the melt-pool depth for the given sample point.

Fig. 7.3 Plot of predicted versus actual melt-pool depth (in micron). The predicted value for each sample point in the E-T simulations was obtained using a regression tree built with the rest of the sample points. The actual depth is obtained from the simulations. The blue line is the y = x curve
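A minimal scikit-learn sketch of such a surrogate and its leave-one-out evaluation is shown below on synthetic data; the mock response, the choice of DecisionTreeRegressor, and the omission of the standard-deviation stopping rule are simplifications, not the implementation used in this chapter.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X = rng.uniform(size=(462, 4))                                         # mock E-T inputs
depth = 50 + 200 * X[:, 1] * (1 - X[:, 0]) + 5 * rng.normal(size=462)  # mock melt-pool depth

# Surrogate: nodes with fewer than 10 samples are not split further; the
# 10 %-of-standard-deviation stopping rule described above is not reproduced
# by this off-the-shelf tree.
tree = DecisionTreeRegressor(min_samples_split=10, random_state=0)

# Leave-one-out prediction, as in Fig. 7.3: each point is predicted by a tree
# trained on all of the other sample points.
pred = cross_val_predict(tree, X, depth, cv=LeaveOneOut())
pct_dev = np.mean(np.abs((depth - pred) / depth)) * 100
print(f"mean percentage deviation: {pct_dev:.1f} %")
```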
The percentage deviation for the entire data set is 11.2 %. This is obtained by taking the average over all sample points of the absolute value of the ratio of the residual to the actual value. The residual is the difference between the actual and predicted values.

The accuracy of the regression tree depends on the number and location of the sample points, as well as the complexity of the function being modeled. If there are too few sample points, or they are in the wrong location, then the prediction can be poor, especially if the function being predicted is quite complex. For our set of simulations, the accuracy obtained is reasonable, though it could be improved further by adding new sample points in appropriate locations or by using ensembles of regression trees. In comparison with the E-T simulations, where each simulation takes ≈1 min on a laptop, it takes a few microseconds to build the regression tree surrogate from the 462 simulations and practically no time to generate the melt-pool depth for a set of input variables using the surrogate.

7.3.2 Using Simple Experiments to Evaluate Simulation Results

We next considered some simple single-track experiments [20] to evaluate the findings from our simulations. In these experiments, a single layer of powder is spread on a plate and a single track created at a specific laser power and speed. The powder is then removed, and the plate cut so that the cross-section of the track can be obtained and the melt-pool characteristics measured. Based on prior work, we had decided to use a powder layer thickness of 30 µm, as this had resulted in the highest density in experiments with 316L powder [21]. The layer thickness is the amount by which the build platform is lowered in each layer of the build. Since the powder is porous, its height decreases when it melts. Therefore, the next layer of powder has a depth greater than the set value of layer thickness. Due to the shrinkage on melting, the initial layers of powder are progressively deeper, until the thickness of the powder reaches a steady state that is determined by the amount of shrinkage of the powder on melting.

When we translate the results of the E-T model to single-track experiments, we need to account for the fact that the simulations are just an approximation and there is no powder considered in the model. So the melt-pool depth from the E-T model should be sufficiently large compared with the thickness of the powder in the experiment to ensure that the substrate melts as well. We therefore focused on the simulations that gave a melt-pool depth of two to three times the set layer thickness. Note that this factor is just an approximation that helps us to constrain the range of parameters. In addition to avoiding process parameters that resulted in relatively shallow melt pools, we also wanted to avoid those that gave very deep melt pools. Not only would this have been wasteful, but a high energy density would have resulted in the process going from conduction-mode melting to keyhole-mode melting, resulting in voids that would have introduced porosity into the part [12].

Fig. 7.4 The 40 mm × 40 mm tilted build plate with the 14 tracks, each generated using a different value of laser power and scan speed, as listed in Table 7.3, where track 1 corresponds to the track at the top of the plate.
The layer thickness is near zero at the left edge of the plate, increasing linearly to 200 µm at the right edge. The plate is cut vertically to analyze the melt-pool cross-section at a specific layer thickness.

Using the results from the E-T simulations, we identified fourteen power and speed combinations that we used to create tracks on a tilted plate as shown in Fig. 7.4. This 40 mm × 40 mm build plate has a tilt so that the layer thickness is 0 at the left and 200 µm at the right, enabling us to evaluate the effect of the process parameters at different layer thicknesses.

Table 7.3 presents the melt-pool characteristics at a layer thickness of 30 µm.

Table 7.3 The melt-pool width, height, and depth for the 14 tracks, along with the laser power and scan speed values

Track number   Power (W)   Speed (mm/s)   Width (µm)   Height (µm)   Depth (µm)
1              400         1800           112          32            105
2              400         1500           103          79            119
3              400         1200            83          28            182
4              300         1800            94          57             65
5              300         1500            83          35             94
6              300         1200           111          76            114
7              300          800           118          54            175
8              200         1500            84          26             57
9              200         1200           104          45             68
10             200          800           123          24            116
11             200          500           121          61            195
12             150         1200            79          21             30
13             150          800           109          44             67
14             150          500           115          40            120

Track 1 corresponds to the track at the top of the plate in Fig. 7.4. Powder layer thickness is 30 µm.

The results for the melt-pool depth are very consistent, with higher laser powers and lower speeds resulting in deeper melt pools. In addition, we observe that as the laser speed reduces, the tracks become more complete, melting more of the powder at the deeper layer thicknesses. This can be clearly seen in the three tracks at the bottom of the plate in Fig. 7.4 where, as the speed reduces from 1200 to 500 mm/s at 150 W, more of the powder melts, resulting in a complete track. These results also indicate that we have several tracks where the depth is between two and three times the layer thickness of 30 µm, making these likely process parameters for further investigation.
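As a small illustration of this down-selection, the snippet below filters the (power, speed) pairs of Table 7.3 by the two-to-three-times-layer-thickness guideline; treating the band as a strict interval is an assumption made here for clarity, since the chapter uses the factor only as an approximate guide.

```python
# (track, power W, speed mm/s, depth um) from Table 7.3
tracks = [
    (1, 400, 1800, 105), (2, 400, 1500, 119), (3, 400, 1200, 182), (4, 300, 1800, 65),
    (5, 300, 1500, 94),  (6, 300, 1200, 114), (7, 300, 800, 175),  (8, 200, 1500, 57),
    (9, 200, 1200, 68),  (10, 200, 800, 116), (11, 200, 500, 195), (12, 150, 1200, 30),
    (13, 150, 800, 67),  (14, 150, 500, 120),
]
layer = 30  # um
candidates = [t for t in tracks if 2 * layer <= t[3] <= 3 * layer]
print(candidates)   # tracks 4, 9 and 13 fall strictly within the 60-90 um band
```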
7.3.3 Determining Density by Building Small Pillars

Thus far we have shown how we can use simple simulations to constrain the parameter space over which we perform simple experiments. These simple experiments, in turn, enable us to constrain the space over which we perform more complex experiments, which, in our case, are building small pillars to evaluate the density.

There are many factors that control the density of an additively-manufactured part. Factors such as the laser power, speed, beam size, and powder layer thickness control the density locally. So, we select the values of these parameters to ensure that the powder melts and sticks to the substrate, without leaving any un-melted powder particles that may lead to porosity. There are other factors that could also introduce porosity, such as the scan-line or hatch spacing, which controls the distance between adjacent scan lines, and the scanning strategy. For example, if adjacent scans do not overlap sufficiently, powder will accumulate in the space in-between the tracks, potentially causing porosity in the part if the laser parameters are not sufficient to melt this powder in subsequent scans. The use of island scanning could also result in porosity. Here, instead of creating each layer with a series of continuous scans, the region is divided into small "islands" that are scanned randomly [21, 22]. To ensure that the islands are connected, that is, there are no gaps created in-between adjacent islands, each island is scanned such that the scan vectors slightly overlap the surrounding islands. If the amount of overlap is set too small, this could introduce porosity in the part.

To identify process parameters that would result in high-density parts, we built small pillars, 10 mm × 10 mm × 8 mm high, using a variety of power and speed combinations. We used island scanning, with 5 mm × 5 mm islands. All other parameters were set to the default, as summarized in our earlier work [10]. The power and speed values for our initial set of twenty-four pillars were chosen based on the results from the single-track experiments. We then evaluated the density of these pillars using the Archimedes method. Based on the results, we built another set of twenty-four pillars at the same power values as the first set, but with the speed values chosen to complete the density curves.

Fig. 7.5 Relative density as a function of laser power and scan speed. Plot (b) excludes the values for power = 150 W to illustrate the variation at high density. A quadratic function is fitted to the points for each power value. (a) Density for 48 316L pillars using CL powder; power 150–400 W. (b) Density for 40 316L pillars using CL powder; power 200–400 W

7.4 Experimental Results

Figure 7.5 shows the relative density of the forty-eight pillars for a range of power and speed values. We make several observations. First, we were able to use our approach to create pillars with >99 % relative density for power values ranging from 150 to 400 W. Second, as expected, we found that for a given power value, increasing the speed leads to insufficient melting and lower density. The density also reduces at low speed due to voids resulting from keyhole-mode laser melting; this reduction is, however, not as large as the reduction due to insufficient melting. Finally, we found that at higher powers, the density is high over a wider range of scan speeds, unlike at lower powers. This indicates that higher powers could provide greater flexibility in choosing process parameters that optimize various properties of a manufactured part. However, it remains to be seen if operating at higher powers will have other negative effects on the microstructure or mechanical properties of a part.

7.5 Summary

In this chapter, we showed how we can use techniques from statistics and data mining to reduce the time and cost of determining process parameters that lead to high-density, additively-manufactured parts. Specifically, we used design of computational experiments to understand the design space of input parameters using simple simulations, feature selection to identify important inputs, and data-driven surrogates for predictive modeling. We then built small pillars at various combinations of laser power and speed. Our experiences with 316L stainless steel and other materials indicate that our approach is a viable and cost-effective alternative to finding optimal parameters through extensive experimentation.

Acknowledgments The author acknowledges the contributions of Wayne King (Eagar-Tsai model), Paul Alexander (operation of the Concept Laser M2), and Mark Pearson and Cheryl Evans (metallographic preparation, measurement, and data reporting). LLNL-MI-667267: This work was performed under the auspices of the U.S.
Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-SI-002.

References

1. L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (CRC Press, Boca Raton, 1984)
2. J. Delgado, J. Ciurana, C.A. Rodriguez, Influence of process parameters on part quality and mechanical properties for DMLS and SLM with iron-based materials. Int. J. Adv. Manuf. Technol. 60, 601–610 (2012)
3. T.W. Eagar, N.S. Tsai, Temperature-fields produced by traveling distributed heat-sources. Weld. J. 62, S346–S355 (1983)
4. K.-T. Fang, R. Li, A. Sudjianto, Design and Modeling for Computer Experiments (Chapman and Hall/CRC Press, Boca Raton, 2005)
5. A.V. Gusarov, I. Yadroitsev, Ph. Bertrand, I. Smurov, Model of radiation and heat transfer in laser-powder interaction zone at selective laser melting. J. Heat Transf. 131, 072101 (2009)
6. M.A. Hall, Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the 17th International Conference on Machine Learning (Morgan Kaufmann, San Francisco, 2000), pp. 359–366
7. N.E. Hodge, R.M. Ferencz, J.M. Solberg, Implementation of a thermomechanical model for the simulation of selective laser melting. Comput. Mech. 54, 33–51 (2014)
8. A. Inselberg, Parallel Coordinates: Visual Multidimensional Geometry and Its Applications (Springer, New York, 2009)
9. C. Kamath, Scientific Data Mining: A Practical Perspective (Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 2009)
10. C. Kamath, B. El-dasher, G.F. Gallegos, W.E. King, A. Sisto, Density of additively-manufactured, 316L SS parts using laser powder-bed fusion at powers up to 400 W. Int. J. Adv. Manuf. Technol. 74, 65–78 (2014)
11. S.A. Khairallah, A. Anderson, Mesoscopic simulation model of selective laser melting of stainless steel powder. J. Mater. Process. Technol. 214, 2627–2636 (2014)
12. W.E. King, H.D. Barth, V.M. Castillo, G.F. Gallegos, J.W. Gibbs, D.E. Hahn, C. Kamath, A.M. Rubenchik, Observation of keyhole-mode laser melting in laser powder-bed fusion additive manufacturing. J. Mater. Process. Technol. 214, 2915–2925 (2014)
13. J.P. Kruth, M. Badrossamay, E. Yasa, J. Deckers, L. Thijs, J. Van Humbeeck, Part and material properties in selective laser melting of metals. In Proceedings of the 16th International Symposium on Electromachining (ISEM XVI), Shanghai, China, 2010
14. A. Laohaprapanon, P. Jeamwatthanachai, M. Wongcumchang, N. Chantarapanich, S. Chantaweroad, K. Sitthiseripratip, S. Wisutmethangoon, Optimal scanning condition of selective laser melting processing with stainless steel 316L powder. Material and Manufacturing Technology II, Pts 1 and 2 (Trans Tech Publications Ltd., Stafa-Zurich, 2012), pp. 816–820
15. Y. Li, D. Gu, Parametric analysis of thermal behavior during selective laser melting additive manufacturing of aluminum alloy powder. Mater. Des. 63, 856–867 (2014)
16. National Institute of Standards and Technology, Measurement Science Roadmap for Metal-Based Additive Manufacturing, Technical Report, 2013
17. G.W. Oehlert, A First Course in Design and Analysis of Experiments (W.H. Freeman, 2000). http://users.stat.umn.edu/~gary/Book.html
18. A.B. Spierings, G. Levy, Comparison of density of stainless steel 316L parts produced with selective laser melting using different powder grades. In Twentieth Annual International Solid Freeform Fabrication Symposium, An Additive Manufacturing Conference, ed. by D. Bourell (University of Texas at Austin, Austin, 2009), pp. 342–353
19. I. Yadroitsev, Selective Laser Melting: Direct Manufacturing of 3D-Objects by Selective Laser Melting of Metal Powders (LAP Lambert Academic Publishing, 2009)
20. I. Yadroitsev, A. Gusarov, I. Yadroitsava, I. Smurov, Single track formation in selective laser melting of metal powders. J. Mater. Process. Technol. 210, 1624–1631 (2010)
21. E. Yasa, Manufacturing by combining selective laser melting and selective laser erosion/laser re-melting. Ph.D. thesis, Faculty of Engineering, Department of Mechanical Engineering, Katholieke Universiteit Leuven, Heverlee (Leuven), 2011. Available from Katholieke Universiteit Leuven
22. E. Yasa, J. Deckers, J.P. Kruth, M. Rombouts, J. Luyten, Investigation of sectoral scanning in selective laser melting. In Proceedings of the ASME 10th Biennial Conference on Engineering Systems Design and Analysis, vol. 4 (2010), pp. 695–703

Chapter 8 Optimal Dopant Selection for Water Splitting with Cerium Oxides: Mining and Screening First Principles Data

V. Botu, A.B. Mhadeshwar, S.L. Suib and R. Ramprasad

Abstract We propose a powerful screening procedure, based on first principles computations and data analysis, to systematically identify suitable dopants in an oxide for the thermochemical water splitting process. The screening criteria are inspired by Sabatier's principle, and are based on requirements placed on the thermodynamics of the elementary steps. Ceria was chosen as the parent oxide. Among the 33 dopants across the periodic table considered, Sc, Cr, Y, Zr, Pd and La are identified to be the most promising ones. Experimental evidence exists for the enhanced activity of ceria for water splitting when doped with Sc, Cr and Zr. The surface oxygen vacancy formation energy is revealed as the primary descriptor correlating with enhanced water splitting performance, while the dopant oxidation state in turn primarily governs the surface oxygen vacancy formation energy. The proposed screening strategy can be readily extended for dopant selection in other oxides for different chemical conversion processes (e.g., CO2 splitting, chemical looping, etc.).

V. Botu, Department of Chemical and Biomolecular Engineering, University of Connecticut, Storrs, CT 06269, USA; e-mail: venkatesh.botu@uconn.edu
A.B. Mhadeshwar, Center for Clean Energy and Engineering, University of Connecticut, Storrs, CT 06269, USA; present address: ExxonMobil Research and Engineering, Annandale, NJ 08801, USA
S.L. Suib, Department of Chemistry and Institute of Materials Science, University of Connecticut, Storrs, CT 06269, USA
R. Ramprasad, Institute of Materials Science and Department of Materials Science and Engineering, University of Connecticut, Storrs, CT 06269, USA; e-mail: rampi@ims.uconn.edu

8.1 Introduction

The use of dopants to optimize, enhance, or fundamentally change the behavior of a parent material has been exploited in many situations ranging from material strengthening to electronics to electrochemistry.
The search for and identification of suitable dopant candidates has been laborious though, and dominated either by lengthy trial-and-error strategies (guided by intuition) or plain serendipity. We are entering an era where such Edisonian approaches are gradually being augmented (and sometimes replaced) by rational strategies based on advanced computational screening [1]. Often these strategies rely on first principles methods, which provide a reasonably accurate description of the underlying chemistry [2–4]. More recently, it has been shown that supplementing first principles investigations with data-driven approaches can help identify meaningful correlations within the data [5–13]. In the present contribution, we offer such a prescription for the selection of suitable dopants within cerium oxides in order to enhance the thermochemical splitting of water.

Complete gas phase thermolysis of water is highly endothermic (ΔH = +2.53 eV), requiring temperatures in excess of 4000 K to be thermodynamically favorable and making such reactions unviable for H2 synthesis [14, 15]. On the other hand, partial thermolysis via a multistep process in the presence of metal oxide (MO) catalysts provides an attractive practical alternative [15, 16]. The latter approach is performed at two distinct temperatures (both well below 4000 K): a high-temperature (≈2200 K) reduction step that involves the creation of O vacancies in the MO (and the consequent evolution of O2 gas), and lower-temperature (≈900 K) oxidation steps in the presence of steam, which lead to the filling up of the O vacancy centers (resulting in the evolution of H2 gas). Owing to this multistep procedure, an additional step to separate the H2 and O2 products is eliminated entirely. Equations (8.1)–(8.3) below represent a reordered version (for ease of subsequent discussion) of the multiple steps involved in this process.

MO-Vo(s) + H2O(g) → MO-(H)(H)(s)    (8.1)
MO-(H)(H)(s) → MO(s) + H2(g)    (8.2)
MO(s) → MO-Vo(s) + (1/2) O2(g)    (8.3)

The (s) and (g) subscripts represent solid and gas phases, respectively. Equations (8.1) and (8.2) are the low-temperature steps, with MO-Vo and MO-(H)(H) representing, respectively, the oxide containing an O vacancy and the oxide in which the O vacancy is filled up by a H2O molecule (with '(H)(H)' indicating that the H atoms of H2O are adsorbed on the oxide surface). Equation (8.3) is the high-temperature activation step that leads to the creation of MO-Vo. Unfortunately, several MOs require temperatures in excess of 2700 K (leading to poor H2 production efficiencies), leaving only a subset of oxides based on Zn, Fe and Ce as the most promising [17, 18]. Oxides of Zn and Fe are prone to sintering, phase transformation or volatility due to the proximity of the high-temperature step to their melting points [19]. CeO2, on the other hand, displays high stability and a high melting temperature (≈2600 K), and is thus overwhelmingly favored [17]. Still, the efficiency of H2 production with CeO2 is quite low (<1 %) [18].
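As a worked aside, summing (8.1)–(8.3) cancels the solid-phase intermediates and recovers the overall splitting reaction, so the energies of the three steps (denoted E_1, E_2 and E_3 below) are constrained to add up to the fixed overall enthalpy:

```latex
\mathrm{H_2O_{(g)}} \;\longrightarrow\; \mathrm{H_{2(g)}} + \tfrac{1}{2}\,\mathrm{O_{2(g)}},
\qquad E_1 + E_2 + E_3 = \Delta H \approx +2.53\ \mathrm{eV}.
```

Doping can therefore only redistribute the energetics among the three steps, not change their sum, a point that becomes important in the screening criteria below.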
This low efficiency is rooted in the high temperatures (>1900 K) required for the reduction step (8.3), related directly to the large O vacancy formation energy of CeO2, along with other operational difficulties [18, 20]. Figure 8.1 shows the energies E_1, E_2 and E_3 of (8.1), (8.2) and (8.3), respectively, computed here using density functional theory (DFT) (details below), and helps identify the causes of the low efficiency.

Fig. 8.1 Reaction pathway and energetics (red solid line) for the dissociation of H2O on an undoped ceria surface. CeO2-Vo is an oxide with a vacancy, CeO2-(H)(H) is an oxide with the vacancy filled by a H2O molecule, and CeO2 is a stoichiometric surface. The green dotted line shows the minimum energy pathway for dissociation. Ce, O and H are represented by beige, red and white colors, respectively

The dotted line indicates the uphill nature of the water splitting process. The ideal system should display E_1 and E_2 close to zero (for facile H2 evolution at low temperatures), and small E_3 values (to alleviate the burden on the reduction step). In the case of CeO2, E_1 is too negative and E_3 is too positive. A pathway to circumvent these hurdles is to control the energetics of (8.1)–(8.3) individually by the introduction of dopants (although, of course, the overall energetics of H2O splitting cannot be altered). For instance, this strategy may be used to destabilize O in CeO2 (and thus reduce the O vacancy formation energy) [17, 21–27]. Doping CeO2 with a plethora of elements has been explored in the recent past [28–40], and many dopants (e.g., Zr, Cr, Sc) have been shown to help significantly increase the efficiency of H2 production by reducing the temperatures required to accomplish (8.3) [32, 34, 35]. Nevertheless, a clear rationale for why a given dopant is desirable, and a framework for the systematic (non-Edisonian) selection of dopants, are currently unavailable. This work attempts to fill that gap. First, we propose a framework to systematically screen for dopants, based on guidelines inspired by Sabatier's principle, then we identify the best candidates using first principles methods, and finally we use data analysis methods, specifically feature selection, to identify the primary factors that make these dopants attractive.

8.2 Screening Framework

In the present first principles/data-driven work, we consider a host of dopants in CeO2, including 33 elements spanning the 4th, 5th and 6th periods of the Periodic Table (specifically the alkali, alkaline earth and d series elements). Assuming that the energetics of (8.1)–(8.3) determine whether a dopant is favorable or not, we define the following screening criteria, to be used in a successive manner:

• Criterion 1: 0 ≤ E_3^D ≤ E_3
• Criterion 2: 0 ≤ E_1^D ≤ δ
• Criterion 3: 0 ≤ E_1^D + E_2^D ≤ δ

The superscript D merely indicates that these are the energetics of doped ceria. The rationale underlying this specific choice and sequence of screening criteria stems from insights derived from Sabatier's principle, and may be understood as follows (cf. Fig. 8.1). Criterion 1 merely states that the O vacancy formation energy (which is what E_3^D represents) should not be so small as to prevent further water dissociation, nor so large (certainly not larger than that of undoped ceria, E_3) as to mandate higher activation temperatures. This criterion is listed first because E_3^D appears to most strongly control the temperature requirement of the costly high-temperature step, and also because E_3^D is the easiest quantity to compute (as it does not involve the H2O species at all). Criterion 2 states that E_1^D should also be bracketed, but by a smaller range. Noting that the overall dissociation of water for undoped ceria is too negative (see Fig. 8.1), thus potentially adding an energy penalty to subsequent steps, we generously allow δ to be 1.5 eV, which is a reasonable choice considering energy uncertainties within DFT and the neglect of entropy. Criterion 3 is specific to thermochemical water splitting and bounds the overall oxidation process within δ, ensuring that E_1^D or E_2^D occur at a lower temperature compared to E_3^D. In the case where this no longer holds, the process fails to fall within the realm of thermochemical water splitting.
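A minimal sketch of this successive screen is given below, assuming the three step energies for each dopant have already been computed; the dictionary of energies is a hypothetical placeholder, not the data set of this chapter.

```python
# Successive application of Criteria 1-3; all energies in eV.
E3_UNDOPED = 3.3   # O vacancy formation energy of undoped ceria (value quoted later in the chapter)
DELTA = 1.5        # tolerance adopted for Criteria 2 and 3

def passes_screen(e1, e2, e3):
    crit1 = 0.0 <= e3 <= E3_UNDOPED
    crit2 = 0.0 <= e1 <= DELTA
    crit3 = 0.0 <= e1 + e2 <= DELTA
    return crit1 and crit2 and crit3

# energies = {"Sc": (e1, e2, e3), "Cr": (...), ...}   # hypothetical DFT results per dopant
# promising = [d for d, (e1, e2, e3) in energies.items() if passes_screen(e1, e2, e3)]
```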
8.3 First Principles Studies

8.3.1 Methods and Models

To measure the thermodynamic quantity E_i^D, where i refers to step (8.1), (8.2) or (8.3), DFT calculations were performed using the VASP code with the semi-local Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional and a cutoff energy of 400 eV to accurately treat the valence O 2s, 2p and Ce 5s, 5p, 4f, 5d, 6s states [41–43]. The electron-core interactions were captured by projector-augmented wave (PAW) potentials, and all calculations were spin polarized to ensure that the true electronic state of O and reduced Ce was captured [44]. The computed lattice parameter of bulk CeO2 (5.47 Å) is in good agreement with the corresponding experimental value (5.41 Å) [38]. A 96-atom bulk 2×2×2 supercell model and a 60-atom (2×2) surface model (5 O-Ce-O trilayers) cleaved along the (111) plane were used in all calculations. The bottom 3 trilayers of the slab were fixed to recover the bulk nature of the material, and a vacuum of 15 Å along the c axis ensured minimal spurious interactions between periodic images. Γ-centered k-point meshes of 3×3×3 and 3×3×1 were used for the bulk and surface calculations, respectively. The Hubbard (U) correction was not applied, as no universal U value captures the true electronic state of all elements. Also, given that we consider a dilute vacancy limit, the effect of electron localization is insignificant, as shown previously [45, 46].

8.3.2 Enforcing the 3-Step Criteria

Dopants were introduced by replacing a single Ce atom at the center of the bulk model and at the 1st trilayer of the surface model. Our analysis indicated that the majority of the dopants favored the surface site over the bulk site by ≈0.3 eV. Upon exploring the local coordination environment, a surface dopant was found to be 6-fold coordinated, whereas a bulk dopant was 8-fold coordinated. Given the preference for a surface site, all dopants are assumed to occupy the surface unless specified otherwise.

The primary effect of introducing dopants is to induce a local perturbation that disrupts bonding between the metallic and O atoms, thereby altering the ability of the surface to form O vacancies, as measured by E_3^D (cf. Fig. 8.1), computed here as

E_3^D = E^D_{CeO_2-V_o} - E^D_{CeO_2} + \frac{1}{2}\mu_{O_2}    (8.4)

where E^D_{CeO_2-V_o} and E^D_{CeO_2} are, respectively, the DFT energies of a doped surface with and without an O vacancy, and μ_O2 is the chemical potential of O, taken here to be the DFT energy of an isolated O2 molecule. In all cases, the O vacancy is created adjacent to the dopant. Figure 8.2 shows E_3^D for various choices of the dopants, with the dot-dashed horizontal line indicating the corresponding value for the undoped case. Dopants adopting a low valence state compared to Ce (e.g., alkali, alkaline earth and late transition series metals) display low O vacancy formation energies, consistent with the observed high O2 yield by ceria doped with Mn, Fe, Ni and Cu [47].
Conversely, dopants adopting a similar or higher valence state than Ce lead to high E_3^D values (e.g., Mo, Tc, and Ta). These trends are not entirely surprising, and have been noted before in CeO2 as well as BaTiO3 [48–50].

Fig. 8.2 Oxygen vacancy formation energy (E_3^D) of doped ceria with elements from the (a) 4th, (b) 5th and (c) 6th period of the Periodic Table. The dot-dashed maroon line indicates E_3^D for undoped ceria. The light green region indicates dopants that survived Criterion 1, while stars identify dopants that survived the 3 screening criteria

E_1^D helps assess the impact of dopants on the dissociative adsorption of water on the doped surface, and is computed as

E_1^D = E^D_{CeO_2-(H)(H)} - E^D_{CeO_2-V_o} - \mu_{H_2O}    (8.5)

where E^D_{CeO_2-(H)(H)} is the DFT energy of a doped surface upon the dissociative adsorption of water at the vacancy site. Upon dissociation, OH fills the vacancy site, while H has two possible adsorption sites: atop an adjacent O or a dopant atom. Interestingly, dopants exhibiting spontaneous vacancy formation (E_3^D < 0 eV) fail to accommodate a H atop a dopant, while those dopants that do facilitate H atop a dopant have an alternative lower-energy pathway for dissociation. μ_H2O is the chemical potential of water, taken here to be the DFT energy of an isolated H2O molecule.

With E_1^D and E_3^D at hand (and E_2^D given by ΔH − E_1^D − E_3^D), a plot that is equivalent to Fig. 8.1 but for the case of doped ceria surfaces is shown in Fig. 8.3. We now enforce Criterion 1, namely, 0 ≤ E_3^D ≤ E_3, with E_3 = 3.3 eV (this value is consistent with past work [45]). Of the 33 dopants originally considered, 19 dopants (Sc, Ti, V, Cr, Mn, Co, Y, Zr, Nb, Ru, Rh, Pd, La, Hf, Re, Os, Ir, Pt and Au) satisfy this criterion (given by the dopants within the shaded region in Fig. 8.2). Criterion 1 picks out those dopants that alter the surface reducibility in just the appropriate manner. Next, we enforce Criterion 2, namely, 0 ≤ E_1^D ≤ δ, with δ = 1.5 eV, on the 19 dopants that pass Criterion 1, resulting in the selection of Sc, V, Cr, Co, Y, Zr, Pd, La, Hf and Au. Lastly, enforcing Criterion 3 on these 10 dopants results in the down-selection of 4 promising candidates (Sc, Cr, Zr and La). Inspection of Fig. 8.3 shows that Pd and Y, although they do not pass Criterion 3, can be viewed as 'near misses'. These are hence included in our final list of favored candidates. Figure 8.4 summarizes the list of dopants that passed each stage of the screening process. The 6 dopants identified, namely, Sc, Cr, Zr, La, Pd and Y, lead to the desired energetic profiles, with E_1^D and E_2^D low enough to allow for reasonable H2O dissociation yields at moderate temperatures, and with E_3^D significantly smaller than that of undoped ceria, allowing for low reduction temperatures (cf. Fig. 8.3).

Fig. 8.3 Reaction pathway and energetics for the multistep thermochemical splitting of H2O on a doped ceria surface. CeO2^D-Vo is a doped surface with a vacancy, CeO2^D-(H)(H) is a doped surface with the vacancy filled by a H2O molecule, and CeO2^D is a doped stoichiometric surface. Colored solid lines identify the 4 promising dopants and undoped CeO2. Grey dashed lines identify the non-feasible dopants, while partly colored and greyed dashed lines identify dopants that pass Criterion 1

Fig. 8.4 A hierarchical chart showing the list of dopants before and after each stage of the screening process. Sc, Cr, Zr and La were identified as the promising dopant elements, whilst Pd and Y can be viewed as the near-miss cases

Dopants such as Mn, Fe, Ni, Cu, Sr, Ag, and
Dopants such as Mn, Fe, Ni, Cu, Sr, Ag and Ca, which display small or negative E_3^D values, do not pass our tests. Although low E_3^D values imply facile surface reduction (this is in fact what is observed experimentally for Mn and Fe) [47], such a tendency would not be appropriate for the multistep thermochemical water splitting process targeted here (lower yields were observed for Ni, Cu and Fe doped CeO2 compared to undoped CeO2) [28]. Criterion 1, as mentioned above, is imposed precisely to eliminate such candidates. However, dopants that lead to small or negative E_3^D may be appropriate for photocatalytic water splitting, which requires surface reduction to occur at low temperatures (≈300 K) [51]. Of the 6 promising dopants identified, experimental evidence exists for the enhanced performance of ceria doped with Sc, Cr and Zr for the thermochemical water splitting process. Cr-doped CeO2 is known to lower the reduction and oxidation temperatures to 750 and 350 K, respectively [35]. Zr and Sc dopants increase the H2 yield 4-fold and almost 2-fold, respectively, with respect to the undoped situation [28, 29, 38]. Lastly, although not conclusive, La doping appears to improve the H2 yield [39, 52]. The observed performances are strong functions of the synthesis, processing and measurement details. The present work ignores such complexities, and probes only the dominant and primary chemical factors that may control performance. Irrespective of these difficulties, such a guided screening strategy has led us to some promising candidates, shown as stars in Fig. 8.2. Clearly, the best candidates display an O vacancy formation energy in the 1–2.5 eV range, i.e., neither too high nor too low, thereby respecting Sabatier's principle. It thus appears that the O vacancy formation energy may be used as a 'descriptor' of the activity of doped ceria. This conclusion is consistent with an earlier similar proposal which was based on phase boundaries in surface phase diagrams of ceria exposed to an oxygen reservoir [45]. Thus far, by relying on first principles methods we are able to recognize whether a dopant increases or decreases the O vacancy formation energy with respect to the undoped material, followed by its corresponding impact on the dissociation of water. However, an understanding of the complex dependence of the O vacancy formation energy on the chemical attributes of a dopant is still absent. In the next section, with the help of data analysis methods, we attempt to understand the results of the first principles computations for the spectrum of dopants considered.

8.4 Data Analysis

The mining and extraction of information form the core of the field of data analysis, which lies under the broader umbrella of methods known as machine learning (ML) [53]. Within data analysis, a subset of methods known as feature selection allows us to unearth correlations between variables [10, 13, 53–56]. In the context of this work, the variables are the chemical factors characterizing a dopant and the corresponding O vacancy formation energy of doped ceria. Given the strong correlation between the O vacancy formation energy and the activity, as discussed above, identifying the key dopant factors that contribute to the O vacancy formation energy allows a more educated guess to be made about a dopant's impact on the corresponding thermodynamic activity.
To discover such patterns, each dopant element first needs to be represented numerically by a vector of numbers (also referred to as features or a fingerprint in the ML community) that uniquely identifies it. Our choice of features stems from fundamental chemical factors that are often used to describe elements in the periodic table. The 7 factors considered in this work are: atomic radius (AR), ionic radius (IR), covalent radius (CR), ionization energy (IE), electronegativity (EN), electron affinity (EA) and oxidation state (OS). To eliminate any bias induced by the spread of the feature values, the dataset was normalized to a mean of 0 and a variance of 1. On this set of chemical factors we use two feature selection methods, (i) principal component analysis and (ii) random forests, to narrow down the dominant factors that govern the descriptor (the O vacancy formation energy). In the sections to follow we provide a brief overview of these methods and discuss the insights gained. We refer the reader to [53, 57–61] for a more exhaustive description. The data analysis routines used were implemented within the MATLAB statistical toolkit and the Scikit-learn Python module [62, 63].

8.4.1 Principal Component Analysis

Principal component analysis (PCA) is a common dimensionality reduction technique, often used to identify the dominant subset of features from a larger pool. By transforming the original features into uncorrelated and orthogonal pseudo-variables that are linear combinations of the original features (as done in this work, although non-linear variants have recently been developed), it allows us to pinpoint the dominant contributions [10, 55–58]. The new transformed variables are referred to as principal components (PCs), which are solutions to the eigen-decomposition of the covariance matrix. As with any eigenvalue problem, the eigenvalues and eigenvectors play a critical role. The eigenvalue of a PC indicates the percentage of variance captured within the original dataset, whilst the eigenvector provides the coefficients that dictate the linear transformation. We shall make use of this information to down-select the dominant chemical factors of a dopant. First, we plot the transformation coefficient values of the 7 features for the first and second PCs in Fig. 8.5a. Such a plot is referred to as the loadings plot, in which correlated features cluster together. Only the first and second PCs are used, as together they capture ≈80 % of the variance within the original dataset (cf. inset of Fig. 8.5a). Clearly, the dopant's OS is strongly correlated with the O vacancy formation energy. The CR, AR, IE and EN are close to orthogonal to the O vacancy formation energy, suggesting a negligible contribution to the descriptor. On the other hand, the IR and EA are not truly orthogonal, and thus their contribution towards the descriptor cannot be ignored. Another interesting observation is the clustering of subsets of the 7 features. This is not entirely surprising, as one would recognize that the AR and CR are similar quantities, and their grouping in the loadings plot further validates this notion. Similarly, the IE and EN group together and appear negatively correlated with the AR and CR, given their ≈180° separation. By looking at the relative positions of all the features in Fig. 8.5a, we can conclude that of the original 7 features considered only three (OS, IR and EA) are important in governing the O vacancy formation energy.
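A minimal sketch of the feature set-up and the PCA step just described is given below, using the Scikit-learn module cited above [63]. The seven chemical factors and the O vacancy formation energy are represented here by random placeholder values rather than the actual data.

```python
# Illustrative sketch of the feature standardization and PCA step.
# The 33 dopants x 8 columns (7 chemical factors + Evac) are placeholders.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = ["AR", "IR", "CR", "IE", "EN", "EA", "OS", "Evac"]
X = rng.normal(size=(33, len(features)))     # placeholder data

# Normalize each column to zero mean and unit variance, as in the text.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)            # scores-plot coordinates (PC1, PC2)
loadings = pca.components_.T                 # one row of coefficients per feature

print("variance captured by PC1+PC2: "
      f"{pca.explained_variance_ratio_.sum():.2f}")
for name, (pc1, pc2) in zip(features, loadings):
    print(f"{name:4s}  PC1 coeff = {pc1:+.2f}   PC2 coeff = {pc2:+.2f}")
```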
Next, we use the linear transformation coefficients of the PCs to transform the original dopant dataset (also referred to as the scores plot) and plot the first and second PCs in Fig. 8.5b. Each dopant element in Fig. 8.5b has further been classified according to its relative location in the periodic table (as indicated by the different marker types) and the corresponding O vacancy formation energy (marker fill color).

Fig. 8.5 a PCA loadings plot showing the correlated dopant features. The features are atomic radius (AR), ionic radius (IR), covalent radius (CR), ionization energy (IE), electronegativity (EN), electron affinity (EA) and oxidation state (OS). Evac is the O vacancy formation energy. The inset shows the % contribution of each PC to the variance in the dataset. The oxidation state (OS) is the dominant feature governing the O vacancy formation energy. b PCA scores plot for the first and second principal components. The dopant elements group together based on their features and the O vacancy formation energy. Stars represent the final 6 dopants after the 3-step screening process. The 6 dopants occupy a sub-space of the scores plot, as highlighted by the grey region

First, dopants of similar type (groups 1–2, 3–7 and 8–12) can be seen to aggregate together. In particular, dopants that adopt a low valence state lie predominantly in the top/left quadrants, whilst the high valence dopants lie in the bottom/right quadrants, giving rise to an increasing O vacancy formation energy in the direction of the bottom right quadrant. Not surprisingly, amongst the low valence dopants, the alkali and alkaline earth metals further segregate from the late transition series metals, based on differences in atomic size, among other factors. Upon highlighting the locations of the 6 promising candidates (Sc, Cr, Y, Zr, Pd and La), as indicated by the stars, they can be seen to occupy only a small subspace of the plot (highlighted by the grey region of Fig. 8.5b). This suggests that in the high dimensional transformation these elements have similar traits, and equivalently a similar thermodynamic activity. Therefore, identifying other possible dopants that populate the grey region of Fig. 8.5b would further extend the chemical space available for improved water dissociation.

8.4.2 Random Forest

Another important class of feature selection algorithms is random forests (RF). Unlike PCA, random forests work by first constructing a regression (or classification) model, in this case between the 7 features and the O vacancy formation energy, following which the important features are extracted as a by-product. The framework is built upon an ensemble of individual regression models, also known as decision trees [53, 59–61]. The prediction of each individual tree is then averaged across the ensemble, resulting in the final predicted value.

Fig. 8.6 Relative feature importance arranged in descending order for the developed RF model. The features are atomic radius (AR), ionic radius (IR), covalent radius (CR), ionization energy (IE), electronegativity (EN), electron affinity (EA) and oxidation state (OS). Evac is the O vacancy formation energy. The inset shows a parity plot comparing the density functional theory (DFT) and RF-predicted O vacancy formation energies (Evac); the regression model has an R² value of 0.94. The oxidation state (OS) is the dominant feature governing the O vacancy formation energy
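For the random-forest route, a hedged sketch is shown below; the 75/25 split, the 250-tree ensemble and the impurity-based importances anticipate the settings detailed in the following paragraphs, while the feature and target arrays are synthetic placeholders.

```python
# Minimal random-forest sketch of the ensemble idea described above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 7))                       # 7 dopant features, 33 dopants (placeholders)
y = X[:, 6] * 1.5 + 0.2 * rng.normal(size=33)      # toy target dominated by the "OS"-like column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=250, random_state=0)  # bootstrapped trees by default
rf.fit(X_tr, y_tr)

print("test R^2:", round(r2_score(y_te, rf.predict(X_te)), 2))
# Mean-decrease-in-impurity importances, analogous to Fig. 8.6
for name, imp in zip(["AR", "IR", "CR", "IE", "EN", "EA", "OS"], rf.feature_importances_):
    print(f"{name:3s}  importance = {imp:.2f}")
```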
Given our limited dataset size (33 dopant elements), we selected a 75 % split for training, with the remainder kept aside for validation/testing. Each decision tree in the model is then trained on a subset of the original training dataset, a procedure known as bootstrapping. The combination of bootstrapping and ensemble averaging makes RF models robust and much less prone to overfitting, a common issue in ML. We generate a forest of 250 trees, based on the 7 dopant features described earlier and the O vacancy formation energy. The final regression model we obtained has an R² value of 0.95 (cf. inset of Fig. 8.6), suggesting a good fit. Then, using the mean decrease in impurity metric, we estimate the relative importance of each feature in the regression model [61]. In Fig. 8.6, we plot the relative importance of the 7 features in descending order. Clearly, the role of a dopant's OS supersedes all others. This observation is consistent with the PCA analysis above. It can also be seen that the IR and EA rank 2nd and 3rd in feature importance in the regression model, once again suggesting a small but non-negligible contribution towards the descriptor. Both the PCA and RF methods lead to similar conclusions, namely that the dopant's OS primarily governs the descriptor, i.e., the O vacancy formation energy, followed by much smaller contributions from the IR and the EA. Upon revisiting the OS of the 6 promising dopants, they adopt either a +3 or a +4 state. Therefore, as a first measure, by understanding the oxidation state and coordination environment of the dopant within the surface, one can make a reasonable first guess at its corresponding impact on the O vacancy formation energy. Even though many other elements such as Ti, V, Mn, Fe, Nb, Mo, Tc, Ru, Rh, Hf, Ta, Os and Ir adopt a similar OS, the combination of the OS, IR and EA skews them out of the optimal regime.

8.5 Summary and Outlook

In this work, we considered a host of dopants in cerium oxide spanning the 4th, 5th and 6th periods of the Periodic Table (specifically the alkali, alkaline earth and d-series elements) in order to understand their impact on the dissociation of water. Using a screening framework based on a first principles strategy augmented with data analysis methods, we successfully identified 6 promising dopants (Sc, Cr, Y, Zr, Pd and La), consistent with past experimental results, that are worthy of further inquiry. A dopant's oxidation state, ionic radius and electron affinity are found to be the dominant chemical factors that primarily govern the oxygen vacancy formation energy, which in turn governs the activity. The overall framework, we believe, can easily be extended to dopant selection in ceria and other oxides as well as to different chemical conversion processes (e.g., thermochemical CO2 splitting, chemical looping, etc.). Nevertheless, some open questions remain on the true measure of activity. First, kinetic factors, such as activation barriers, have been completely ignored in the present work. All the screening criteria were based on the thermodynamic requirements of the elementary steps, and serve as necessary but not sufficient conditions. Second, it is unclear what the impact of non-zero temperatures and gas phase component pressures would be on the computed quantities and final outcomes. Preliminary assessment based on first principles thermodynamics indicates that our main conclusions will be largely unchanged even when such factors are accounted for.
However, by incorporating more of such metrics, along with the guidelines from the data analysis methods, we can systematically refine the screening framework. Acknowledgments This work was supported financially by a grant from the National Science Foundation. Partial computational support through a National Science Foundation Teragrid allocation is also gratefully acknowledged. References 1. G. Ceder, K. Persson, The stuff of dreams. Sci. Am. 309, 36 (2013) 2. J. Neugebauer, T. Hickel, Density functional theory in materials science. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 3(5), 438–448 (2013) 3. G. Hautier, A. Jain, S.P. Ong, From the computer to the laboratory: materials discovery and design using first-principles calculations. J. Mater. Sci. 47, 7317 (2012) 8 Optimal Dopant Selection for Water Splitting … 169 4. A.D. Becke, Perspective: fifty years of density-functional theory in chemical physics. J. Chem. Phys. 140, 18A301 (2014) 5. T. Mueller, A.G. Kusne, R. Ramprasad, Machine learning in materials science: recent progress and emerging applications, in Reviews in Computational Chemistry, ed. by A.L. Parrill and K.B. Lipkowitz (Wiley, New York, 2016) 6. S. Srinivas, K. Rajan, property phase diagrams for compound semiconductors through data mining. Materials 6, 279 (2013) 7. G. Hautier, C.C. Fisher, A. Jain, T. Mueller, G. Ceder, Finding natures missing ternary oxide compounds using machine learning and density functional theory. Chem. Mater. 22, 3762 (2010) 8. C.C. Fischer, K.J. Tibbetts, D. Morgan, G. Ceder, Predicting crystal structure by merging data mining with quantum mechanics. Nat. Mat. 5, 641 (2006) 9. X. Zhang, L. Yu, A. Zakutayev, A. Zunger, Sorting stable versus unstable hypothetical compounds: the case of multi-functional abx half-heusler filled tetrahedral structures. Adv. Funct. Mater. 22, 1425 (2012) 10. P.V. Balachandran, S.R. Broderick, K. Rajan, Identifying the ‘inorganic gene’ for hightemperature piezoelectric perovskites through statistical learning. Proc. R. Soc. A 467, 2271 (2011) 11. E.W. Bucholtz, C.S. Kong, K.R. Marchman, W.G. Sawyer, S.R. Phillpot, S.B. Sinnot, K. Rajan, Data-driven model for estimation of friction coefficient via informatics methods. Tribol. Lett. 47, 211 (2012) 12. I.E. Castelli, K.W. Jacobsen, Designing rules and probabilistic weighting for fast materials discovery in the perovskite structure. Model. Simul. Mater. Sci. Eng. 22, 055007 (2014) 13. J. Carrete, W. Li, N. Mingo, S. Wang, S. Curtarolo, Finding unprecedentedly low-thermalconductivity half-heusler semiconductors via high-throughput materials modeling. Phys. Rev. X 4, 011019 (2014) 14. D.R. Hull, H. Prophet, Janaf thermochemical tables (2014), http://kinetics.nist.gov/janaf. Accessed 15 Jan 2014 15. S. Abanades, P. Charvin, G. Flamant, P. Neveu, Screening of water-splitting thermochemical cycles potentially attractive for hydrogen production by concentrated solar energy. Energy 31, 2805 (2006) 16. T. Nakamura, Hydrogen production from water utilizing solar heat at high temperatures. Sol. Energy 19, 467 (1977) 17. S. Abanades, G. Flamant, Solar hydrogen production from the thermal splitting of methane in a high temperature solar chemical reactor. Sol. Energy 80, 1611 (2006) 18. L. D’Souza, Thermochemical hydrogen production from water using reducible oxide materials: a critical review. Mater. Renew. Sust. Energy 2, 1 (2013) 19. W.C. Chueh, S.M. Haile, A thermochemical study of ceria: exploiting and old material for new modes of energy conversion and CO2 mitigation. Philos. Trans. 
R. Soc. A 368, 3269 (2010) 20. W.C. Chueh, S.M. Haile, Ceria as a thermochemical reaction medium for selectively generating syngas or methane from H2 O and CO2 . Chem. Sus. Chem. 2, 735 (2009) 21. W.C. Chueh, C. Falter, M. Abbott, D. Scipio, P. Furler, S.M. Haile, A. Steinfeld, High-flux solardriven thermochemical dissociation of CO2 and H2 O using nonstoichiometric ceria. Science 330, 1797 (2010) 22. A. Trovarelli, Catalysis by Ceria and Related Materials (World Scientific, London, 2002) 23. S. Kumar, P.K. Schelling, Density functional theory study of water adsorption at reduced and stoichiometric ceria (111) surfaces. J. Chem. Phys. 125, 204704 (2006) 24. H.T. Chen, Y.M. Choi, M. Liu, M.C. Lin, A theoretical study of surface reduction mechanisms of CeO2 (111) and (110) by H2 . Chem. Phys. Chem. 8, 849 (2007) 25. Z. Yang, Q. Wang, S. Wei, D. Ma, Q. Sun, The effect of environment on the reaction of water on the ceria(111) surface: a DFT+U study. J. Phys. Chem. C 114, 14891 (2010) 26. M. Fronzi, S. Piccinin, B. Delley, E. Traversa, C. Stampfl, Water adsorption on the stoichiometric and reduced CeO2 (111) surface: a first-principles investigation. Phys. Chem. Chem. Phys. 11, 9188 (2009) 170 V. Botu et al. 27. M. Molinari, S.C. Parker, D.C. Sayle, M.S. Islam, Water adsorption and its effect on the stability of low index stoichiometric and reduced surfaces of ceria. J. Phys. Chem. C 116, 7073 (2012) 28. Q.L. Meng, C. Lee, T. Ishihara, H. Kaneko, Y. Tamaura, Reactivity of CeO2 -based ceramics for solar hydrogen production via a two-step water-splitting cycle with concentrated solar energy. Int. J. Hydrog. Energy 36, 13435 (2011) 29. C. Lee, Q. Meng, H. Kaneko, Y. Tamaura, Solar hydrogen productivity of ceriascandia solid solution using two-step water-splitting cycle. J. Sol. Energy Eng. 1135, 011062 (2013) 30. C. Lee, Q. Meng, H. Kaneko, Y. Tamaura, Dopant effect on hydrogen generation in twostep water splitting with CeO2 -ZrO2 MOx reactive ceramics. Int. J. Hydrog. Energy 38, 15934 (2013) 31. R. Bader, L.J. Venstrom, J.H. Davidson, W. Lipinski, Thermodynamic analysis of isothermal redox cycling of ceria for solar fuel production. Energy Fuels 27, 5533 (2013) 32. L.J. Venstrom, N. Petkovich, S. Rudisill, A. Stein, J.H. Davidson, The effects of morphology on the oxidation of ceria by water and carbon dioxide. J. Sol. Energy Eng. 134, 011005 (2012) 33. G. Hua, L. Zhang, G. Fei, M. Fang, Enhanced catalytic activity induced by defects in mesoporous ceria nanotubes. J. Mater. Chem. 22, 6851 (2012) 34. J. Rossmeisl, W.G. Bessler, Trends in catalytic activity for SOFC anode materials. Solid State Ionics 178, 1694 (2008) 35. P. Singh, M.S. Hegde, Ce0.67 Cr0.33 O2 : a new low−temperature O2 evolution material and H2 generation catalyst by thermochemical splitting of water. Chem. Mater. 22, 762 (2010) 36. Y. An, M. Shen, J. Wang, Comparison of the microstructure and oxygen storage capacity modification of Ce0.67 . J. Alloy Compd. 441, 305 (2007) 37. M. Zhao, M. Shen, X. Wen, J. Wang, Ce−Zr−Sr ternary mixed oxides structural characteristics and oxygen storage capacity. J. Alloy Compd. 457, 578 (2008) 38. A.L. Gal, S. Abanades, N. Bion, T.L. Mercier, V. Harle, Reactivity of doped ceria-based mixed oxides for solar thermochemical hydrogen generation via two-step water-splitting cycles. Energy Fuels 27, 6068 (2013) 39. A.L. Gal, S. Abanades, Dopant incorporation in ceria for enhanced water-splitting activity during solar thermochemical hydrogen generation. J. Phys. Chem. C 116, 13516 (2012) 40. S. 
Abanades, A.L. Gal, CO2 splitting by thermo-chemical looping based on Zrx Ce1−x O2 oxygen carriers for synthetic fuel generation. Fuel 102, 180 (2012) 41. G. Kresse, J. Furthmuller, Efficient iterative schemes for ab initio total−energy calculations using a plane−wave basis set. Phys. Rev. B 54, 11169 (1996) 42. G. Kresse, D. Joubert, From ultrasoft pseudopotentials to the projector augmented−wave method. Phys. Rev. B 59, 1758 (1999) 43. J.P. Perdew, K. Burke, Y. Wang, Generalized gradient approximation for the exchangecorrelation hole of a many−electron system. Phys. Rev. B 54, 16533 (1996) 44. P.E. Blöchl, Projector augmented−wave method. Phys. Rev. B 50, 17953 (1994) 45. V. Botu, R. Ramprasad, A.B. Mhadeshwar, Ceria in an oxygen environment: surface phase equilibria and its descriptors. Surf. Sci. 619, 49 (2014) 46. M.B. Watkins, A.S. Foster, A.L. Shluger, Hydrogen cycle on CeO2 (111) surfaces: density functional theory calculations. J. Phys. Chem. C 111, 15337 (2007) 47. H. Kaneko, T. Miura, H. Ishihara, S. Taku, T. Yokoyama, H. Nakajima, Y. Tamaura, Reactive ceramics of CeO2 −MOx (M = Mn, Fe, Ni, Cu) for H2 generation by two−step water splitting using concentrated solar thermal energy. Energy 32, 656 (2007) 48. M. Krcha, A.D. Mayernick, M.J. Janik, Periodic trends of oxygen vacancy formation and c−h bond activation over transition metal−doped CeO2 (111) surfaces. J. Catal. 293, 103 (2012) 49. Z. Hu, H. Metiu, Effects of dopants on the energy of oxygen−vacancy formation at the surface of ceria: local or global. J. Phys. Chem. C 115, 17898 (2011) 50. V. Sharma, G. Pilania, G.A. Rossetti, K. Slenes, R. Ramprasad, Comprehensive examination of dopants and defects in BaTiO3 . Phys. Rev. B 87, 134109 (2013) 51. D. Channei, B. Inceesungvorn, N. Wetchakun, S. Phanichphant, A. Nakaruk, P. Koshy, C.C. Sorrell, Photocatalytic activity under visible light of Fe− nanoparticles synthesized by flame spray pyrolysis. Ceram. Int. 39, 3129 (2013) 8 Optimal Dopant Selection for Water Splitting … 171 52. T. Miki, T. Ogawa, M. Haneda, N. Kakuta, A. Ueno, S. Tateishi, S. Matsuura, M. Sato, Enhanced oxygen storage capacity of cerium oxides in CeO2 /La2 O3 /Al2 O3 containing precious metals. J. Phys. Chem. 94, 6464 (1990) 53. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. (Springer, New York, 2009) 54. I. Guyon, A. Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003) 55. E.W. Bucholz, C.S. Kong, K.R. Marchman, W.G. Sawyer, S.R. Phillpot, S.B. Sinnott, K. Rajan, Data-driven model for estimation of friction coefficient via informatics methods. Tribol. Lett. 47(2), 211–221 (2012) 56. S.C. Sieg, C. Suh, T. Schmidt, M. Stukowski, K. Rajan, W.F. Maier, Principal component analysis of catalytic functions in the composition space of heterogeneous catalysts. QSAR Comb. Sci. 26(4), 528–535 (2007) 57. J.E. Jackson, A User’s Guide to Principal Components (Wiley, New York, 1991) 58. I.T. Jolliffe, Principal Component Analysis (Springer, New York, 2002) 59. J. Shotton A. Criminisi, E. Konukoglu, Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning. Technical Report 114, Microsoft Research Technical Report (2011) 60. L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001) 61. L. Breiman, J. Friedman, C.J. Stone, R.A. 
Olshen, Classification and Regression Trees, The Wadsworth and Brooks-Cole statistics-probability series (Taylor & Francis, Boca Raton, 1984) 62. MATLAB, version 8.0.0.783 (R2012b). The MathWorks Inc., Natick, Massachusetts (2012) 63. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) Chapter 9 Toward Materials Discovery with First-Principles Datasets and Learning Methods Isao Tanaka and Atsuto Seko Abstract When the rule to determine the target property is known a priori, and the computational cost for the predictors with the DFT accuracy is not too high to cover the whole library within the practical time frame, “high throughput screening” of first principles (DFT) database is a straightforward strategy for materials discovery. Otherwise we need to adopt learning methods using predictors that can cover the whole library. The learning techniques make a model to estimate the target property, which can be used for “virtual screening” of the library. Here, we show a few examples how such techniques have been used for materials discovery. 9.1 Introduction Historically materials discovery for a particular application was achieved by chance after lengthy trial-and-error iterations, neither by rational exploration of chemical compositional space, nor on the basis of clear design principles. The situation is changing because of the emergence of two important tools: One is the establishment of efficient first principles calculations with predictive performance. Thanks to the recent progress of computational power and techniques, a large number of density functional theory (DFT) calculations can be performed and the results are stored as big databases. Such databases are available for public uses now, such as Materials Project Database (MPD) [1], Automatic Flow of Materials Discovery Library (aflowlib) [2], and Open Quantum Materials Database(OQMD) [3]. The other important progress can be seen on techniques capable of efficient data mining. Combining DFT database and the data mining techniques, accelerated discovery of materials can be expected. Information techniques to solve chemistry problems have been I. Tanaka (B) · A. Seko Department of Materials Science and Engineering, Kyoto University, Kyoto 606-8501, Japan e-mail: tanaka@cms.mtl.kyoto-u.ac.jp A. Seko e-mail: seko@cms.mtl.kyoto-u.ac.jp © Springer International Publishing Switzerland 2016 T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_9 173 174 I. Tanaka and A. Seko called “cheminformatics” (or “chemoinformatics”). A part of the cheminformatics aiming at quantitative estimation of chemical or biological activities of chemicals from physico-chemical and structural database is called “quantitative structureactivity relationship (QSAR)” technique. The term “quantitative structure-property relationship (QSPR)” is used for similar context. Such techniques have been successful in the fields of drug discovery and organic chemistry. As for inorganic materials, however, the use of informational techniques has started just recently. This can be ascribed to the diversity of chemical elements, crystal structures and target properties of inorganic materials. 
Their structure-property relationships are often more complicated. Before the emergence of the DFT databases, quantitative description of materials properties was very difficult. Strategies for materials exploration with DFT calculations should differ depending upon many factors, such as (1) the availability of expert knowledge in the form of a physical or phenomenological rule, (2) the abundance of experimental data, (3) the computational cost of estimating physical quantities with DFT accuracy, and (4) the extent of the exploration space. Figure 9.1 shows two extreme cases of materials exploration with DFT calculations. When the physical rule and descriptors for the target property are well established, and all descriptors can be easily computed by an ordinary DFT method, it is possible to perform DFT calculations for all compounds in a library in order to carry out "high-throughput screening" as shown in Fig. 9.1a. Candidates can then be "discovered" in a straightforward manner. In the other extreme case, shown in Fig. 9.1b, the rule determining the target property is not known a priori. Therefore, we should consider predictors that can cover the whole exploration space. Learning techniques should then be used to select predictors for making a model to estimate the target property. A library can be used for "virtual screening" to find candidates. A verification process may be required to examine the predictive power of the model when virtual screening is performed. After receiving the verification results, the model can be revised. Models and the quality of the screening can be improved iteratively through a Bayesian optimization process. Virtual screening is also useful when high-throughput screening is not realistic, i.e. when the computational cost of the descriptors is too high to cover the whole library within a practical time frame. The same holds when the exploration space is too large to cover. In this article, some recent examples of materials discovery with DFT datasets and learning methods are given.

Fig. 9.1 a Scheme of high-throughput screening with DFT datasets. b Scheme of virtual screening by combination of DFT datasets and learning methods

9.2 High Throughput Screening of DFT Data—Cathode Materials of Lithium ion Batteries

When the physics behind the target property is simple and the major ingredients of the physical rule are computable by ordinary DFT methods, high-throughput screening (HTS) of a DFT database is a straightforward strategy for materials discovery, as shown in Fig. 9.1a. Phenomenological or empirical rules and expert knowledge can be used instead of physical rules. In such cases, the selection of "good" descriptors is the critical step for the success of HTS. HTS with DFT data has been used for materials discovery for lithium ion batteries (LIB). Ceder [4] performed pioneering HTS of cathode materials for LIBs from dual viewpoints, namely charge/discharge capacity and safety. The average battery voltage of a cathode material between the fully delithiated (charged) and fully lithiated (discharged) conditions is given by the difference of the chemical potentials between these two conditions. The safety of oxygen-containing cathodes (e.g., oxides, phosphates, silicates, etc.) can be related to the equilibrium oxygen chemical potential of the delithiated state. If the oxygen chemical potential is lower, the cathode is less prone to burning the coexisting electrolyte in the battery, which is expected to increase the safety of the battery. A capacity-safety diagram made from a set of DFT calculations was used for the HTS of cathode materials.
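A hedged sketch of this kind of capacity/safety screen is given below. The average-voltage expression is the standard one for an intercalation reaction between the fully lithiated and delithiated limits; the energies, the oxygen-chemical-potential threshold and the candidate names are hypothetical placeholders, not values from [4].

```python
# Sketch of the two HTS descriptors: average voltage and the oxygen chemical
# potential of the delithiated state. All numbers are placeholders (eV).

def average_voltage(e_lithiated, e_delithiated, e_li_metal, x_li):
    """V_avg = -[E(lithiated) - E(delithiated) - x*E(Li metal)] / x  (in volts)."""
    return -(e_lithiated - e_delithiated - x_li * e_li_metal) / x_li

candidates = {
    # name: (E_lithiated, E_delithiated, mu_O2 of the delithiated state)
    "oxide A":     (-45.2, -40.6, -8.9),
    "phosphate B": (-61.3, -57.2, -10.4),
}
E_LI = -1.9          # Li metal reference energy (placeholder)
MU_O2_SAFE = -10.0   # illustrative safety threshold on the O chemical potential

for name, (e_lith, e_delith, mu_o2) in candidates.items():
    v = average_voltage(e_lith, e_delith, E_LI, x_li=1.0)
    safe = mu_o2 <= MU_O2_SAFE     # lower oxygen chemical potential -> less oxidizing cathode
    print(f"{name:12s}  V_avg = {v:4.2f} V   passes safety screen: {safe}")
```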
Many other properties are required when selecting cathode materials. When an LIB is designed for large-scale energy storage, for efficiently utilizing renewable power sources such as solar and wind, a long cycle life is critical. This is different from batteries for portable devices. Cycle life is typically defined as the number of complete charge/discharge cycles before the capacity falls to a certain level, say 70 %, of its initial value. The target for a long-life battery is more than 70 % capacity retention after 10,000 cycles. If this target is met, the LIB can be used in practice for over 30 years with a daily charge/discharge cycle. The work by Nishijima and coworkers [5] aimed to develop cathode materials that exhibit prolonged cycle lives by substituting a range of solute elements onto the different cation sites of the LiFePO4 (LFP) material. LFP was chosen because it has advantages in cost, safety and cycle life among the range of LIB cathode materials [6]. When renewable energy-storage applications are considered, however, the cycle life of LFP needs to be further improved. The cycle life of battery cathodes is not a quantity that can be derived from a simple physical model. It is determined by the degradation rate during repeated charge/discharge cycles, which is influenced by many different factors. The charge and discharge process for LFP proceeds via a two-phase reaction, which inevitably produces interphase boundaries between phases with different lattice parameters [7]. The volume change of the crystalline lattice between LFP and fully delithiated FePO4 (FP) is 6.5 % [6]. Micro-cracks are often formed due to the stress inside the LFP cathodes during repeated charge/discharge cycles, which is widely accepted as the major degradation mechanism of the LFP cathode [8]. The degradation could therefore be retarded by reducing the volume change of the crystalline lattice during the charge/discharge cycle. Nishijima et al. [5] assumed that the relative volume change (RVC) of a compound between the fully lithiated and delithiated conditions can be used as the descriptor for the cycle life. They then explored a wide chemical compositional space in order to optimize the solute atoms in LFP cathode materials for prolonged cycle life by systematic DFT calculations. Based upon the results of the screening, synthesis of selected materials was targeted. The strategy is similar to that in Fig. 9.1a, although the rule is based upon intuition or empirical knowledge. A large set of DFT calculations was systematically made for many different kinds of solute elements substituted onto the three possible cation sites of LFP. Co-substitution of aliovalent elements was used to maintain charge neutrality, assuming that the formal ionic charges were unchanged. For example, when Zr4+ and Si4+ were incorporated and located at the Fe2+ and P5+ sites, respectively, two Si atoms and one Zr atom were put into the supercell of the DFT calculation. This situation can be expressed as (Zr_Fe + 2 Si_P). DFT calculations were made thoroughly for all possible solute arrangements within the unit cell composed of four formula units of LFP (i.e., 28 atoms).
The lowest energy structure among them was adopted as the one representing the given chemical composition. The relative volume change (RVC) obtained for (A_Li, M_Fe, X_P) with X = Si is shown in Fig. 9.2a. The RVC was defined as 100 · (V_L − V_D)/V_L (%), where V_L and V_D denote the lattice volumes of the lithiated and delithiated materials, respectively. As can be seen in Fig. 9.2a, the RVC is notably small when M = Zr. Since substitution of Li sites by other elements reduces the battery capacity, Nishijima et al. decided to focus their efforts on the (Zr_Fe + 2 Si_P) system, which was called Z2S. Its chemical formula is Li(Fe_{1−x}Zr_x)(P_{1−2x}Si_{2x})O4. DFT calculations for Z2S with supercells composed of 8 and 16 formula units were additionally made, which correspond to x = 0.125 and 0.0625, respectively. The results for the RVC are shown in Fig. 9.2b. The RVC decreases linearly with the solute concentration. Synthesis experiments were then performed for Z2S with varying x based on the results of the HTS. By optimizing the processing parameters, single-phase solid-solution samples were successfully synthesized. Structural analysis by powder x-ray diffraction (XRD) showed that the Z2S samples were single phase up to x = 0.125. The samples were then subjected to electrochemical experiments. The experimental RVC is shown in Fig. 9.2b for comparison with the computed values. Satisfactory agreement between the experiments and the computed results can be seen. The experimental RVC decreased linearly with x from 6.3 % (x = 0) to 3.7 % (x = 0.125).

Fig. 9.2 a Relative volume change (RVC) between lithiated and delithiated co-substituted LFP for (A_Li, M_Fe, Si_P) by DFT calculations. b Comparison of experimental and DFT RVC for Z2S samples [5]

Finally, the cycle life performance was examined for the Z2S cathode in a laminated pouch cell using a natural graphite anode. A cell with a pristine LFP cathode was prepared for comparison. The cycle life with 80 % capacity retention was 10,000 cycles for the cell with the Z2S (x = 0.050) cathode, whereas it was 1,800 cycles for the cell with the pristine cathode. The significant increase in cycle life was ascribed to the difference in the cathodes, since all other components of the cell and the cell testing were the same. The cycle life for 70 % capacity retention was estimated to be 25,000 cycles for the Z2S (x = 0.050) cathode, which corresponds to a lifetime of 68 years with daily charge/discharge cycles. HTS works using DFT databases have been reported for many other applications. Curtarolo and coworkers [9] reviewed such works and gave a list of descriptors for several problems such as nano-sintered thermoelectrics, topological insulators, non-proportionality in scintillators, and so on.
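Returning to the volume-change descriptor at the heart of the LFP screening above, it is simple enough to sketch directly; the volumes below are placeholders rather than the published DFT values, and the composition labels are illustrative.

```python
# Sketch of the RVC descriptor: RVC = 100 * (V_L - V_D) / V_L, with V_L and
# V_D the lithiated and delithiated cell volumes (placeholder values, A^3).

def rvc(v_lithiated, v_delithiated):
    return 100.0 * (v_lithiated - v_delithiated) / v_lithiated

systems = {
    "LFP (pristine)":          (291.0, 272.1),
    "(Zr_Fe + 2 Si_P), x=1/4": (293.5, 284.0),
    "(M_Fe + Si_P) variant":   (292.0, 279.2),
}

# Rank the co-substitution schemes by RVC (smaller is better for cycle life).
ranked = sorted(systems.items(), key=lambda kv: rvc(*kv[1]))
for name, (vl, vd) in ranked:
    print(f"{name:26s}  RVC = {rvc(vl, vd):4.1f} %")
```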
9.3 Combination of DFT Data and Machine Learning I—Melting Temperatures

In the previous section, examples of HTS with DFT data were described. As noted in the Introduction, HTS is not realistic when the computational cost of evaluating the predictors with DFT accuracy is too high to cover the whole library within a practical time frame and/or when the rule determining the target property is not known a priori. Then we have two choices. One is to limit the exploration space: fixing a certain crystal structure and limiting the chemical compositions are typical strategies. The other is to adopt learning techniques using predictors that can cover the whole library. The learning techniques make a model to estimate the target property, which can be used for "virtual screening" of a materials library. Here, we demonstrate applications of the combination of DFT data and machine learning for three kinds of target properties: the melting temperature, the ionic conductivity in solid-state electrolytes, and the lattice thermal conductivity in thermoelectric materials. Experimental data for inorganic substances have been well collected for thermal properties. Let us take the example of the melting temperature, for which experimental data are abundant. It is also important that the melting temperature is not keenly sensitive to microstructure or sub-percent-level impurities. The scatter of the experimental data between different experimental groups is therefore expected to be much smaller than for other structure- or impurity-sensitive properties. The Lindemann rule [10] is often quoted as a model for explaining the melting temperature. It is based on the naive idea that melting occurs when the amplitude of the thermal vibration of the atoms in a substance exceeds a certain critical fraction of the interatomic distance. Although several modifications of the Lindemann rule have been proposed [11–13], it is still far from predicting the melting temperature quantitatively for an arbitrarily selected material. Other rules to determine melting temperatures have been proposed for certain classes of materials, i.e., elemental metals [14], covalent crystals [15] and intermetallic compounds [16]. Meanwhile, a machine learning technique has been applied to the prediction of the melting temperature for AB suboctet compounds [17]. Seko et al. [18] made a combined study of DFT calculations and regression techniques for the prediction of the melting temperature of single and binary compounds. The experimental dataset was obtained from a standard physics and chemistry handbook [19]. Melting temperatures of 248 compounds, ranging from room temperature to 3273 K, were used. The set of compounds did not contain transition metals or their compounds, to avoid complexity in the DFT calculations. Two sets of predictors, as shown in Table 9.1, were used for the regression. One is a set of 4 predictors, x1 to x4, such as the crystalline volume and the cohesive energy, which were obtained by DFT calculations. DFT calculations were made for all polymorph structures given in the Inorganic Crystal Structure Database (ICSD). The physical properties of the lowest-energy crystal structure were then adopted as predictors [19]. The other set of predictors, x5 to x23, is raw or primitive information taken from the Periodic Table and the handbook, such as the atomic number, atomic mass and electronegativity. Ten variables were made symmetric with respect to the exchange of atomic species in binary compounds to obtain 19 predictors. Note that the sum form of the composition is always unity and was therefore not used as a predictor. First, all of these 23 predictors, which were selected without much intuition, were used for modelling by regression. These predictors were divided into two sets. Predictor set (1) is composed only of the symmetric predictors of primitive information, x5 to x23, and contains no information from the DFT calculations. Predictor set (2) is composed of all 23 variables, x1 to x23, including the 4 variables from the DFT calculations. In order to estimate the prediction error, the data set was divided into training and test data.
A randomly selected quarter of the data set was regarded as the test data, and the remainder as the training data. This was repeated 30 times, and the averages of the 10-fold cross-validation (CV) scores and of the root-mean-square (RMS) errors between the predicted and experimental melting temperatures of the test data were evaluated.

Table 9.1 Predictors used for a model of the melting temperatures

DFT predictors: Volume V (x1) | Nearest-neighbor pair distance r_NN (x2) | Cohesive energy E_coh (x3) | Bulk modulus B (x4)

Elemental quantity             | Product form            | Sum form
Composition, c                 | c_A c_B (x5)            | (not used; c_A + c_B = 1)
Atomic number, Z               | Z_A Z_B (x7)            | Z_A + Z_B (x6)
Atomic mass, m                 | m_A m_B (x9)            | m_A + m_B (x8)
Number of valence electrons, n | n_A n_B (x11)           | n_A + n_B (x10)
Group, g                       | g_A g_B (x13)           | g_A + g_B (x12)
Period, p                      | p_A p_B (x15)           | p_A + p_B (x14)
van der Waals radius, r^vdw    | r_A^vdw r_B^vdw (x17)   | r_A^vdw + r_B^vdw (x16)
Covalent radius, r^cov         | r_A^cov r_B^cov (x19)   | r_A^cov + r_B^cov (x18)
Electronegativity, χ           | χ_A χ_B (x21)           | χ_A + χ_B (x20)
First ionization energy, I     | I_A I_B (x23)           | I_A + I_B (x22)

The set of 4 predictors, x1 to x4, was obtained by DFT calculations. The other set of predictors, x5 to x23, is raw or primitive information taken from the Periodic Table and the handbook [19], made symmetric with respect to the exchange of atomic species in binary compounds [18]

Figure 9.3 summarizes the results of both ordinary least-squares regression (OLSR) and support vector regression (SVR) with predictor sets (1) and (2). CV scores and RMS errors are shown together. The figures were taken from one of the 30 trials of random divisions of the data set. The use of predictor set (2) was found to significantly improve the model. At the same time, it can be pointed out that SVR effectively reduced the error even when the predictor set without the DFT results was used. A systematic deviation of the predicted values from the experimental ones can be seen for OLSR with predictor set (1) in the high temperature region above 1500 K. This can be ascribed to the difficulty of representing high melting temperatures simply by a linear combination of the 19 predictors included in set (1). The situation was improved by the use of the non-linear SVR model with the same predictor set (1). The fitting of the high temperature part by OLSR was much improved when the 4 additional DFT predictors were included, as in set (2).

Fig. 9.3 Results by ordinary least-squares regression (OLSR) and support vector regression (SVR) with predictor sets (1) without DFT datasets and (2) with DFT datasets. CV scores and RMS errors in units of K are shown in the corresponding boxes [18]. a OLSR (without DFT). b OLSR (with DFT). c SVR (without DFT). d SVR (with DFT)

All results shown in Fig. 9.3 were obtained with all predictors in either set (1) or set (2). Let us now consider the selection of "good" predictors among them. For this purpose a stepwise regression method with bidirectional elimination [20], based on the minimization of the Akaike information criterion (AIC) [21], was adopted. The best prediction model with the minimum AIC was found to be composed of 10 predictors and has an RMS error of 295 K by OLSR, which is smaller than the RMS error of OLSR with all 23 predictors. Figure 9.4a shows that the RMS error decreased rapidly and almost converged at 5 predictors.
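As a hedged illustration of the OLSR/SVR comparison summarized in Fig. 9.3, the sketch below evaluates both regressors with a random 75/25 split and a 10-fold CV score; the 248-compound predictor matrix is mimicked by synthetic data, and the SVR hyper-parameters are arbitrary choices rather than the optimized ones of [18].

```python
# Sketch of the OLSR vs. SVR baselines with CV and test RMS errors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(248, 23))                          # 23 predictors (sets (1)+(2)), placeholders
y = 1500 + 400 * X[:, 2] + 80 * rng.normal(size=248)    # toy melting temperatures (K)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in [("OLSR", LinearRegression()),
                    ("SVR", SVR(kernel="rbf", C=100.0, gamma="scale"))]:
    cv_rmse = np.sqrt(-cross_val_score(model, X_tr, y_tr, cv=10,
                                       scoring="neg_mean_squared_error")).mean()
    model.fit(X_tr, y_tr)
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name:4s}  CV RMSE = {cv_rmse:6.1f} K   test RMSE = {test_rmse:6.1f} K")
```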
The prediction model with 5 predictors showed an RMS error of 320 K. The selected 5 predictors were E_coh, χ_A + χ_B, B, c_A c_B, and r_NN. Three of the 5 predictors were those computed by the DFT calculations. Figure 9.4b shows the standardized regression coefficients of the prediction model with the 5 predictors. The absolute value of the standardized regression coefficient for E_coh, which is the first predictor selected by the stepwise regression, is the largest among the coefficients of the 5 predictors. Hence, E_coh contributes the most to the prediction of the melting temperature.

Fig. 9.4 a Variation of the RMS error of the prediction model for the melting temperature with the number of descriptors selected according to the AIC. b The standardized regression coefficients of the prediction model with the 5 predictors [18]

It may sound natural to find a good correlation between the melting temperature and E_coh. Indeed, Guinea et al. [14] proposed a linear relationship between the melting temperature and E_coh for metals and alloys. Recently, a linear relationship between the melting temperature and the bulk modulus, B, was also proposed by Lejaeghere et al. [22] for elemental crystals. However, in the work by Seko et al. [18] the prediction with E_coh alone was poor, with an RMS error exceeding 430 K for the 248 compounds. The error was even larger with B alone. These facts imply that models based only on E_coh or B are not universally applicable for predicting the melting temperature, and are useful only for elemental crystals and alloys.

Once the model is made by the machine learning process, it can be used for virtual screening as shown in Fig. 9.1b. The process can then be followed by a Bayesian optimization procedure. Here we show an example of such optimization, by Seko et al. [18], for finding the compound with the highest melting temperature by kriging. Kriging is built on Gaussian processes. Figure 9.5a shows a typical situation where several sample points are available. In kriging, the next sampling point is sought where the chance of getting beyond the current best target property is optimal. To this aim, a Bayesian regression method such as a Gaussian process is applied, and the probability distribution of the target property at all possible parameter values can be obtained, as illustrated in Fig. 9.5a.

Fig. 9.5 a A typical situation in kriging. Gaussian process regression (GPR) is applied to the available samples (asterisks) to make a prediction model, shown by the blue line. The probability distribution of the target property for all possible compounds is shown by orange closed circles. b Highest melting temperature among the observed compounds in simulations for finding the compound with the highest melting temperature based on kriging and random compound selections [18]
Then the next sampling point is determined as the one with the highest probability of improvement. Here kriging was applied to find the compound with the highest melting temperature from a pool of compounds. The procedure can be organized as follows: (1) An initial training set is first prepared by randomly choosing compounds. (2) A compound is selected based on GPR. The compound is chosen as the one with the largest probability of getting beyond the current best value f_best. Since this probability is a monotonically increasing function of the z score, z = [f(x*) − f_best] / √v(x*), where f(x*) and v(x*) are the predicted mean and variance of the GPR model at a candidate point x*, the compound with the highest z score is chosen from the pool of unobserved materials. (3) The melting temperature of the selected compound is observed. (4) The selected compound is added to the training data set. The simulation then goes back to step (2). Steps (2)–(4) are repeated until all melting temperature data are included in the training set. Here the kriging of the melting temperature was started from a data set of 12 compounds. For comparison, a simulation based on the random selection of compounds was also performed. Both the kriging and random simulations were repeated 30 times and the average number of compounds required to find the compound with the highest melting temperature was recorded. Figure 9.5b shows the highest melting temperature among the observed compounds during one of the 30 kriging and random trials. As can be seen in Fig. 9.5b, the compound with the highest melting temperature was found much more efficiently using kriging. The average numbers of observed compounds required to find the compound with the highest melting temperature over the 30 trials using the kriging and random compound selections were 16.1 and 133.4, respectively; hence kriging substantially improved the efficiency of discovery.
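A compact sketch of this kriging loop, built on Gaussian process regression and the z-score acquisition of step (2), is given below. The compound pool, the predictors and the hidden melting temperatures are synthetic placeholders, so the numbers it prints are illustrative only.

```python
# Hedged sketch of the kriging loop: fit a GPR model to the observed
# compounds, score every unobserved compound by z = (mean - best)/std,
# and "measure" the top-ranked one next.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
n_pool = 248
X_pool = rng.normal(size=(n_pool, 5))                                # 5 predictors per compound
T_melt = 1500 + 400 * X_pool[:, 0] + 50 * rng.normal(size=n_pool)    # hidden "truth" (K)

observed = list(rng.choice(n_pool, size=12, replace=False))          # initial random training set
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)

for step in range(30):
    gp.fit(X_pool[observed], T_melt[observed])
    candidates = [i for i in range(n_pool) if i not in observed]
    mu, sigma = gp.predict(X_pool[candidates], return_std=True)
    best = T_melt[observed].max()
    z = (mu - best) / np.maximum(sigma, 1e-9)                        # z-score acquisition
    observed.append(candidates[int(np.argmax(z))])                   # observe the best candidate

print("highest melting temperature found:", round(T_melt[observed].max(), 1), "K")
print("global maximum in the pool:       ", round(T_melt.max(), 1), "K")
```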
9.4 Combination of DFT Data and Machine Learning II—Lithium ion Conducting Oxides

The lithium-ion conducting oxides in the system LiO1/2–AOm/2–BOn/2 (where m and n denote the formal valences of the cations A and B, respectively) are known as LISICONs (LIthium Super Ionic CONductors) [23]. They have the general formula Li_{8−c}A_aB_bO4 (where c = ma + nb). Although the conducting properties of many different LISICONs have been intensively studied since the 1970s, there are still many compositions that have not been reported experimentally. In some cases, results from different groups vary considerably [24–28]. Arrhenius plots of Li-ion conductivities from previous experimental data are shown in Fig. 9.6a. The conductivity changes considerably depending on the chemical composition. First principles molecular dynamics (FPMD) calculations can be used to estimate the atomic diffusivity. However, high computational costs hinder their use for HTS over a wide range of materials. Typically FPMD can be run for less than 100 ps (or 10^5 MD steps), which limits the lowest diffusivity accessible by FPMD to the order of D = 10^−10 m²/s. FPMD results alone therefore cannot be used as predictors of lower-diffusivity events.

Fig. 9.6 a Summary of Arrhenius plots of experimental Li-ion conductivities of LISICON compounds in the literature. b Comparison of experimental and FPMD results for Li_{2+2x}Zn_{1−x}GeO4 (x = 0.25, 0.50 and 0.75) [29]

Fujimura et al. made systematic FPMD calculations of LISICON materials above 1000 K [29]. Arrhenius plots of the Li-ion diffusion coefficients calculated by FPMD for the Li-ion conductors Li_{2+2x}Zn_{1−x}GeO4 (x = 0.25, 0.50 and 0.75) were compared with experimental data, shown as open circles [24] and open triangles [30] in Fig. 9.6b. The extrapolation of the FPMD results to lower temperatures, <800 K, where FPMD is not practical, showed satisfactory agreement with the experimental results for all three compositions. At the same time, one can point out the presence of deflection points in the experimental conductivity for x = 0.75 and 0.50, which are by no means reproduced by the extrapolation from the high temperature FPMD results. Fujimura et al. [29] assumed that the deflection point corresponds to the order/disorder transition temperature of the Li ions on the octahedral sites within the LISICON structure. The transition temperature, T_c, was then estimated by a systematic set of DFT calculations and cluster expansion analyses. The estimated T_c values were 380, 750 and 1150 K for x = 0.75, 0.50 and 0.25 of Li_{2+2x}Zn_{1−x}GeO4, respectively. The tendency of the deflection point to increase with decreasing x was reproduced by the estimated T_c. Although the FPMD results can be well extrapolated to lower temperatures above T_c, prediction of the conductivity below T_c is difficult. In order to estimate the diffusivity near room temperature, which is typically below T_c, one needs to use additional predictors on top of the FPMD diffusivity and T_c. In order to select the predictors and examine the prediction error, the raw experimental data points shown in Fig. 9.6a were used for machine learning. Experimental diffusivities, D(T), at temperature T were "learned" to make prediction models of the diffusivity at a given temperature, T_0. Fujimura et al. [29] used T, D_1600 (D at 1600 K), T_c and the crystalline volume of the disordered structure, V_dis, as predictors for the ionic conductivity at T_0 = 373 K, σ_373, using the support vector regression (SVR) method with a Gaussian kernel [31]. The variance of the Gaussian kernel, the regularization constant and the forms of the independent variables were optimized by minimizing the prediction error estimated by the bootstrapping method [32]. V_dis was calculated by averaging the volumes calculated by the DFT method for a few structures with randomly selected Li-ion arrangements on the octahedral sites.

Fig. 9.7 a Variation of the RMS error for the Li-ion conductivities at 373 K, σ_373, with four different sets of predictors. b Predicted σ_373 for 72 compositions of LISICON compounds with the model using all four predictors [29]

The prediction errors of the SVR models for σ_373 with four different sets of predictors are compared in Fig. 9.7a. The error increased when only T_c was added to T and D_1600. This sounds odd from the viewpoint of the physical mechanism. However, it can be understood by looking at the experimental data shown in Fig. 9.6a, which do not always exhibit two stages separated by a deflection point in the conductivity. The error was lower when V_dis was included in the predictor set. The lowest error was obtained when all of T, D_1600, T_c and V_dis were used.
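A hedged sketch of such a model, mapping the four predictors (T, D_1600, T_c, V_dis) to the conductivity with an RBF-kernel SVR and a small grid search standing in for the bootstrap-based optimization of [29, 32], might look as follows; all data here are synthetic placeholders.

```python
# Sketch of an SVR (Gaussian kernel) model for the ionic conductivity,
# trained on synthetic stand-ins for the experimental points of Fig. 9.6a.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n = 120
X = np.column_stack([
    rng.uniform(300, 700, n),        # T (K)
    rng.uniform(-10, -8, n),         # log10 D_1600
    rng.uniform(300, 1200, n),       # T_c (K)
    rng.uniform(280, 320, n),        # V_dis (A^3)
])
log_sigma = (-4 + 0.003 * (X[:, 0] - 300) + (X[:, 1] + 9)
             - 0.001 * (X[:, 2] - 300) + 0.1 * rng.normal(size=n))  # toy log10 conductivity

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(model, {"svr__C": [1, 10, 100], "svr__gamma": [0.1, 1.0]}, cv=5)
grid.fit(X, log_sigma)

print("best parameters:", grid.best_params_)
print("example prediction of log10 sigma_373:",
      round(float(grid.predict([[373, -9.0, 400, 300]])[0]), 2))
```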
Figure 9.7b shows the predicted σ_373 for 72 compositions. Even though the theoretical datasets do not explicitly contain information about the activation energies, systems with high D_1600 and low T_c tend to have high σ_373, as expected. The conductivities of compounds with low Zn content, such as Li_{2+2x}Zn_{1−x}GeO4 (x = 0.75) with high D_1600 and low T_c, were greater than those of compounds with high Zn content, such as Li_{2+2x}Zn_{1−x}GeO4 (x = 0.25) with high T_c. This result explained the trend observed by experimentalists, namely that the original LISICON composition Li3.5Zn0.25GeO4 has one of the highest Li-ion conductivities. In this study, Li4GeO4 was predicted to have the highest σ_373 of all 72 compounds. However, it has not yet been synthesized because it generally crystallizes into a different crystal structure.

9.5 Combination of DFT Data and Machine Learning III—Thermoelectric Materials

Thermoelectric generators are essential for utilizing waste heat. In order to increase the conversion efficiency, the thermoelectric figure of merit should be increased. Since the figure of merit is inversely proportional to the thermal conductivity, many efforts have been devoted to decreasing the thermal conductivity, especially the lattice thermal conductivity (LTC). In order to evaluate the LTC with an accuracy comparable to experimental data, a method that goes far beyond ordinary density functional theory (DFT) calculations is required. Since one needs to treat multiple interactions among phonons, i.e. anharmonic lattice dynamics, the computational cost is many orders of magnitude higher than that of ordinary DFT calculations of primitive cells. Such expensive calculations are practically possible only for a small number of simple compounds. HTS of a large DFT database of LTC is not a realistic approach unless the exploration space is narrowly confined. Carrete and coworkers concentrated their efforts on searching for low-LTC materials within half-Heusler compounds [33]. They made HTS of a wide variety of half-Heusler compounds by examining thermodynamic stability via DFT results. The LTC was then estimated either by full first principles calculations or by a machine-learning algorithm for a selected small number of compounds. HTS of low LTC using a quasiharmonic Debye model has also been reported [34]. Efficient prediction of LTC through compressive sensing of lattice dynamics has been demonstrated as well [35]. Very recently, Togo et al. [36] reported a method to systematically obtain theoretical LTC through first principles anharmonic lattice dynamics calculations. The results were in quantitative agreement with available experimental data. Using these theoretical data, Seko et al. [37] performed virtual screening of a library containing 54,779 compounds by Bayesian optimization using the kriging method based on Gaussian process regression (see Sect. 9.3). First principles anharmonic lattice dynamics calculations were then performed for highly ranked compounds, which indeed showed very low LTC. The strategy is in the category given in Fig. 9.1b. This type of method should be useful in searching for materials for many different applications in which the chemistry of the materials needs to be optimized.

References 1. A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder et al., APL Mater. 1(1), 011002 (2013) 2. S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J.
References

1. A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder et al., APL Mater. 1(1), 011002 (2013)
2. S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L. Hart, S. Sanvito, M. Buongiorno-Nardelli et al., Comput. Mater. Sci. 58, 227 (2012)
3. J.E. Saal, S. Kirklin, M. Aykol, B. Meredig, C. Wolverton, JOM 65(11), 1501 (2013)
4. G. Ceder, MRS Bull. 35(9), 693 (2010)
5. M. Nishijima, T. Ootani, Y. Kamimura, T. Sueki, S. Esaki, S. Murai, K. Fujita, K. Tanaka, K. Ohira, Y. Koyama et al., Nat. Commun. 5 (2014)
6. A.K. Padhi, K. Nanjundaswamy, J. Goodenough, J. Electrochem. Soc. 144(4), 1188 (1997)
7. C. Delmas, M. Maccario, L. Croguennec, F. Le Cras, F. Weill, Nat. Mater. 7(8), 665 (2008)
8. D. Wang, X. Wu, Z. Wang, L. Chen, J. Power Sources 140(1), 125 (2005)
9. S. Curtarolo, G.L. Hart, M.B. Nardelli, N. Mingo, S. Sanvito, O. Levy, Nat. Mater. 12(3), 191 (2013)
10. F.A. Lindemann, Phys. Z. 11, 609 (1910)
11. A. Lawson, Phil. Mag. 81(3), 255 (2001)
12. A.C. Lawson, Phil. Mag. 89(22–24), 1757 (2009)
13. A. Granato, D. Joncich, V. Khonik, Appl. Phys. Lett. 97(17), 171911 (2010)
14. F. Guinea, J.H. Rose, J.R. Smith, J. Ferrante, Appl. Phys. Lett. 44, 53 (1984)
15. J.A. Van Vechten, Phys. Rev. Lett. 29, 769 (1972)
16. J.R. Chelikowsky, K.E. Anderson, J. Phys. Chem. Solids 48, 197 (1987)
17. Y. Saad, D. Gao, T. Ngo, S. Bobbitt, J.R. Chelikowsky, W. Andreoni, Phys. Rev. B 85, 104104 (2012)
18. A. Seko, T. Maekawa, K. Tsuda, I. Tanaka, Phys. Rev. B 89(5), 054303 (2014)
19. W.M. Haynes, CRC Handbook of Chemistry and Physics, 92nd edn. (CRC Press, Boca Raton, 2012)
20. W.N. Venables, B.D. Ripley, Modern Applied Statistics with S, 4th edn. (Springer, New York, 2002)
21. H. Akaike, in Second International Symposium on Information Theory (Akademiai Kiado, 1973), pp. 267–281
22. K. Lejaeghere, J. Jaeken, V. Van Speybroeck, S. Cottenier, Phys. Rev. B 89(1), 014304 (2014)
23. A. Robertson, A. West, A. Ritchie, Solid State Ionics 104(1), 1 (1997)
24. H.P. Hong, Mater. Res. Bull. 13(2), 117 (1978)
25. U. Alpen, M. Bell, W. Wichelhaus, K. Cheung, G. Dudley, Electrochim. Acta 23(12), 1395 (1978)
26. D. Mazumdar, D. Bose, M. Mukherjee, Solid State Ionics 14(2), 143 (1984)
27. P. Bruce, A. West, J. Solid State Chem. 44(3), 354 (1982)
28. P. Bruce, I. Abrahams, J. Solid State Chem. 95(1), 74 (1991)
29. K. Fujimura, A. Seko, Y. Koyama, A. Kuwabara, I. Kishida, K. Shitara, C.A.J. Fisher, H. Moriwake, I. Tanaka, Adv. Energy Mater. 3(8), 980 (2013)
30. S. Takai, K. Kurihara, K. Yoneda, S. Fujine, Y. Kawabata, T. Esaka, Solid State Ionics 171(1), 107 (2004)
31. C.C. Chang, C.J. Lin, ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)
32. B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap (CRC Press, Boca Raton, 1994)
33. J. Carrete, W. Li, N. Mingo, S. Wang, S. Curtarolo, Phys. Rev. X 4(1), 011019 (2014)
34. C. Toher, J.J. Plata, O. Levy, M. de Jong, M. Asta, M.B. Nardelli, S. Curtarolo, Phys. Rev. B 90(17), 174107 (2014)
35. F. Zhou, W. Nielson, Y. Xia, V. Ozoliņš, Phys. Rev. Lett. 113(18), 185501 (2014)
36. A. Togo, L. Chaput, I. Tanaka, Phys. Rev. B 91(9), 094306 (2015)
37. A. Seko, A. Togo, H. Hayashi, K. Tsuda, L. Chaput, I. Tanaka, arXiv preprint arXiv:1506.06439 (2015)

Chapter 10
Materials Informatics Using Ab initio Data: Application to MAX Phases
Wai-Yim Ching

Abstract We use a database constructed for a unique class of laminated intermetallic compounds, the MAX (Mn+1AXn) phases, to show how materials informatics can be used to predict the existence of new, hitherto unexplored phases. The focus of this Chapter is the correlation between seemingly disconnected descriptors and the importance of high-quality, computationally derived data.
An extension of this approach to other specific materials systems is discussed.

10.1 Introduction

In recent years, information gathering, analysis, and interpretation have emerged as an interdisciplinary research skill involving computer science, information science, and various other domains of science such as physics, chemistry, biology, medicine, materials engineering, design technology, education and social science [1]. In particular, materials informatics has developed into a flourishing field of study [2]. It aims to find more efficient ways of solving scientific problems related to all kinds of materials using large databases. This began with the initiation of the Materials Genome Initiative at the federal level, and it follows the same approach that the Human Genome Project took in the biomedical community decades ago, which resulted in the now mature discipline of bioinformatics. Creative software, genetic algorithms, and visualization tools have been developed to perform statistical analysis of data and to explore the data via data mining, aided by powerful high-performance computers [3]. There are many examples of highly successful applications for identifying and understanding structure-property correlations and for formulating design rules for better materials for specific applications. The information obtained from high-throughput materials informatics greatly reduces the time that it takes to go from frontier research to real applications.

There are many different ways of collecting large datasets and of building powerful databases for applications. Traditionally, the data for materials properties are collected from experimentally measured values published in the open literature: crystal structures, density, heats of formation, melting temperature, electrical conductivity, thermal conductivity, refractive index, bulk modulus, hardness, phase diagrams, and much more. These data cover all kinds of materials systems regardless of the source or the reliability of the data. Such databases are usually not vetted and are of varying quality. However, the argument is that, in a statistical sense, any invalid data that appears does so as noise and will not make much of a difference, as long as the database is large enough and the method of analysis is carefully designed. This modus operandi is more common in the biomedical arena when dealing with experimental or clinical trials with large amounts of data collected over a long period of time while looking for small effects [4]. In contrast to approaches that aim to reduce or avoid accurate atomistic simulations by relying purely on statistical predictions, another approach is based on the design of a specific, high-accuracy database using computational genomics. This difference simply reflects the emphasis on a different part of the spectrum of materials informatics, with different strategies for different systems, although both are data driven. More recently, large amounts of data have become obtainable through calculations using different computational methods and packages based on different theories.
The trend is usually to cover a focused group of materials categorized by structure, composition, functionality, or some specific materials property. Examples of such recent endeavors include piezoelectric perovskites [5]; battery materials comprising oxides, phosphates, borates, silicates, sulphates, etc. [3, 6]; half-Heusler semiconductors with low thermal conductivity [7]; binary compounds [8]; a polymer-physics materials genome [9]; and isotope substitution effects on phonons in graphene [10], to name just a few. It is also possible to combine the measured data and the calculated data into a bigger database.

In this Chapter, we present a specific case to illustrate the application of materials informatics using a large database of a unique class of materials, the MAX (Mn+1AXn) phases [11]. Our approach is to select a specific material system with well-defined structures and compositions for a focused study and then to apply state-of-the-art computational tools to systematically generate a large amount of data on their physical properties and to analyze the correlations among them. We then use this database to test the efficacy of existing data-mining and machine-learning algorithms. Simultaneously, this enables us to predict the existence of new MAX phases that have not yet been synthesized or studied in the laboratory but which may have outstanding properties. The identification of outliers that clearly do not follow the general trends helps to obtain deeper insights and to reveal the fundamental reasons behind such deviations. The predictive capability of the data mining is substantially controlled by the quality of the assigned descriptors. At the same time, the use of theory-based descriptors that demand large computational time is impractical, so the goal is to reduce such descriptors to combinations of less time-demanding ones. This approach is certainly different from other approaches that depend on collecting data from various sources, but it puts the data under better control, with increased reliability in the interpretations.

Another important issue, which is less frequently discussed in materials informatics, is the way that the data are presented. Many believe that materials informatics relies on massive data collections and their statistical analysis: everything is numerical and machine-based. On the other hand, we find that creative and insightful graphical representations of the data can allow one to grasp some of the most important points without laborious analysis. This will be amply demonstrated for the materials presented in this Chapter. The MAX phase is used as an example to illustrate various aspects of materials properties and the correlations between different descriptors. We have identified one descriptor in particular, the total bond order density (TBOD), that plays a dominant role. We also point out some other materials systems for which the application of the same approach and the use of the TBOD can be very fruitful.

10.2 MAX Phases: A Unique Class of Material

MAX phases, or Mn+1AXn, are transition metal ternary compounds with layered structures, where “M” is an early transition metal, “A” is a metalloid element, “X” is either carbon or nitrogen, and n is the layer index. MAX compounds have attracted a great deal of attention in recent decades due to many of their fascinating properties and their wide range of potential applications.
To date, only about 70 of these phases have been confirmed or synthesized [12]. The majority of these confirmed phases are 211 carbides with n = 1 or 2 and with M = Ti and Zr, and A = Al and Ga. It has also been demonstrated that the formation of composite phases and solid solutions in MAX phases between different “M” elements, “A” elements, and C and N is possible. Such possibilities have greatly extended the range of compositions beyond the ternary phases.

The MAX phases are layered hexagonal crystals (space group: P63/mmc, No. 194). Figure 10.1 displays the crystal structures of MAX for n = 1, 2, 3, 4, which are usually referred to as the 211, 312, 413, and 514 phases.

Fig. 10.1 Sketch of the crystal structures of four MAX phases M2AX, M3AX2, M4AX3, M5AX4 (i.e. with n = 1, 2, 3, 4)

An important feature is that in MAX compounds the “A” layer remains constant whereas the number of “M” and “X” layers increases with n. The “X” layers always lie between the “M” layers, forming blocks of MX layers connected by a single “A” layer, which can significantly affect the properties of a MAX phase. The physical properties of MAX phases vary over a wide range depending on “M”, “A”, “X” and n. MAX phases with n ≥ 5 are known to exist but are very rare. Most of the existing experimental work on the MAX phases has been on the 211 and 312 carbides.
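To make the layer stacking of Fig. 10.1 concrete, the short ASE sketch below builds a 211 MAX cell in space group P63/mmc. The lattice parameters and Wyckoff coordinates are approximate, literature-style values chosen only for illustration (they are not entries from the database discussed in this Chapter), and site assignments for specific compounds can differ.

```python
from ase.spacegroup import crystal

# A 211 MAX phase (Ti2AlC-like), space group P6_3/mmc (No. 194).
# Approximate illustrative values; z(M) ~ 0.086 sets the M-X layer spacing.
a, c = 3.04, 13.60
ti2alc = crystal(symbols=["Ti", "Al", "C"],
                 basis=[(1 / 3, 2 / 3, 0.086),   # M layer
                        (1 / 3, 2 / 3, 0.750),   # A layer
                        (0.0, 0.0, 0.0)],        # X layer
                 spacegroup=194,
                 cellpar=[a, a, c, 90, 90, 120])

print(ti2alc.get_chemical_formula())   # 8 atoms per cell = two formula units
```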
MAX phases behave like both ceramics and metals, with some very desirable properties such as machinability, thermal shock resistance, damage tolerance, resistance to fatigue, creep and oxidation, and elastic stiffness. They are also good thermal and electrical conductors [12]. More recently, MAX phases have been considered for high-temperature structural applications. Other applications include porous exhaust gas filters for automobiles, heat exchangers, heating elements, wear- and corrosion-protective surface coatings, electrodes, resistors, capacitors, rotating electrical contacts, nuclear applications, bio-compatible materials, cutting tools, nozzles, tools for die pressing, impact-resistant materials, projectile-proof armor, and much more. Some of these applications already have products on the market.

The physical properties of MAX phases have been investigated by many groups, both experimentally and computationally (see the extensive references in Aryal et al. [11]). We have been focusing mostly on the mechanical properties and electronic structure of MAX phases. The mechanical parameters such as the bulk modulus (K), shear modulus (G), Young’s modulus (E), and Poisson’s ratio (η) were derived from the calculated elastic coefficients under the VRH polycrystalline approximation. The G/K ratio, also known as the Pugh ratio, is a good indicator of the ductility or brittleness of an alloy; it is based on an analysis of pure metals but has also been quite effective when applied to metallic alloys [13]. The other physical properties investigated are the optical conductivities of 20 MAX phases [14] and the core-level excitations in some of the compounds [15]. More recently, we also estimated the high-temperature lattice thermal conductivities of MAX phases (see Sect. 10.4).

The electronic structure and bonding provide the basic information needed to understand the properties of any material. They have been well studied for MAX phases using density functional theory-based methods by many groups over the last 15 years. Most of the discussion tends to be on the band structure and the density of states (DOS) and partial density of states (PDOS). In MAX phases, the interatomic bonding is fairly complicated, involving metallic, partly covalent and partly ionic bonding that may extend beyond nearest neighbors. The structural complexity and variations in chemical species make characterization of the interatomic bonding in MAX phases particularly challenging. We advocate the use of the total bond order (TBO), the total bond order density (TBOD) and their partial components (PBOD) as useful metrics to delineate the observed physical properties. The TBO is the sum of all bond-order pairs in the crystal; normalizing it by the cell volume gives the TBOD. This will be illustrated further in the following sections. It is worth mentioning that, in addition to the canonical MAX phases, there are related materials derived from the MAX phases, such as solid solutions with different “M” or “A” elements and with mixtures of C and N. The MAX solid solutions can expand the list of such compounds enormously, and some of them may have optimized compositions that enhance their properties. This provides a great opportunity to apply the techniques of materials informatics to facilitate the processing of large amounts of data. Other related systems include the two-dimensional Mn+1Xn compounds called MXenes, obtained by extracting the “A” layer from MAX phases by exfoliation in solution, which offer a variety of new applications. Last but not least, there are quite a few layered intermetallic compounds with different types of stacking layers but involving more or less similar chemical species that have not been fully exploited.

10.3 Applications of Materials Informatics to MAX Phases

10.3.1 Initial Screening and Construction of the MAX Database

We first construct a database consisting of as many MAX phases as possible, in accordance with the general guideline suggested by Barsoum (see page 2, Fig. 1.2 of [12]). We chose 9 “M” elements (Sc, Ti, Zr, Hf, V, Nb, Ta, Cr, Mo), 11 “A” elements (Al, Ga, In, Tl, Si, Ge, Sn, Pb, P, As, S), X = C and N, and the layer index n = 1, 2, 3, 4. This gives a total of 792 possible MAX (Mn+1AXn) phases. We used the Vienna Ab initio Simulation Package (VASP) [16] to optimize the structure and obtain the elastic constants of each crystal. However, not all of these phases will be stable. We therefore screened them using two stability criteria. First, the Cauchy-Born elastic stability criteria for hexagonal crystals [17] eliminated 71 crystals. Next, we calculated the heat of formation (HoF) of the same 792 crystals from the relative stability of each MAX phase with respect to the formation energy of its elements in their most stable ground-state structures. As a result, 45 additional phases with positive HoF were eliminated, resulting in 665 viable MAX phases for a more focused study. The use of these two criteria, instead of a more rigorous but far more time-consuming approach based on a thermodynamic assessment of all potential competing phases in the M-A-X ternary phase diagrams, is a reasonable compromise. In principle, we can consider these two sets of criteria as two descriptors in the data mining approach. This represents a substantial savings in computational time. The calculated elastic and mechanical properties of the 665 MAX phases are tabulated as illustrated in Table 10.1 for 20 such phases.
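The two-step screening can be expressed as a simple filter. The sketch below uses the commonly quoted Born stability conditions for hexagonal crystals (C44 > 0, C11 > |C12|, (C11 + C12)C33 > 2C13²) together with the sign of the heat of formation; the elastic constants in the example are the Ti2AlC entries of Table 10.1, while the HoF value is a placeholder.

```python
def is_hexagonal_stable(c11, c12, c13, c33, c44):
    """Commonly quoted Born stability conditions for hexagonal crystals (GPa)."""
    return (c44 > 0
            and c11 > abs(c12)
            and (c11 + c12) * c33 > 2 * c13 ** 2)

def passes_screening(elastic, heat_of_formation):
    """Keep a candidate only if it is elastically stable and has HoF < 0."""
    return is_hexagonal_stable(*elastic) and heat_of_formation < 0.0

# Example: Ti2AlC constants from Table 10.1; the HoF value is a placeholder.
print(passes_screening((301.9, 68.0, 63.0, 267.9, 105.1), -0.5))   # True
```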
The electronic structure and bonding of the MAX phases are calculated using the orthogonalized linear combination of atomic orbitals (OLCAO) method [18]. This is an extremely efficient and well-tested method that uses atomic orbitals in the basis expansion. The main descriptors for the electronic properties are summarized in tabular form as illustrated in Table 10.2 for 14 such crystals. Both sets of data for the 665 MAX phases are publicly available [11].

10.3.2 Representative Results on Mechanical Properties and Electronic Structure of MAX

We selectively present some of the calculated results from the database for the 665 MAX phases. Figure 10.2 shows a scatter plot of the shear modulus G versus the bulk modulus K for all 665 screened MAX phases. To provide a broader perspective, we used different colors for the index n, and closed or open symbols for carbides and nitrides, respectively. We also include similar data for some metallic compounds and selected binary MX compounds [19]. We note that the MAX phases cover a wide region of bulk and shear moduli, overlapping with those of the common metals and alloys. The dashed lines show the G/K ratios for these data, which range from a minimum of 0.12 to a maximum of 0.8. The maximum G/K ratio is close to those of the MN binary compounds, and the low G/K values are mostly from MAX nitrides. Figure 10.2 illustrates a conventional graphical presentation in materials informatics to provide an overview of the data from a large database.

The G/K values for all MAX phases, shown as a scatter plot in Fig. 10.2, are presented in Fig. 10.3 in a different way, in the form of an innovative map resembling the Periodic Table. For this plot we used the original 792 hypothetical MAX phase data. This enables us to clearly see the locations of those phases that have been screened out relative to those that have not. Here the “M” elements are plotted on the Y-axis and the “A” elements are along the X-axis. The color of each square cell represents the G/K value of that particular MAX phase, along with other information such as whether the phase has been synthesized or not. The phases that have been eliminated by the Cauchy-Born criterion or the HoF criterion are marked with a + or a ×, respectively. The experimentally confirmed phases are marked with a white star. As can be seen, none of the experimentally confirmed phases are among the ones judged to be unstable and screened out. There are many boxes of different colors without the white star, suggesting the existence of a myriad of possible MAX phases not yet explored. While the G/K ratio of MAX phases can vary over a wide range, as indicated by the variations in color of the different squares in Fig. 10.3, we can delineate the boundaries of the materials properties of the MAX phases within which optimized functionalities can be further explored. Similar maps for other mechanical properties of the MAX phases can be found in [11].
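A map of the kind shown in Fig. 10.3 is straightforward to generate once the database is in hand. The matplotlib sketch below lays out a 9 × 11 grid of "M" versus "A" elements and colors each cell by its G/K value; the values here are random placeholders, not the calculated data.

```python
import numpy as np
import matplotlib.pyplot as plt

M_elements = ["Sc", "Ti", "Zr", "Hf", "V", "Nb", "Ta", "Cr", "Mo"]
A_elements = ["Al", "Ga", "In", "Tl", "Si", "Ge", "Sn", "Pb", "P", "As", "S"]

# Placeholder G/K values on the 9 x 11 grid; in practice these come from the database.
rng = np.random.default_rng(2)
gk = rng.uniform(0.3, 0.8, size=(len(M_elements), len(A_elements)))

fig, ax = plt.subplots(figsize=(7, 4))
im = ax.imshow(gk, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(A_elements)))
ax.set_xticklabels(A_elements)
ax.set_yticks(range(len(M_elements)))
ax.set_yticklabels(M_elements)
fig.colorbar(im, ax=ax, label="G/K")
ax.set_title("211 carbides: G/K map (placeholder data)")
fig.savefig("gk_map.png", dpi=200)
```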
Table 10.1 Samples of descriptors for mechanical properties in the database (elastic constants and moduli in GPa)

Crystal      C11    C12    C13    C33    C44    C66    K      G      E      η      G/K
Ti3AlC2      355.8   81.4   75.3  293.4  120.3  137.2  162.5  126.7  301.7  0.191  0.78
Ti3SiC2      369.6   96.2  107.6  358.3  155.0  136.7  191.1  141.3  340.0  0.204  0.74
Ti3GeC2      362.0   97.2   97.7  332.0  137.3  132.4  182.2  132.2  319.3  0.208  0.73
Ti2AlC       301.9   68.0   63.0  267.9  105.1  117.0  139.7  110.5  262.3  0.187  0.79
Ti2GaC       300.8   79.2   63.8  246.5   92.4  110.8  139.3  101.4  244.9  0.207  0.73
Ti2InC       284.4   69.3   55.2  235.5   83.9  107.5  128.6   96.0  230.5  0.201  0.75
Ti2SiC       312.9   82.1  110.4  329.2  149.6  115.4  173.0  124.9  302.0  0.209  0.72
Ti2GeC       296.6   85.7   96.8  297.1  121.5  105.5  161.0  110.0  268.8  0.222  0.68
Ti2SnC       262.6   88.6   73.1  255.2   96.8   87.0  138.8   92.4  226.8  0.228  0.67
Ti2PC        256.8  144.8  155.0  339.5  166.3   56.0  191.8   93.1  240.4  0.291  0.49
Ti2AsC       212.9  180.4  123.7  289.5  146.3   16.2  150.7   57.2  152.3  0.332  0.38
Ti2SC        339.8  101.4  109.7  361.9  159.5  119.2  186.8  134.4  325.2  0.210  0.72
Ti2AlN       312.9   73.0   95.5  290.7  126.1  120.0  160.5  117.4  283.1  0.206  0.73
V2AlC        334.4   71.5  106.0  320.8  149.8  131.5  172.9  132.1  315.9  0.196  0.76
Nb2AlC       316.6   86.3  117.0  288.6  137.6  115.2  173.6  116.4  285.5  0.226  0.67
Cr2AlC       366.3   85.8  111.3  356.9  142.9  140.2  189.6  137.0  331.2  0.209  0.72
Ta2AlC       344.5  112.2  137.1  327.9  152.3  116.1  198.8  124.1  308.1  0.242  0.62
α-Ta3AlC2    453.6  130.5  135.6  388.4  175.0  161.5  232.8  161.1  392.8  0.219  0.69
α-Ta4AlC3    459.2  149.1  148.7  383.1  170.5  155.0  243.0  155.3  384.1  0.237  0.64
Ta5AlC4      481.5  149.6  158.1  423.6  188.8  165.9  257.2  169.1  416.0  0.231  0.66

Table 10.2 Samples of descriptors from electronic structure in the database (Q* and bond orders in electrons; N(EF) in states/eV-cell)

MAX      Q*(M)   Q*(X)   Q*(A)   TBO     BO(M–X)  BO(M–M)  BO(M–A)  BO(A–A)  N(EF)
Ti2AlC   −0.330  −0.043  0.703   23.510  10.258   4.512    7.231    1.508    11.052
Ti2GaC   −0.485   0.269  0.701   22.680  10.289   4.060    6.986    1.340    10.572
Ti2InC   −0.424   0.148  0.700   22.750  10.238   4.396    6.482    1.636     9.260
Ti2SiC   −0.393   0.097  0.688   22.820  10.344   3.583    8.153    0.742    12.921
Ti2GeC   −0.509   0.324  0.694   21.750  10.337   3.541    7.111    0.758    14.720
Ti2SnC   −0.381   0.069  0.693   22.320  10.294   3.926    7.110    0.993    15.084
Ti2PC    −0.454   0.210  0.699   22.740  10.366   2.802    9.571    0.000    21.762
Ti2AsC   −0.505   0.316  0.695   21.360  10.382   2.893    8.086    0.000    19.697
Ti2SC    −0.447   0.189  0.705   21.340  10.380   2.944    8.018    0.000     7.301
Ti2AlN   −0.295  −0.087  0.679   22.150   8.702   4.646    7.217    1.585    15.502
V2AlC    −0.277  −0.101  0.655   22.820  10.017   4.192    6.905    1.704    21.663
Nb2AlC   −0.493   0.245  0.741   15.410   7.319   1.253    5.354    1.399    13.338
Cr2AlC   −0.098  −0.324  0.521   21.250   9.559   2.837    7.080    1.769    24.384
Ta2AlC   −0.324  −0.044  0.692   24.810  10.130   5.724    7.561    1.397    11.126

Fig. 10.2 Shear modulus versus bulk modulus for the 665 screened MAX phases in the database. Solid circles and open circles are for carbides and nitrides respectively. A different color is used for each n in Mn+1AXn. Also shown are the locations of other metals and binary MC and MN compounds

Fig. 10.3 G/K ratio map for 792 MAX phases according to “M” (Y-axis) and “A” (X-axis) elements. Top panel for carbides and lower panel for nitrides. The color in each cell represents the calculated G/K value as indicated in the color bar. A star in a box indicates that the phase has been confirmed. “+” means the phase is eliminated for elastic instability and “×” means the phase is screened out for thermodynamic instability or positive HoF
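Once tabulated as above, the database can be queried directly. The sketch below assumes a hypothetical CSV export of Table 10.1 (the file name and column labels are illustrative, not part of the published database) and selects stiff but relatively ductile candidates by combining the bulk modulus with the Pugh ratio.

```python
import pandas as pd

# Hypothetical export of Table 10.1; column names follow the table header.
df = pd.read_csv("max_mechanical.csv")      # Crystal, C11, ..., K, G, E, eta, G/K

# Rank candidate phases by a stiffness/ductility trade-off:
# high bulk modulus K but a low Pugh ratio G/K (more "metal-like").
picks = (df[(df["K"] > 180.0) & (df["G/K"] < 0.70)]
         .sort_values("K", ascending=False)
         .head(5))
print(picks[["Crystal", "K", "G", "G/K"]])
```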
We now present some of the results related to the electronic structure. The density of states (DOS) at the Fermi level (EF), or N(EF), is one of the important electronic parameters for metallic systems. In MAX phases N(EF) is a strong function of composition. Some values are close to zero, whereas others are quite large, depending on whether EF is located in the vicinity of the 3d or 4d orbitals of “M”.

Fig. 10.4 Plot of the DOS at the Fermi level, N(EF), against the total number of valence electrons per unit volume for the 665 MAX phases in the database. Solid symbols for carbides and open symbols for nitrides. Note the outlying nature of the data for the M = Sc, X = C MAX phases

The calculated N(EF) per unit cell is found to be reasonably correlated with the total valence electron number per unit volume, Nval (Å−3), as shown in Fig. 10.4. The total valence electron number is the sum of the formal valence electrons of the individual atoms in the crystal. In general, larger Nval corresponds to larger N(EF), as expected. Also, as n increases, Nval increases and the slope of the data distribution decreases. Traditionally, it has been speculated, but not rigorously proved, that the existence of a local minimum (or pseudogap) at the Fermi level in a metal or alloy signifies its structural stability [20]. While all the DOS for the MAX phases are available, it is not practical to present the DOS figures for all the phases. However, the relative magnitude of N(EF) and its decomposition into different atomic components for each phase is a valid descriptor for the electronic structure. Figure 10.4 shows that nitrides have larger N(EF) values than the carbides. The Sc-based carbides, however, are a notable exception: they have significantly higher N(EF) than their nitride counterparts. The Sc-based carbides (but not the nitrides) also show a marked deviation from the general trend of N(EF) versus Nval with increasing n. The approximate positive linear correlation between the two properties becomes more pronounced with increasing n, which can be attributed to the increasing number of M atoms as n increases. One can relate this linear trend to a similar behavior observed in the binary mono-carbides and mono-nitrides [21]. However, a real distinction with respect to the MAX phases is the profound role of “A”, which is not present in the binary mono-carbides/nitrides. In general, the presence of the “A” element appears to significantly lessen the degree of linear correlation, i.e., to increase the scatter of the data (see Fig. 10.4 for the 211 MAX phases). It is only at higher n values, where the “A” content is reduced and consequently the bonding characteristics are less influenced by “A”, that a stronger correlation between the two properties emerges, mimicking that of the binary mono-carbides/nitrides.
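The kind of grouped correlation analysis described above is easy to script. The sketch below computes a Pearson correlation of N(EF) against the valence-electron density separately for each layer index n; the numbers are synthetic placeholders that merely mimic the qualitative trend of Fig. 10.4 (a tighter correlation at larger n), not the database values.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)

# Synthetic stand-ins for the database columns: valence-electron density (per A^3)
# and N(EF) per cell, grouped by layer index n; the noise shrinks with n.
for n, noise in [(1, 6.0), (2, 4.0), (3, 3.0), (4, 2.0)]:
    n_val = rng.uniform(0.15, 0.35, 120)
    n_ef = 60.0 * n_val + noise * rng.normal(size=120)
    r, _ = pearsonr(n_val, n_ef)
    print(f"n = {n}: Pearson r = {r:.2f}")
```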
10.3.3 Classification of Descriptors from the Database and Correlation Among Them

The MAX database consists of the calculated quantities for all MAX phases, as illustrated in Tables 10.1 and 10.2 shown earlier. These numerical quantities can be classified into descriptors, or controlling factors, to be used in data mining algorithms for materials informatics and to explore their correlations. A simple flow chart is shown in Fig. 10.5.

Fig. 10.5 Flow chart of the approach used for data mining in materials informatics for MAX phases

For the MAX phases, we classify the descriptors into three categories based on their level of complexity and/or the computational time required to obtain the data: (1) Basic chemistry descriptors: (i) the number of valence electrons, (ii) the atomic number Z of the elements, and (iii) the volume of the unit cell. (2) Descriptors from the electronic structure and bonding: (i) the total bond order density (TBOD), normalized to the crystal volume, (ii) the total bond order of the different atomic pairs, i.e., the M–M, M–A, M–X, A–X and X–X pairs, and (iii) the density of states at the Fermi level, N(EF). (3) Descriptors for the elastic constants Cij and the bulk mechanical parameters K, G, E, η and the G/K ratio.

We then seek to establish the correlations between these interrelated descriptors; more specifically, correlations between elastic and mechanical descriptors, correlations between electronic descriptors, and correlations between mechanical and electronic descriptors [11]. We have been able to demonstrate that correlations of over 90 % can be achieved using a simple linear regression method, implying that the mechanical-property descriptors can be adequately represented by the other two types of descriptors. This is illustrated in the following section.

10.3.4 Verification of the Efficacy of the Materials Informatics Tools

The success in linking the electronic structure factors to complex bulk elastic properties has enabled us to advance the utility of the data mining approach for expanding the materials database for MAX phases. We have also extended our analysis to the components of the second-order elastic constants of the MAX phases. Figure 10.6 shows an example of such an analysis as applied to the 211 MAX carbides, a large subset of our database. The three main elastic constants in the database, C11, C33 and C13, calculated using an ab initio method, are compared with the same values as predicted by a combination of electronic structure factors and valence electron information, with reasonably high correlation coefficients of 0.83, 0.93 and 0.95, respectively. This suggests that such a method is robust enough to probe orientation-dependent second-order elastic constants with high accuracy.

Fig. 10.6 Comparison of a C11, b C33, and c C13 of 211 MAX carbides obtained from ab initio calculations (x-axis) and those from the data mining prediction (y-axis)

The results of the linear regression of C11, C33, and C13 with the chemical and electronic descriptors shown in Fig. 10.6 are as follows:

C11 = 0.6235 × ZM + 8.1344 × (GN)M − 0.8737 × ZX + 32.0051 × QA − 144.2461 × QX + 10.9223 × (BO)M−X + 9.2461 × (BO)M−A − 7.8791 × (BO)M−M + 11.8688 × (BO)A−A + 470.6772 × (BO)A−X − 2.7405 × N(EF) + 243.5997. Correlation coefficient = 0.83.

C33 = 0.7155 × ZM + 19.8291 × (GN)M − 1.085 × ZX + 18.9992 × (GN)X + 18.4127 × (BO)M−A − 8.9407 × (BO)M−M + 16.2802 × (BO)A−A − 1.0634 × N(EF) + 38.8039. Correlation coefficient = 0.9264.
C13 = 36.994 × (GN)M − 0.402 × ZX − 7.3952 × (GN)X + 67.8729 × QA − 12.8243 × (BO)M−X + 15.916 × (BO)M−A + 7.6037 × (BO)M−M − 33.8152 × (BO)A−A − 11.1306. Correlation coefficient = 0.9541.

Here ZM is the atomic number of M, (GN)M the group number of M in the Periodic Table, ZX the atomic number of X, (GN)X the group number of X, (BO)M−A the total bond order of the M–A pairs, (BO)M−X the total bond order of the M–X pairs, (BO)M−M the total bond order of the M–M pairs, (BO)A−X the total bond order of the A–X pairs, (BO)A−A the total bond order of the A–A pairs, and N(EF) the DOS at the Fermi level. QA and QX are the effective charges on A and X, respectively (see [11] for more details).

Figures 10.7 and 10.8 show a different way to test the efficacy of the data mining algorithm. Here, we use 50 % of the data, randomly chosen from the database, as a training set and use the existing WEKA software [22] to predict the properties of the other 50 % of the MAX phases, comparing the predicted values with the ab initio data in the database. Figure 10.7 (top panel) shows the comparison between K obtained from ab initio calculations and K obtained from the formulas derived from the data mining algorithm for the other 50 %, for the 211, 312, 413 and 514 MAX phases. An excellent correlation, with a correlation coefficient of over 90 % for each type of MAX phase, is obtained. The lower panel of Fig. 10.7 also shows pie charts of the relative contribution from each type of electronic structure descriptor used to predict K.

Fig. 10.7 Top panel: use of 50 % of the MAX data for the bulk modulus K as a training set to predict the other 50 %, compared with the ab initio data for the 665 MAX phases. Lower panel: relative contribution from different electronic structure descriptors

Fig. 10.8 Same as Fig. 10.7 but for the shear modulus

The four most important factors are the total bond order density (TBOD), the BOD of the M–A pairs (M–A BOD), the BOD of the M–X pairs (M–X BOD) and the charge transfer for the X elements. The TBOD clearly stands out as the most important factor in determining K for all MAX phases. Figure 10.8 shows a similar prediction for the G/K ratio using the same procedure. Although the correlation is less impressive than for K, the prediction from data mining is still reasonably good, with correlation coefficients around 80 % or higher. The prediction for Poisson’s ratio is at the same level as that for the G/K ratio. In both cases, they are strongly affected by the TBOD, although it is a negative correlation instead of the positive correlation exhibited by K. The linear correlation is less definitive in this case, which is probably due to the fact that the G/K ratio and Poisson’s ratio are more influenced by the nature of the “A” element. This is an effect that apparently is not fully represented by the BO parameters. Nevertheless, a reasonably good estimate of these two properties can be established solely from a linear combination of electronic structure factors. Furthermore, the TBOD emerges as a significant descriptor that controls the mechanical properties. This data mining approach also demonstrates that a simple correlation can be used to link elastic parameters such as Poisson’s ratio or the Pugh ratio to a series of electronic structure indicators. The use of only 50 % of the data as a training set gives credence to the particular machine-learning software and the philosophy behind it.
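For reference, the quoted C11 regression can be evaluated directly as a function of the descriptors. The coefficients below are copied from the expression printed above; the input values in the example are placeholders only, since the exact descriptor conventions (for instance the sign convention of the effective charges) should be taken from [11].

```python
def c11_from_descriptors(Z_M, GN_M, Z_X, Q_A, Q_X,
                         BO_MX, BO_MA, BO_MM, BO_AA, BO_AX, N_EF):
    """Evaluate the linear C11 model quoted in the text (coefficients from [11])."""
    return (0.6235 * Z_M + 8.1344 * GN_M - 0.8737 * Z_X
            + 32.0051 * Q_A - 144.2461 * Q_X
            + 10.9223 * BO_MX + 9.2461 * BO_MA - 7.8791 * BO_MM
            + 11.8688 * BO_AA + 470.6772 * BO_AX
            - 2.7405 * N_EF + 243.5997)

# Placeholder descriptor values for a hypothetical 211 carbide; real inputs
# (and their conventions) should be read from the database tables.
print(c11_from_descriptors(Z_M=22, GN_M=4, Z_X=6, Q_A=0.70, Q_X=-0.04,
                           BO_MX=10.3, BO_MA=7.2, BO_MM=4.5, BO_AA=1.5,
                           BO_AX=0.0, N_EF=11.1))
```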
10.4 Further Applications of MAX Data

Since the generation of the MAX database less than a year ago, additional results have been obtained that use these data to estimate the lattice thermal conductivity of MAX phases at high temperature and to calculate the universal elastic anisotropy based on a recently developed theory. These are prime examples of the utility of a large database for easily obtaining new information without lengthy calculations, consistent with the spirit of materials informatics. They are briefly described below.

10.4.1 Lattice Thermal Conductivity at High Temperature

A systematic calculation of the lattice thermal conductivity κph and the minimum thermal conductivity κmin for the 211, 312, and 413 MAX phases, using Slack’s equation and the Clarke formula respectively, has been carried out [23]. The parameters used in these simplified calculations are extracted from the elastic coefficients Cij, the bulk mechanical properties, and the equilibrium volume of all stable MAX phase compounds in the database. Essentially, the calculation of κph follows the equation derived by Slack [24],

κph = A · M̄ · θD³ · δ / (γ² · n^(2/3) · T)   (10.1)

where M̄ is the average atomic weight (in units of kg/mol), δ is the cube root of the average volume per atom in the primitive cell (in units of m), T is the absolute temperature, n is the number of atoms per unit cell, γ is the Grüneisen constant derived from Poisson’s ratio (ν), and A is a coefficient (in units of W mol/kg/m²/K³) that depends on γ, as determined by Julian [25]. These parameters can be obtained from the database for the MAX phases.
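Equation (10.1) translates directly into a small function. The sketch below takes the coefficient A as an input rather than evaluating Julian's closed form for A(γ), and the example numbers are illustrative only; whichever unit convention is adopted (the SI units quoted above, or the frequently used amu/Å convention in which A is of order 10⁻⁶), it must be applied consistently.

```python
def kappa_slack(A, M_bar, theta_D, delta, gamma, n, T):
    """
    Lattice thermal conductivity from Slack's equation (10.1).

    A       : coefficient depending on gamma (cf. Julian [25]); passed in here
    M_bar   : average atomic mass
    theta_D : Debye temperature (K)
    delta   : cube root of the volume per atom
    gamma   : Grueneisen parameter
    n       : number of atoms per cell
    T       : absolute temperature (K)
    Units must be chosen consistently with A.
    """
    return A * M_bar * theta_D ** 3 * delta / (gamma ** 2 * n ** (2.0 / 3.0) * T)

# Illustrative numbers only (not database entries), in the commonly used
# convention M_bar in amu, delta in Angstrom, A ~ 3e-6 -> kappa in W/(m K).
print(kappa_slack(A=3.0e-6, M_bar=33.8, theta_D=700.0,
                  delta=2.4, gamma=1.5, n=8, T=1300.0))
```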
Figure 10.9 shows the calculated κph for the 211, 312, and 413 phases of the MAX carbides and nitrides, presented in two separate ways in order to trace the trends associated with variations in the atomic numbers of the “M” and “A” elements. The top panel in Fig. 10.9 is for the “M”-based plots and the lower panel is for the “A”-based plots. The x-axis lists the 9 “M” elements and the 11 “A” elements for the top and bottom panels, respectively.

Fig. 10.9 Scatter plots of the calculated phonon thermal conductivity (κph) at 1300 K of MAX phases: a 211 in “M” trend; b 211 in “A” trend; c 312 in “M” trend; d 312 in “A” trend; e 413 in “M” trend; f 413 in “A” trend. The trends for the “M” elements (Sc, Ti, V, Cr, Zr, Nb, Mo, Hf and Ta) and the “A” elements (Al, Si, P, S, Ga, Ge, As, In, Sn, Tl and Pb) are along the x-axis in the upper and lower panels, respectively. Each differently colored subpanel contains 22 and 18 MAX phases for the top and bottom respectively

To grasp the variations and overall trends in κph more easily, we employ the following strategy: (1) The data for the carbides (solid circles) and nitrides (open circles) are plotted on the same figure. (2) The horizontal x-axis is arranged in order of increasing atomic number Z: (Sc, Ti, V, Cr, Zr, Nb, Mo, Hf, Ta) for “M” and (Al, Si, P, S, Ga, Ge, As, In, Sn, Tl, Pb) for “A”. Further, each column is separated into vertical blocks of differently shaded colors. Each colored area encloses 22 MAX phases with different “A” in the upper panel and 18 MAX phases with different “M” in the lower panel. (3) The ordering of both “A” and “M” in each block is in order of increasing Z. The two panels contain the same number of data points, but they are plotted in different ways to facilitate the observation of trends. (4) The vertical scale (0 to 20 W m−1 K−1) for κph is kept the same for easy comparison.

Despite the overwhelming amount of data, the following trends and observations can be easily discerned with this creative display: (a) The five MAX phases with the highest κph at 1300 K are all nitrides. (b) The Sc-based 211 MAX phases (first panel on the left in the upper panel) are more widely dispersed than those of the 312 and 413 phases, and their κph for the carbides are much smaller than those of the nitrides. For the other “M” panels, nitrides have a lower κph than carbides, except in the 211 MAX phases where they are mixed. There is also an obvious trend of reduced κph as Z increases. This observation is much more pronounced for the variation of “A” at a given “M” than for the variation of “M” at a given “A”. Thus, variation in “A” is the major controlling factor for κph. They also show more distinctive separations between carbides and nitrides. The trend of reduced κph as the Z value of “M” increases is much less pronounced. This is contrary to the notion that “M” should be more influential than “A”, since there are more “M” elements in a given MAX phase. (c) In the panels for different “A”, the data for carbides and nitrides are rather scattered. Within each panel for a fixed “A”, the trend of decreasing κph with increasing Z of “M” is much less pronounced, in striking contrast with the data for the variation of “A” at a fixed “M”. (d) The data for κph are more widely distributed and have larger values in the 211 phases. They generally scale inversely with the layer index n. The calculated data at 1300 K shown in Fig. 10.9 are in reasonable agreement with the only available experimental data, on eight MAX phases (Ti2AlC, Nb4AlC3, Ta4AlC3, Nb2AlC, Nb2SnC, Ta2AlC, Cr2AlC, and Ti3SiC2) [12].

We used the same data to estimate the intrinsic minimum thermal conductivity κmin of MAX phases, which is the lowest value of the thermal conductivity of a perfect crystal at high temperatures above the Debye temperature, when phonons are completely uncoupled and energy is transferred only between neighboring atoms [24]. According to the simple theory advanced by Clarke, κmin is given by [26]:

κmin = kB · vm / Λmin² = kB · vm · (M / (n · ρ · NA))^(−2/3)   (10.2)

where Λmin, NA, and ρ are the phonon mean free path, Avogadro’s constant, and the crystal density, respectively. The calculated κmin values are consistent with those obtained using (10.1) at T = 2000 K.

10.4.2 Universal Elastic Anisotropy in MAX Phases

Recently, Ranganathan and Ostoja-Starzewski [27] developed a new theory of the universal elastic anisotropy AU for all types of crystals. This gives a single parameter to quantify the crystal anisotropy, similar to the Zener anisotropy index [28], which is applicable only to cubic crystals. AU is given by:

AU = 5(GV/GR) + (KV/KR) − 6.   (10.3)

Here K and G are the bulk and shear moduli, and the superscripts V and R stand for the Voigt and Reuss approximations [29, 30], respectively. The Voigt (Reuss) approximation assumes a uniform strain (stress) distribution throughout the structure. These two assumptions give the upper and lower limits of the bulk mechanical properties, and the average of the two limits is the Hill approximation [31], which is usually the value compared with measured data. AU must be positive; it is usually much less than 2.0 and seldom exceeds 4.0 [27]. AU = 0 implies zero anisotropy.
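The quantities entering (10.3) can be computed from the full 6 × 6 stiffness matrix using the standard Voigt and Reuss averages (textbook expressions, not specific to [27]). The sketch below does this for a hexagonal crystal using the Ti2AlC constants of Table 10.1, where C66 = (C11 − C12)/2.

```python
import numpy as np

def vrh_and_anisotropy(C):
    """Voigt/Reuss/Hill averages and the universal anisotropy index of (10.3)
    from a 6x6 stiffness matrix C in Voigt notation (GPa)."""
    S = np.linalg.inv(C)                                   # compliance matrix
    K_V = (C[0, 0] + C[1, 1] + C[2, 2]
           + 2 * (C[0, 1] + C[0, 2] + C[1, 2])) / 9.0
    G_V = (C[0, 0] + C[1, 1] + C[2, 2] - (C[0, 1] + C[0, 2] + C[1, 2])
           + 3 * (C[3, 3] + C[4, 4] + C[5, 5])) / 15.0
    K_R = 1.0 / (S[0, 0] + S[1, 1] + S[2, 2]
                 + 2 * (S[0, 1] + S[0, 2] + S[1, 2]))
    G_R = 15.0 / (4 * (S[0, 0] + S[1, 1] + S[2, 2])
                  - 4 * (S[0, 1] + S[0, 2] + S[1, 2])
                  + 3 * (S[3, 3] + S[4, 4] + S[5, 5]))
    A_U = 5 * G_V / G_R + K_V / K_R - 6.0
    return (K_V + K_R) / 2, (G_V + G_R) / 2, A_U

# Hexagonal Ti2AlC, constants from Table 10.1 (GPa).
c11, c12, c13, c33, c44, c66 = 301.9, 68.0, 63.0, 267.9, 105.1, 117.0
C = np.array([[c11, c12, c13, 0, 0, 0],
              [c12, c11, c13, 0, 0, 0],
              [c13, c13, c33, 0, 0, 0],
              [0, 0, 0, c44, 0, 0],
              [0, 0, 0, 0, c44, 0],
              [0, 0, 0, 0, 0, c66]])
K_H, G_H, A_U = vrh_and_anisotropy(C)
print(f"K_Hill = {K_H:.1f} GPa, G_Hill = {G_H:.1f} GPa, A_U = {A_U:.3f}")
```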
The large database of elastic coefficients for the MAX phases is ideally suited to evaluating AU, testing the new theory, and ascertaining its efficacy when applied to a single class of ternary hexagonal compounds, the MAX phases. We have recently calculated the universal elastic anisotropy of the 665 MAX phases according to (10.3) [32]. Figure 10.10 shows the scatter plot of AU versus the total bond order density (TBOD), which we advocate as the single most important metric for the electronic part of the properties. Here, AU is supposed to be a single parameter that describes the anisotropy of the mechanical properties of a given crystal. As can be seen, the majority of the MAX phases have a low AU of less than 0.5, although some phases have AU greater than 1.0.

Fig. 10.10 Universal elastic anisotropy (AU) versus total bond order density (TBOD) for the 665 MAX carbides and nitrides in the database. There is evidence of a bimodal distribution, with a minimum in AU corresponding to a TBOD near 0.035

Fig. 10.11 Universal elastic anisotropy (AU) maps for 792 MAX phases according to “M” (y-axis) and “A” (x-axis) elements. The description is the same as for Fig. 10.3 for the G/K map

There is no apparent difference between the MAX carbides and the MAX nitrides in the AU distribution, but there is evidence of a bimodal distribution of the data, i.e., there is a broad minimum in the middle range of the TBOD. The implication of this interesting result has yet to be explored. Figure 10.11 shows the AU map for all the MAX phases according to the “M” (Y-axis) and “A” (X-axis) elements, similar to Fig. 10.3. The color in each square cell represents the calculated AU value as indicated in the color bar. Again, a star in a box indicates that the phase has been confirmed. The symbol “+” stands for elastic instability and “×” indicates that the phase is screened out for positive heat of formation. This map clearly shows at a glance that most of the MAX phases have low AU, close to 0.1–0.5, and that all of the confirmed phases have low AU. The few phases with high AU are easily identified. Once more, we used this innovative map of AU to easily and clearly show the general trends in the universal elastic anisotropy according to this new theory and to identify some isolated MAX phases as outliers.

10.5 Extension to Other Materials Systems

10.5.1 MAX-Related Systems, MXenes, MAX Solid Solutions, and Similar Layered Structures

A newly discovered class of 2D materials, labeled MXenes, presents a unique opportunity for the development of exceptional functional properties with diverse applications [33, 34]. MXenes are anisotropic laminated transition metal compounds derived from predecessor MAX phases. Very recently, this family of 2D materials was derived from MAX phases by extracting the A element from MAX. These so-called MXenes (Mn+1XnTx) have their surfaces terminated by Tx (O, OH, or F).
Only a few MXenes out of a large number of possibilities have been reported. Some of these MXenes have high electrical conductivities and hydrophilic surfaces, making them ideal for applications as electrodes in Li-ion batteries or Li-ion capacitors. The demonstration of spontaneous intercalation of cations of different sizes and charges (Li+, Na+, K+, Mg2+, Al3+, (NH4)+, N2H4) between 2D Ti3C2Tx surfaces in various salt solutions presents a great opportunity for material tunability. Such a wide range of choices of cation intercalation and the rich chemistry of the functionalized surfaces offer truly unique functional applications beyond the limits of even the most advanced 2D structures. The underlying factors governing the design and synthesis of MXenes, however, are largely unknown at the atomistic scale. Thus, MXenes are an ideal system in which to apply materials informatics techniques similar to the ones we used for the MAX phases, by creating a database for Mn+1Cn, Mn+1Nn, and Mn+1ATx.

Another area to consider is extending MAX phases to their solid solutions. This offers a much larger variation in the composition range and in the fine tuning of the desired properties. MAX solid solutions can be formed by partial substitutions of the “M” or “A” elements, or between C and N. There have not been many accurate calculations on the MAX solid solutions because such calculations require significant computational resources. Solid solutions are no longer crystalline phases with well-defined long-range order; they are essentially a class of disordered solids with random site substitutions. A large number of supercells must be used to properly describe the structure and property variations with composition x. Thus MAX solid solutions offer another great opportunity to apply the methods of materials informatics for extensive studies. The database for MAX phases and the various applications of it in data mining schemes demonstrated above can facilitate the effort to investigate MAX solid solutions. We have recently carried out a detailed investigation on one of the most important MAX solid solution phases, Ti2Al(CxN1−x) [35]. For this solid solution, the mechanical properties vary continuously with x between Ti2AlN and Ti2AlC, with no evidence of improved mechanical parameters beyond the end members. They do have subtle variations for x > 0.5 which are supported by some existing experimental observations. This does not rule out the possibility of strengthening MAX phases in other solid solutions via substitutions of the “M” or “A” elements.

In addition to MAX solid solutions, another route to significantly enlarge the database of MAX or MAX-like compounds is to consider quaternary alloys, adding another metal element to create a new crystal structure with a different crystal symmetry. A recent example of achieving this goal is the theoretical suggestion of a new compound, (Cr2Hf)2Al3C3 [36]. The crystal structure and elastic properties of this MAX-like compound were studied using computational methods similar to those used for the MAX phases. Unlike MAX phases, which have hexagonal symmetry (space group: P63/mmc, #194), (Cr2Hf)2Al3C3 crystallizes in a monoclinic structure with space group P21/m (#11) and lattice parameters a = 5.1739 Å, b = 5.1974 Å, c = 12.8019 Å; α = β = 90°, γ = 119.8509°. The calculated total energy of this crystal is found to be energetically more favorable than those of potential competing phases.
The calculated total energy per formula unit of −102.11 eV is significantly lower than those of the allotropic segregation (−100.05 eV) and solid solution (−100.13 eV) phases. Calculations using a stress-versus-strain approach and the VRH approximation for polycrystals show that (Cr2Hf)2Al3C3 has outstanding elastic moduli, better than those of Cr2AlC or Hf2AlC. Obviously, this approach can be used to explore many more new phases with exotic properties. It is probably premature to apply the techniques of materials informatics to study quaternary MAX-like compounds at this stage unless new creative algorithms can be designed.

10.5.2 CSH-Cement Crystals

Cement materials represent another system that is highly amenable to materials informatics. They are very complicated in both composition and structure but have well-defined industrial standards for the desirable attributes of their properties. Most of all, they are of great importance and relevance to the construction industry, the environment, and the world economy. Calcium silicate hydrate (CSH) is the main binding phase of Portland cement, the single most important structural material in use worldwide. Due to the complex structure and chemistry of CSH, accurate computational studies at the atomic level are almost non-existent. Recently, we studied the electronic structure and bonding of a large subset of the known CSH minerals [37]. Table 10.3 lists the 20 CS and CSH crystal phases with well-documented atomic positions used in this study. They are divided into four groups according to the Strunz scheme [38]: a, clinker and hydroxide phases; b, nesosubsilicates; c, sorosilicates; and d, inosilicates. Each group in Table 10.3 is arranged in ascending order of the calcium to silicon (C/S) ratio.

Table 10.3 List of the 20 CS and CSH crystals divided into four groups based upon the Strunz classification, as discussed in our previous study [37]. For each crystal the table gives the mineral name, chemical formula, symmetry/space group, Ca/Si ratio, density ρ (g/cc), and TBOD. Group a (clinker/hydroxide): belite, alite, portlandite; group b (nesosubsilicates): afwillite, α-C2SH, dellaite, Ca chondrodite; group c (sorosilicates): rosenhahnite, suolunite, kilchoanite, killalaite, jaffeite; group d (inosilicates): nekoite, T11 Å, T14 Å, T9 Å, wollastonite, xonotlite, foshagite, jennite

The clinker phases (a.1 and a.2, with no H) and portlandite (a.3, with no Si) are placed in group a.
Portlandite is included in this group because it forms the basis for the hydration of cement. Our results reveal a wide range of contributions from each type of bonding, especially the hydrogen bonding. We find that the total bond order density (TBOD) is again an ideal overall metric for assessing the crystal cohesion of these complex materials and should replace the conventionally used Ca/Si ratio. A little-known orthorhombic phase, suolunite, is found to have higher cohesion (TBOD) than jennite and tobermorite, which are considered to be the backbone of hydrated Portland cement [37, 39, 40]. Obviously, the crystalline CSH phases listed in Table 10.3 can be greatly expanded to include additional elements, such as Al in the Ca-Si-Al hydrates. A large database of cement crystals, similar to that for the MAX phases, can be built for materials informatics to design new construction materials that are more economical, environmentally friendly, and durable. This is another example of using the TBOD as a proper descriptor for materials design.

10.5.3 Extension to Other Materials Systems: Bulk Metallic Glasses and High Entropy Alloys

Other promising systems for materials informatics are bulk metallic glasses (BMGs) [41] and the related high-entropy alloys (HEAs) [42]. Metallic glasses are a special class of non-crystalline solids that are completely different from crystalline metals due to their lack of long-range order. They have many excellent properties and significant potential as next-generation structural materials. However, there is a lack of fundamental understanding of the structure and dynamics of BMGs at the atomic and electronic level, despite many years of intense research. Many of the fundamental issues in BMGs require accurate data that can only be obtained by first-principles calculations. Detailed information about the atomic-scale interactions and their implications for the short-range and medium-range order is still missing. Current research efforts appear to focus mostly on the geometrical analysis of structures to explain the mechanical properties, deformation behavior, glass-forming ability, etc. We again advocate the use of the TBOD from high-quality electronic structure calculations as a useful theoretical metric to characterize the overall properties of a BMG, which can be correlated with the glass-forming ability and other physical properties. The challenges we face are the requirements on both the accuracy and the size of the BMG models and the large number of models that are needed to reach valid conclusions. Most conventional BMGs are either binary (e.g. ZrxCu1−x and NixNb1−x) or ternary alloys such as ZrxCuyAlz. However, there are BMGs with more than 3 or 4 components, such as Zr41.2Ti13.8Cu12.5Ni10.0Be22.5 (Vitreloy) [43]. In these multi-component BMGs, accurate ab initio modeling is a sine qua non, because classical molecular dynamics simulations are infeasible due to the lack or inadequacy of appropriate potentials. The dependence of BMGs on the specific composition requires a large number of calculations to validate any hypothesis.

High-entropy alloys (HEAs) represent another class of systems that are ideal for a materials informatics approach. Unlike traditional alloys, which are based on principal elements (Fe, Ni, Cu, Ti, Zr, Al, etc.) as the matrix, an HEA is essentially an n-component alloy system with 5 ≤ n ≤ 13.
The percentage of each major component Xi satisfies 5 % ≤ Xi ≤ 35 %, while that of each minor component Xs is ≤ 5 %. High entropy implies high n. The compositional possibilities for HEAs are almost unlimited. They have attracted a great deal of attention in recent years as replacements for traditional alloys such as Ni3Al and Ti3Al, which have reached their ultimate limits of materials performance. Many new applications in different industrial and medical areas require alloys with special properties such as high hardness and strength at high temperature, resistance to wear and oxidation, low thermal conductivity, special magnetic properties, and easy formation of nanoparticles. The main effects offered by HEAs are the thermodynamic (high-entropy) effect, the dynamic effect, the lattice distortion effect due to the different sizes of the elements, and the effect due to interatomic interactions (the so-called cocktail effect). A major difference between HEAs and BMGs is that the underlying structure of HEAs is crystalline, mostly an fcc lattice or a mixture of fcc and bcc lattices, even though both HEAs and BMGs are disordered alloys. HEAs are more suitable for the systematic application of materials informatics tools because the structural part of the alloy is much simpler to model than in BMGs. On the other hand, the challenge is the enormous number of compositional possibilities, which will make the database extremely large.

10.6 Conclusions

In this Chapter, we have discussed the construction and analysis of a large database for a unique class of materials, the MAX phases, and we have articulated a specific approach for using ab initio data for materials informatics. What we have learned is that materials informatics is extremely useful, but also that it faces many challenges. Our approach needs a large amount of computational resources, depending on the systems to be studied, but creative planning and targeted application, together with the ways in which the data are presented, are very important. A data mining approach can be very effective for accelerating database generation, as exemplified by the MAX phase study. The selective process of establishing internal links among the potential descriptors is the key. We have also found that the total bond order density (TBOD) is a very useful descriptor for analyzing and interpreting a variety of properties. We also described several other materials systems that can employ a similar approach for materials informatics research, because they share some common attributes with the MAX phases and also have well-defined descriptors.

Acknowledgments I acknowledge with thanks the contributions and assistance from Drs. Sitaram Aryal, Yuxiang Mo and Liaoyuan Wang; Professors Michel W. Barsoum, Ridwan Sakidja, and Paul Rulis; Mr. Chamila C. Dharmawardhana, and Mr. Chandra Dhakal. This work was supported by the National Energy Technology Laboratory (NETL) of the U.S. Department of Energy (DOE) under Grant No. DE-FE0005865. This research used the resources of the National Energy Research Scientific Computing Center (NERSC) supported by the Office of Basic Science of DOE under Contract No. DE-AC03-76SF00098.

References

1. V. Vapnik, The Nature of Statistical Learning Theory (Springer Science & Business Media, New York, 2000)
2. K. Rajan, Materials informatics. Mater. Today 8(10), 38–45 (2005)
3. R.F. Service, Materials scientists look to a data-intensive future. Science 335(6075), 1434–1435 (2012)
Science 335(6075), 1434–1435 (2012) 4. P. Jiang, X.S. Liu, Big data mining yields novel insights on cancer. Nat. Genet. 47(2), 103–104 (2015) 5. P.V. Balachandran, S.R. Broderick, K. Rajan, Identifying the ‘inorganic gene’ for hightemperature piezoelectric perovskites through statistical learning. Proc. R. Soc. A 467, 2271– 2290 (2011) 6. M. Nishijima et al., Accelerated discovery of cathode materials with prolonged cycle life for lithium-ion battery. Nat. Commun. 5 (2014) 7. J. Carrete et al., Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling. Phys. Rev. X 4(1), 011019 (2014) 8. Y. Saad et al., Data mining for materials: computational experiments with AB compounds. Phys. Rev. B 85(10), 104104 (2012) 9. A.W. Bosse, E.K. Lin, Polymer physics and the materials genome initiative. J. Polym. Sci. Part B: Polym. Phys. 53(2), 89 (2015) 10. S. Broderick et al., An informatics based analysis of the impact of isotope substitution on phonon modes in graphene. Appl. Phys. Lett. 104(24), 243110 (2014) 11. S. Aryal et al., A genomic approach to the stability, elastic, and electronic properties of the MAX phases. Phys. Status Solidi (b) 251(8), 1480–1497 (2014) 12. M.W. Barsoum, MAX Phases: Properties of Machinable Ternary Carbides and Nitrides (Wiley, New York, 2013) 13. S.F. Pugh, XCII. Relations between the elastic moduli and the plastic properties of polycrystalline pure metals. Lond. Edinb. Dublin Philos. Mag. J. Sci. 45(367), 823–843 (1954) 14. Y. Mo, P. Rulis, W.Y. Ching, Electronic structure and optical conductivities of 20 MAX-phase compounds. Phys. Rev. B 86(16), 165122 (2012) 15. L. Wang, P. Rulis, W.Y. Ching, Calculation of core-level excitation in some MAX-phase compounds. J. Appl. Phys. 114, 023708 (2013) 16. J. Hafner, J. Furthmüller, G. Kresse, Vienna Ab-initio Simulation Package (VASP) (1993), http:// www.vasp.at/ 17. M. Born, K. Huang, Dynamical Theory of Crystal Lattices (Clarendon Press, Oxford, 1956) 18. W.Y. Ching, P. Rulis, Electronic Structure Methods for Complex Materials: The Orthogonalized Linear Combination of Atomic Orbitals. (Oxford University Press, Oxford, 2012) p. 360 19. R. Ahuja et al., Structural, elastic, and high-pressure properties of cubic TiC, TiN, and TiO. Phys. Rev. B 53(6), 3072–3079 (1996) 20. S.R. Nagel, J. Tauc, Nearly-free-electron approach to the theory of metallic glass alloys. Phys. Rev. Lett. 35(6), 380–383 (1975) 21. M.W. Barsoum, MAX Phases: Properties of Machinable Ternary Carbides and Nitrideds (Wiley-VCH, Weinheim, 2013) 212 W.-Y. Ching 22. M. Hall et al., The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009) 23. C. Dhakal, R. Sakidja, S. Aryal, W.Y. Ching, Calculation of lattice thermal conductivity of MAX phases. J. Eur. Ceram. Soc. 35(12), 3203–3212 (2015) 24. D.T. Morelli, G.A. Slack, High lattice thermal conductivity solids, High Thermal Conductivity Materials (Springer, Berlin, 2006), pp. 37–68 25. C.L. Julian, Theory of heat conduction in rare-gas crystals. Phys. Rev. 137(1A), A128 (1965) 26. D.R. Clarke, Materials selection guidelines for low thermal conductivity thermal barrier coatings. Surf. Coat. Technol. 163, 67–74 (2003) 27. S.I. Ranganathan, M. Ostoja-Starzewski, Universal elastic anisotropy index. Phys. Rev. Lett. 101(5), 055504 (2008) 28. C. Zener, Elasticity and Anelasticity of Metals (University of Chicago press, Chicago, 1948) 29. W. Voigt, Lehrbuch Der Kristallphysik (mit Ausschluss Der Kristalloptik). 1928: B.G. Teubner 30. A. 
Reuss, Berechnung der Fließgrenze von Mischkristallen auf Grund der Plastizitätsbedingung für Einkristalle. ZAMM—J. Appl. Math. Mech./Zeitschrift für Angewandte Mathematik und Mechanik 9(1), 49–58 (1929) 31. R. Hill, The elastic behaviour of a crystalline aggregate. Proc. Phys. Soc. Sect. A 65(5), 349 (1952) 32. C.C. Dharamawardhana, W.Y. Ching, Universal Elastic Anisotropy in MAX Phases (unpblished) 33. M.R. Lukatskaya et al., Cation intercalation and high volumetric capacitance of twodimensional titanium carbide. Science 341(6153), 1502–1505 (2013) 34. M. Naguib et al., 25th Anniversary article: MXenes: a new family of two-dimensional materials. Adv. Mater. 26(7), 992–1005 (2014) 35. S. Aryal, R. Sakidja, L. Ouyang, W.-Y. Ching, Elastic and electronic properties of Ti2 Al(C1−x Nx) solid solutions. J. Eur. Ceram. Soc. 35(12), 3219–3227 (2015) 36. Y. Mo, S. Aryal, P. Rulis, W.Y. Ching, Crystal structure and elastic properties of hypothesized MAX phase-like compound (Cr2Hf)2Al3C3. J. Am. Ceram. Soc. 97(8), 2646–2653 (2014) 37. C.C. Dharmawardhana, A. Misra, W.Y. Ching, Quantum mechanical metric for internal cohesion in cement crystals. Sci. Rep. 4, 7332 (2014) 38. H. Strunz, Mineralogische Tabellen (Akad Verl.-Ges. Geest u Portig, Leipzig, 1982) 39. C.C. Dharmawardhana et al., Role of interatomic bonding in the mechanical anisotropy and interlayer cohesion of CSH crystals. Cem. Concr. Res. 52, 123–130 (2013) 40. I.G. Richardson, The calcium silicate hydrates. Cem. Concr. Res. 38, 137–158 (2008) 41. J. Schroers, Bulk metallic glasses. Phys. Today 66(2), 32–37 (2013) 42. J.W. Yeh et al., Nanostructured high-entropy alloys with multiple principal elements: novel alloy design concepts and outcomes. Adv. Eng. Mater. 6(5), 299–303 (2004) 43. A. Peker, W.L. Johnson, A highly processable metallic glass: Zr41. 2Ti13. 8Cu12. 5Ni10. 0Be22. 5. Appl. Phys. Lett. 63(17), 2342–2344 (1993) Chapter 11 Symmetry-Adapted Distortion Modes as Descriptors for Materials Informatics Prasanna V. Balachandran, Nicole A. Benedek and James M. Rondinelli Abstract In this paper, we explore the application of symmetry-mode analysis for establishing structure-property relationships. The approach involves describing a distorted (low-symmetry) structure as arising from a (high-symmetry) parent structure with one or more static symmetry-breaking structural distortions. The analysis utilizes crystal structure data of parent and distorted phase as input and decomposes the distorted structure in terms of symmetry-adapted distortion-modes. These distortionmodes serves as the descriptors for materials informatics. We illustrate the potential impact of these descriptors using perovskite nickelates as an example and show that it provides a useful construct beyond the traditional tolerance factor paradigm found in perovskites to understand the atomic scale origin of physical properties, specifically how unit cell level modifications correlate with macroscopic functionality. 11.1 Introduction One of the common objectives in the paradigm of materials informatics is the robust formulation of structure-property relationships. In materials informatics, normally, the “properties” of interest (e.g. Curie temperature, melting point, tensile strength, ductility, conductivity, polarization, hysteresis etc.) that we intend to optimize are well defined. However, what constitutes a “structure” is often not clear a priori and remains an outstanding issue. Note that in this paper, we restrict the scope of the P.V. 
Balachandran (B) Theoretical Division, Los Alamos National Laboratory, Los Alamos 87545, USA e-mail: pbalachandran@lanl.gov N.A. Benedek Department of Materials Science and Engineering, Cornell University, Ithaca 14853, USA e-mail: nab83@cornell.edu J.M. Rondinelli Department of Materials Science and Engineering, Northwestern University, Evanston 60628, USA e-mail: jrondinelli@northwestern.edu © Springer International Publishing Switzerland 2016 T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_11 213 214 P.V. Balachandran et al definition of “structure” to crystal structures, i.e. spatial arrangement of atoms in one, two or three-dimensions. Generally, the phenomenologically-derived descriptors (also referred to as features), e.g. Shannon’s ionic radius, Pauling’s electronegativity, pseudopotential radii of atomic orbitals and Pettifor’s chemical scale (to name a few), are utilized at a coarse-grained level to represent local or crystal structures and crystal chemistries. There are a number of reports in the literature, where structure-property relationships have been formulated for bulk materials using these descriptors that have even led to the discovery of new materials [1–6]. Although successful, these descriptors lack some desired characteristics. For example, when using these descriptors it is difficult to separate two materials that have the same chemical formula but different crystal symmetries (unless one appends the crystal symmetry as a separate feature). Similarly, owing to the remarkable progress in achieving high-quality coherent thin films, heterostructures and superlattices, heteroepitaxial synthesis has evolved into a reliable strategy to engineer new materials. In such ultra-thin films, strain fields at the thin film-substrate interface directly tune the local electronic states from which novel functionalities and phases prohibited or are absent in bulk materials are stabilized. Once again, the aforementioned descriptors fail under these contexts. Clearly, there is a need to develop more refined descriptors that carry physically relevant information, so that the constructed structure-property relationships not only merely reflect statistical correlations, but also provide avenues to probe mechanistic insights for better understanding. And this is not a trivial task. Recently, computational codes based on ab initio [7–9] and classical methods [10] have also been explored for descriptor development. This approach is desirable, because these approaches contain the essential physics and enable rigorous materials modeling, which are absent in the phenomenological descriptors. Having said that, the cost of running expensive computer simulations on large systems could prove prohibitive and it is important to be wary of this shortcoming. 11.2 Distortion Modes as Descriptors In this paper, we focus on developing descriptors based on distortion-mode decomposition analysis (or symmetry-mode analysis) that provide a rigorous basis set for studying crystal structures. Particularly, these descriptors are best-suited for problems (e.g. ferroelectricity, piezoelectricity, shape memory effect, ion transport etc.,) in condensed matter systems that rely on structure-based materials design. 
In such materials, the ability to deterministically control local atomic structure would enable tuning many important electronic and structural functionalities, in turn critical for technological applications. In fact, one of the common themes is that these materials show some form of symmetry-breaking structural phase transitions and/or local structural distortions [e.g. cooperative atomic displacements (also known as “shuffles”), lattice strain, coupling between shuffles and strain]. 11 Symmetry-Adapted Distortion Modes as Descriptors for Materials Informatics 215 Symmetry-mode analysis involves describing a distorted (low-symmetry) structure as arising from a (high-symmetry) parent structure with one or more static symmetry-breaking structural distortions. In the undistorted parent structure, symmetry-breaking distortion-modes have zero amplitude. The low-symmetry phase, however, will have finite amplitudes for each mode described by an irreducible representation (irrep) of the high-symmetry structure compatible with the symmetry breaking that are defined relative to specific k-points [11]. Additional details regarding distortion-mode decomposition analysis may be found in the literature [12–15]. The distortion-mode analysis is powerful, because it provides a complete and systematic basis to isolate multiple and complex distortions in crystals. By comparison of the amplitude of various modes, it is possible to directly assess each modes contribution to the mechanism underlying a structural and electronic phase transition. Furthermore, the distortion-mode analysis relies solely on crystal structure data, which enables both bulk and thin film stabilized structures with identical compositions to be evaluated on equal footing [16]. What is of particular utility in formulating quantitative structure-property relationships using distortion modes is that each irrep carries a physical representation of the displacive distortions—the unique atomic coordinates describing various symmetry-adapted structural modes. The relative importance of these modes on properties may then be mapped by means of ab initio computational methods or via detailed and systematic experimentation. Accessibility to computational methods make the distortion-mode analysis powerful, because it is possible to independently study various distortions and directly assess their role in structural and electronic phase transition mechanisms and macroscopic properties. Note that such direct comparison is not possible through aggregate parameters widely followed in the literature such as the tolerance factor, ionic radius, or electronegativity, i.e., when the composition is fixed. The physical basis that supports the usage of distortion-modes for materials informatics is grounded in Landau theory [17], where the free energy of a crystalline solid undergoing a phase transition from a high-symmetry parent phase to a low-symmetry distorted phase can be expressed in terms of one or more order parameters. In this paper, we discuss the implications of symmetry-mode analysis as descriptors for materials informatics based on the perovskite structure class of materials. One of the motivations for choosing perovskites and oxides is based on the works of Benedek and Fennie [18] and Cammarata and Rondinelli [19], who used a combination of symmetry arguments and first-principles calculations to explore the connection between structural distortions and materials functionality. 
We have extended these guidelines to the family of perovskite nickelates (originally not considered by Benedek and Fennie [18]), where we uncover the meaning of the metric "tolerance factor", t = (r_A + r_O) / (√2 (r_B + r_O)), where r_A, r_B and r_O are the Shannon ionic radii [20] of the A, B and oxygen elements in the ABO3 chemical formula, and show that it encodes information pertaining to a set of the key distortion modes. It is this intriguing connection between t and the distortion modes that makes t such an informative descriptor for capturing key structural and chemical trends in nickelates. Furthermore, we also show that t does not account for all distortion modes present in the ground state structure.

11.3 Perovskite Nickelates

The structure of perovskite oxides (see Fig. 11.1) is characterized by a three-dimensional network of corner-connected metal–oxygen octahedra, with alkali, alkaline-earth or lanthanide elements filling holes in the body centers of the octahedral network. These nickelate oxides exhibit non-trivial changes in structure and physical properties, including sharp first-order temperature-driven metal-to-insulator transitions, unusual antiferromagnetic order in the ground state, and site- or bond-centered charge disproportionation owing to the valence and spin state flexibility of the Ni3+ cation [21]. Furthermore, it has been shown that both the electronic and magnetic transition temperatures can be modified by applying epitaxial strain when these materials are grown as thin films [22, 23].

Fig. 11.1 Crystal structure of an ideal cubic perovskite showing three-dimensional octahedral BO6 connectivity with A-site cations filling the holes of the octahedral network

In Fig. 11.2, we show the rare earth (R) cation–temperature phase diagram of bulk RNiO3 nickelate perovskites. In our earlier work [15], we focused on two important characteristics in the phase diagram: (i) the metal-to-insulator transition temperature (TMI) and (ii) the paramagnetic-to-antiferromagnetic phase transition temperature (TN, the Néel temperature). We uncovered key distortion modes and statistical correlations that govern the temperatures of the two phase transitions. One of the important findings is that the R3+ and M5+ irreps capture TN trends that were previously unknown, and the implication is that these distortions are not encoded in the widely recognized t metric.

Fig. 11.2 Rare earth cation–temperature phase diagram of RNiO3 perovskite nickelates [21]

Note that in RNiO3 the value of rNi is fixed; therefore, t for RNiO3 is equivalent to rR (i.e. the ionic size of the trivalent rare earth ion). Our objective here is defined as follows: can we use informatics to uncover the physical meaning of t in terms of the distortion modes? We address this question by building a data set of distortion modes of known RNiO3 perovskites (see the work of Balachandran and Rondinelli [15] for additional details on symmetry-mode analysis). We used a total of 10 RNiO3 compounds for our analysis, where R = La, Nd, Pr, Tm, Lu, Dy, Er, Y, Ho and Yb. Except for LaNiO3, all other nickelates were considered in the experimental ground state monoclinic P21/c structure; we used the rhombohedral R-3c ground state structure for LaNiO3.
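As a concrete illustration of how t is evaluated across the RNiO3 series, the short Python sketch below computes the tolerance factor from tabulated ionic radii. It is only an illustration, not part of the original analysis: the radii in the dictionary are approximate placeholder values (Shannon radii depend on the coordination number chosen for each site), and the function name is ours.

```python
# Minimal sketch (not from the chapter): computing the tolerance factor
# t = (r_A + r_O) / (sqrt(2) * (r_B + r_O)) for RNiO3 perovskites.
import math

R_O = 1.40   # O2- radius in angstrom (assumed, VI coordination)
R_NI = 0.56  # Ni3+ low-spin radius in angstrom (assumed, VI coordination)

# Approximate/illustrative trivalent rare-earth radii (angstrom); replace with
# Shannon values for the coordination number appropriate to the A site.
R_RARE_EARTH = {"La": 1.22, "Pr": 1.18, "Nd": 1.16, "Y": 1.08, "Lu": 1.03}

def tolerance_factor(r_a: float, r_b: float = R_NI, r_o: float = R_O) -> float:
    """Tolerance factor for an ABO3 perovskite from ionic radii."""
    return (r_a + r_o) / (math.sqrt(2.0) * (r_b + r_o))

for element, r_a in R_RARE_EARTH.items():
    print(f"{element}NiO3: t = {tolerance_factor(r_a):.3f}")
```

Because r_Ni and r_O are fixed across the series, t varies monotonically with r_R, which is why t and r_R can be used interchangeably for RNiO3, as noted above.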
11.3.1 Statistical Correlation Analysis

In Fig. 11.3, we show that the irreps are statistically correlated with one another, indicating that the distortions occur cooperatively in bulk nickelates. A strong positive correlation is found between TMI and the irreps that describe distortions to the NiO6 octahedra: M2+, M3+, X5+, R4+ and R5+. These five irreps fully describe the Pnma (space group #62) crystal structure relative to the cubic phase found in the metallic nickelates at high temperature, reinforcing the concept that the orthorhombic distortions are largely responsible for the electronic bandwidth-controlled transport behavior in nickelates. Our analyses also reveal the existence of a strong linear relationship between TN and two irreps, R3+ and M5+. The linear relationship is valid for both TMI = TN and TMI > TN nickelates, indicating that R3+ and M5+ contain additional information not captured by either the conventional Ni–O–Ni angle or the tolerance factor descriptors. Although eight algebraically independent irreps are necessary to decompose the monoclinic P21/c phase, the presence of statistical correlation suggests redundancy—meaning that we can further reduce the complexity of the dataset and transform the statistically correlated irreps as linear combinations of one another. We used principal component analysis (PCA) to accomplish this objective.

Fig. 11.3 Statistical correlation plot showing the positive (blue) and negative (red) pairwise correlations between the distortion modes (M2+, M3+, X5+, R4+, R5+, R1+, R3+ and M5+), TMI and TN. © 2014 Reproduced with permission of the American Physical Society from [15]

11.3.2 Principal Component Analysis (PCA)

PCA is one of the well-known linear data-dimensionality reduction methods [24]. PCA assumes that the dataset consists of a large number of intercorrelated descriptors that lie on a linear manifold. The purpose is to reduce the dimensionality of a data set while retaining maximum variability. This is achieved by transforming the original set of variables into a new set of derived variables, called the principal components (PCs), which are ordered so that the first few retain most of the variation present in all of the original variables. The first PC accounts for the maximum variance (highest eigenvalue) in the dataset; the second PC is orthogonal to the first and accounts for most of the remaining variance. Thus, the mth PC is orthogonal to all others and has the mth largest variance in the set of PCs. Once all the PCs have been calculated, only those with eigenvalues above a critical level (a rule of thumb is to retain only those PCs whose eigenvalue is greater than or equal to 1) are retained. Each PC is a linear combination of the weighted contributions of all attributes, and the magnitude of the weight determines the relative impact of each descriptor in affecting the PC. From the knowledge of the calculated PCs, one can determine the relative importance of each descriptor and the correlation between any two descriptors. Information pertaining to the relative importance of the descriptors is helpful in identifying the dominant descriptors, whereas the correlation information is helpful in screening the dominant descriptors to avoid choosing redundant ones.
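To make the PCA workflow concrete, the following sketch applies scikit-learn's PCA to a mode-amplitude matrix. It is an assumed workflow, not the authors' code: the 10 × 8 array is a random placeholder standing in for the actual irrep amplitudes of the ten RNiO3 compounds, and only mean-centering (no rescaling) is assumed.

```python
# Minimal sketch (assumed workflow): PCA on a matrix of symmetry-adapted mode
# amplitudes.  `amplitudes` is a hypothetical 10 x 8 array (rows: RNiO3
# compounds, columns: the eight irrep amplitudes).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
amplitudes = rng.normal(size=(10, 8))         # placeholder for the real data
irreps = ["R1+", "R3+", "R4+", "R5+", "X5+", "M2+", "M3+", "M5+"]

pca = PCA()                                   # mean-centering is done internally
scores = pca.fit_transform(amplitudes)        # PC scores for each compound

# Scree information: fraction of variance captured by each PC
for i, frac in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {100 * frac:.1f} % of variance")

# Loadings (weights of each irrep in each PC), analogous in form to (11.1)-(11.2)
for j, component in enumerate(pca.components_[:2], start=1):
    terms = " ".join(f"{w:+.2f} {name}" for w, name in zip(component, irreps))
    print(f"PC{j} = {terms}")
```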
Here, we find that the first three PCs together capture 95 % of the variation in the dataset. The scree plot is shown in Fig. 11.4. As a result, we have reduced the dimensionality of the dataset from 8 to 3, and we retain only the first three PCs for further consideration. In fact, the first two PCs alone capture 91 % of the variation in the data. In (11.1) and (11.2), we show the weighted contribution from the linear combination of irreps captured by PC1 and PC2, respectively.

Fig. 11.4 Scree plot showing the relative variance of each principal component (PC) from the RNiO3 data set. The first three PCs together capture more than 95 % of the variance in the data set. After the third PC, the relative variance captured by the subsequent PCs is small and can be ignored

PC1 = −0.38 R1+ − 0.12 R3+ − 0.41 R4+ − 0.42 R5+ − 0.38 X5+ − 0.41 M2+ − 0.42 M3+ − 0.06 M5+   (11.1)

PC2 = 0.17 R1+ + 0.66 R3+ − 0.17 R4+ − 0.10 R5+ − 0.12 X5+ − 0.01 M2+ − 0.06 M3+ − 0.69 M5+   (11.2)

Note that PC1 and PC2 capture 68 and 23 % of the variation in the dataset, respectively. PC1 captures descriptors associated with the octahedral distortions that describe the orthorhombic crystal symmetry (Pnma); note also that there is a significant contribution from the R1+ irrep, which describes the octahedral breathing distortion, the primary order parameter for the phase transition from the paramagnetic metallic Pnma structure to the paramagnetic insulating monoclinic P21/c structure. On the other hand, in PC2 the R3+ (Jahn-Teller distortion) and M5+ (out-of-phase tilting) distortions have the dominant contributions, but are orthogonal to PC1. One of the active areas of research in perovskite nickelates is to identify the mechanism responsible for the metal-to-insulator and the paramagnetic-to-antiferromagnetic phase transitions. Clearly, the insights hidden in PC1 and PC2 should be rigorously explored using additional experimentation and theoretical simulations to elucidate the physical origin behind these correlations.

In Fig. 11.5, we show how PC1 and PC2 relate to t. PC1 (Fig. 11.5a) correlates strongly with t, with a correlation coefficient (R2) of 0.90, relative to PC2 (Fig. 11.5b), whose R2 is only modest at 0.74. Note that Fig. 11.5 also includes LaNiO3, whose ground state structure is R-3c, in sharp contrast with the other nickelates, whose ground state is P21/c. The key implication is that the octahedral distortions (in terms of M2+, M3+, X5+, R4+ and R5+) that describe the Pnma symmetry and the breathing distortion (R1+) together correlate strongly with the t metric. The full description of PC1 is given in (11.1). On the other hand, the octahedral distortions described by the irreps R3+ and M5+ (see (11.2)) do not correlate strongly with t, indicating that the geometric t metric is much less sensitive to electronic-based structural effects such as Jahn-Teller distortions.

Fig. 11.5 Scatter plot between the tolerance factor (t, y-axis) and (left) principal component 1 and (right) principal component 2. Principal component 1 correlates strongly with t (R2 = 0.90), relative to principal component 2 (R2 = 0.74)

11.4 Summary

In summary, descriptor development is a critical component of the materials informatics research paradigm.
The choice of the descriptions must be such that, in addition to helping accomplish statistical correlations between structure and property, it must provide mechanistic insights to address causal relationships. Distortion modes based on symmetry-mode analyses satisfy these requirements, which makes it very attractive for developing quantitative structure-property relationships in materials informatics. 11 Symmetry-Adapted Distortion Modes as Descriptors for Materials Informatics 221 Acknowledgments P.V.B. acknowledges funding support from the Los Alamos National Laboratory (LANL) Laboratory Directed Research and Development (LDRD) DR (#20140013DR) on Materials Informatics. J.M.R. acknowledges funding support from the NSF (DMR-1454688). References 1. E.S. Machlin, T.P. Chow, J.C. Phillips, Structural stability of suboctet simple binary compounds. Phys. Rev. Lett. 38, 1292–1295 (1977) 2. J.R. Chelikowsky, J.C. Phillips, Quantum-defect theory of heats of formation and structural transition energies of liquid and solid simple metal alloys and compounds. Phys. Rev. B 17, 2453–2477 (1978) 3. P.B. Littlewood, Structure and bonding in narrow gap semiconductors. Crit. Rev. Solid State Mater. Sci. 11(3), 229–285 (1983) 4. A. Zunger, Systematization of the stable crystal structure of all AB-type binary compounds: a pseudopotential orbital-radii approach. Phys. Rev. B 22, 5839–5872 (1980) 5. T.R. Paudel, A. Zakutayev, S. Lany, M. d’Avezac, A. Zunger, Doping rules and doping prototypes in A2 BO4 spinel oxides. Adv. Funct. Mater. 21(23), 4493–4501 (2011) 6. P.V. Balachandran, S.R. Broderick, K. Rajan, Identifying the inorganic gene for hightemperature piezoelectric perovskites through statistical learning. Proc. R. Soc. A: Math. Phys. Eng. Sci. 467(2132), 2271–2290 (2011) 7. J. Yan, P. Gorai, B. Ortiz, S. Miller, S.A. Barnett, T. Mason, V. Stevanovic, E.S. Toberer, Material descriptors for predicting thermoelectric performance. Energy Environ. Sci. 8, 983–994 (2015) 8. B. Meredig, C. Wolverton, Dissolving the periodic table in cubic zirconia: data mining to discover chemical trends. Chem. Mater. 26(6), 1985–1991 (2014) 9. L.M. Ghiringhelli, J. Vybiral, S.V. Levchenko, C. Draxl, M. Scheffler, Big data of materials science: critical role of the descriptor. Phys. Rev. Lett. 114, 105503 (2015) 10. T. Das, T. Lookman, M.M. Bandi, A minimal description of morphological hierarchy in twodimensional aggregates. Soft Matter 11, 6740–6746 (2015) 11. B.J. Campbell, H.T. Stokes, D.E. Tanner, D.M. Hatch, ISODISPLACE: a web-based tool for exploring structural distortions. J. Appl. Crystallogr. 39(4), 607–614 (2006) 12. C.J. Howard, H.T. Stokes, Group-theoretical analysis of octahedral tilting in perovskites. Acta Crystallogr. Sect. B 54(6), 782–789 (1998) 13. J.M. Perez-Mato, D. Orobengoa, M.I. Aroyo, Mode crystallography of distorted structures. Acta Crystallogr. Sect. A 66(5), 558–590, (2010). http://dx.doi.org/10.1107/S0108767310016247 14. D. Orobengoa, C. Capillas, M.I. Aroyo, J.M. Perez-Mato, AMPLIMODES: symmetry-mode analysis on the bilbao crystallographic server. J. Appl. Crystallogr. 42(5), 820–833 (2009) 15. P.V. Balachandran, J.M. Rondinelli, Interplay of octahedral rotations and breathing distortions in charge-ordering perovskite oxides. Phys. Rev. B 88, 054101 (2013) 16. I.C. Tung, P.V. Balachandran, J. Liu, B.A. Gray, E.A. Karapetrova, J.H. Lee, J. Chakhalian, M.J. Bedzyk, J.M. Rondinelli, J.W. Freeland, Connecting bulk symmetry and orbital polarization in strained RNiO3 ultrathin films. Phys. Rev. 
B 88, 205112 (2013) 17. J. Tolédano, P. Tolédano, The Landau Theory of Phase Transitions. (World Scientific, Singapore, 1987) 18. N.A. Benedek, C.J. Fennie, Why are there so few perovskite ferroelectrics?. J. Phys. Chem. C 117(26), 13339–13349 (2013) 19. A. Cammarata, J.M. Rondinelli, Contributions of correlated acentric atomic displacements to the nonlinear second harmonic generation and response. ACS Photonics 1(2), 96–100 (2014) 20. R.D. Shannon, Revised effective ionic radii and systematic studies of interatomic distances in halides and chalcogenides. Acta Crystallogr. Sect. A 32, 751–767 (1976) 21. G. Catalan, Progress in perovskite nickelate research. Phase Trans. 81, 729–749 (2008) 222 P.V. Balachandran et al 22. J. Chakhalian, J.M. Rondinelli, J. Liu, B.A. Gray, M. Kareev, E.J. Moon, N. Prasai, J.L. Cohn, M. Varela, I.C. Tung, M.J. Bedzyk, S.G. Altendorf, F. Strigari, B. Dabrowski, L.H. Tjeng, P.J. Ryan, J.W. Freeland, Asymmetric orbital-lattice interactions in ultrathin correlated oxide films. Phys. Rev. Lett. 107, 116805 (2011) 23. J.M. Rondinelli, S.J. May, J.W. Freeland, Control of octahedral connectivity in perovskite oxide heterostructures: an emerging route to multifunctional materials discovery. MRS Bull. 37, 261–270 (2012) 24. M. Ringnér, What is principal component analysis? Nat. Biotech. 26(3), 303–304 (2008) Chapter 12 Discovering Electronic Signatures for Phase Stability of Intermetallics via Machine Learning Scott R. Broderick and Krishna Rajan Abstract In this paper, we identify the signatures of the density of states (DOS) spectra which control the bulk modulus via a hybrid informatics driven analysis. The signatures of the DOS spectra then constitute the electronic structure fingerprint of the material. This provides an important step in the “inverse design” process because if we are able to compute bulk modulus from the DOS, then we can also compute the DOS from the bulk modulus, and in this way create a “virtual” DOS based on optimized properties. In this paper, we identify the signatures for bulk modulus, and associate the signatures with specific chemistry and crystal structure. Further, we identify the details in the electronic structure that result in Ni3 Al and Co3 Al having such different stabilities in L12 structure although they are seemingly isoelectronic. This paper lays out the methodology for extracting these features and has significant implications, such as in the identification of critical element substitutions, by developing a framework for accelerated and targeted materials design. 12.1 Introduction This paper develops a template for “inverse design” of alloy chemistries, which we demonstrate for the density of states and bulk modulus of a material. The questions we are asking here is: (i) if we know the target property for a material we want, can we from that compute the chemistry and structure of the material and (ii) what signatures of the density states spectra dictate very different structural stabilities of seemingly isoelectronic systems? These issues represent an inverse logic to traditional materials design, such as through density functional theory (DFT), S.R. Broderick · K. Rajan (B) Department of Materials Design and Innovation, University at Buffalo—The State University of New York, Buffalo, NY, USA e-mail: krajan3@buffalo.edu © Springer International Publishing Switzerland 2016 T. Lookman et al. 
(eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_12 223 224 S.R. Broderick and K. Rajan where the input is the chemistry and structure, and the output is property and stability [1–4]. This provides an alternate approach to other linkages of DFT and informatics which seek to calculate properties for as many materials as possible, and provide a database that can be searched for the material closest to having the target properties [5]. We instead start with a relatively small database and require few, but clearly defined, DFT calculations. This logic further differs from the traditional definition of inverse design in materials science by going from property to condition, as opposed to defining inverse design as going from calculation to experiment [6, 7]. We have previously employed informatics for modeling the density of states (DOS) spectra as a function of the properties of constituent elements [8]. This work represented a new approach for rapidly modeling DOS spectra with an accuracy nearly equivalent to DFT calculations. Further, our prior work also demonstrated the capability of informatics for extracting signatures of the DOS spectra correlating to chemistry, crystal structure and stoichiometry [9, 10]. These prior works therefore introduced an approach for modeling DOS spectra based on modifications in the material chemistry and structure. This paper develops the next stage in the inverse design problem of connecting bulk modulus and density of states. That is, if we can extract modulus from the DOS spectra, then we can design a “virtual” DOS spectra which is optimized based on our property requirements. When connected with our prior works in connecting chemistry and DOS [8], and connecting crystal structure to DOS [9, 10], the framework is completed for going from target property to “virtual” DOS to crystal structure and chemistry, and therefore the computation of a “virtual” material with the target properties. As the DOS represents all electronic interactions of a system, it theoretically contains information on all electronic properties [11–15]. However, the understanding of how these properties are captured by the DOS is not well understood. Therefore, another objective of this paper is in understanding the connection between DOS and property. That is, to identify what signatures related to the intensities of the DOS are controlling the material property. One example of a property which is known to be at least qualitatively represented within the DOS spectra for single elements is bulk modulus [16]. The Fermi energy (EF ) indicates the maximum occupancy by electrons at ground state conditions, with DOS values at energies greater than EF representing unoccupied available states, while the transition from bonding to antibonding states is represented as a well-defined and extended valley [17]. Occupancy of a bonding state corresponds with an increase in strength, while additional occupancy of an antibonding state results in a decrease in strength. The bulk modulus can then be found to be related with the distance between the bonding-antibonding transition and EF . We expand that logic here but for alloy systems. 
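The qualitative link between the bonding/anti-bonding valley and EF described above can be turned into a simple numerical descriptor. The sketch below is our own illustration, not part of the original analysis: it merely locates the DOS minimum in an energy window around EF (taken as E = 0) and reports its separation from EF; the window and the synthetic DOS are arbitrary choices.

```python
# Illustrative sketch (an assumption, not code from the chapter): estimate the
# energy separation between the bonding/anti-bonding "valley" in a DOS curve
# and the Fermi energy, with the energy grid aligned so that E_F = 0.
import numpy as np

def valley_to_fermi_distance(energies, dos, window=(-2.0, 4.0)):
    """Energy separation (eV) between the DOS minimum in `window` and E_F = 0."""
    energies = np.asarray(energies)
    dos = np.asarray(dos)
    mask = (energies >= window[0]) & (energies <= window[1])
    valley_energy = energies[mask][np.argmin(dos[mask])]
    return abs(valley_energy)

# Synthetic example: two broad peaks with a dip between them, above E_F
e = np.linspace(-10, 10, 1000)
dos = np.exp(-0.5 * ((e + 2) / 2.5) ** 2) + np.exp(-0.5 * ((e - 5) / 2.0) ** 2)
print(f"valley at {valley_to_fermi_distance(e, dos):.2f} eV from E_F")
```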
12.2 Informatics Background and Data Processing The ability to predict properties of new alloy systems from an input of elemental DOS requires the integration of principal component analysis (PCA) and partial least 12 Discovering Electronic Signatures for Phase Stability of Intermetallics … 225 squares (PLS). This work represents a hybrid approach because we are predicting the property as a summation of the PLS coefficients and the PCA weightings (i.e. Property = f[(Component of PLS result * Component of PCA result)] as opposed to considering the two components independently (i.e. Property = f[(Component of PLS result)* (Component of PCA result)]. That is, in the final development of an equation, the PLS and PCA components of the analysis cannot be separately extracted. This hybrid capacity of the approach is demonstrated in this paper. The PCA serves to extract the unique patterns within the DOS spectra most correlated to the information discriminating the materials. A dimensionally reduced map can then be used to correlate the conditions of the materials to the signatures of the DOS spectra. The primary application however in this paper is the parameterization of the DOS spectra, where the parameterization is not based on a curve fitting, but rather by correlating the conditions of the material with the features. PLS is then used to develop a predictive model between these PCA derived parameters and material property, in the form of a quantitative structure-property relationship (QSPR). The input DOS curves were calculated using the full-potential linearized augmented plane wave (FP-LAPW) method [18] within the density functional theory (DFT) [19] approach and implemented in the WIEN2K code [20]. The exchangecorrelation term was determined within the generalized gradient approximation (GGA) using the scheme of Perdew and Wang [21]. DFT is based on the discovery that a relationship exists between the potential of a system and the electronic density and is able to model the electronic structure based on the relationships between these factors. The input into a DFT calculation is the chemistry and relative atom positions, and using quantum mechanical approximations DFT is able to model the electronic structure. Although the calculation is based on a k-space representation and structure is not directly involved in the calculation, we have previously shown using statistical learning that crystal structure is clearly represented within the DOS spectra [9, 10]. PCA classifies the data based on a set of orthogonalized axes (principal components) comprised of a combination of descriptors which maximize the variance in the data captured [22–25]. By applying PCA to the DOS spectra, the strongest patterns in the data can be identified in a limited number of dimensions. PCA operates by performing an eigenvector decomposition of the data. As such, the principal components (PCs) capturing the most information are associated with the largest eigenvalues of the covariance matrix and their corresponding eigenvectors. The original data is decomposed into two matrices: the scores (T ) and loadings (P). The scores matrix classifies the samples, in this case different alloy chemistries and structures, as defined by their differences in the DOS. The loadings matrix contains information on how the different descriptors (here DOS at specific energy values) differentiate the samples. The PCA equation is summarized by (12.1), where E is the residual matrix, and X is the input data matrix. 
X = T · P^T + E   (12.1)

The loadings and scores matrices contain the principal patterns within the DOS curves and the scaling of those patterns to create the final DOS curve, respectively. The number of dimensions of T and P may be equal to the number of data points within the entire DOS curve, and is on the order of hundreds of PCs in this case, although typically a significantly reduced number of PCs is sufficient for capturing the information of interest.

The treatment of the input data is demonstrated in Fig. 12.1, which defines how the data from the DFT calculations are processed prior to being included in X. As an example, we show the DOS of Ni3Al in the L12 structure. As our primary objective here is the extraction of patterns in the DOS spectra, we first normalize each DOS spectrum by dividing all points by its maximum DOS value, so that the largest DOS value becomes unity. Then the mean at each energy value, across all DOS spectra included in the analysis, is calculated, with the DOS aligned so that the Fermi energy (EF) is equal to zero. This mean spectrum is then subtracted from each DOS spectrum. The processed DOS spectrum is then used in the PCA analysis. The DOS spectrum shown as the informatics input in Fig. 12.1 represents an entire row of matrix X in (12.1). This process is repeated for every system included in the analysis.

Fig. 12.1 The development of each row of the input matrix (X in (12.1)), demonstrated for Ni3Al. Each row of X contains a unique alloy chemistry and structure. The columns of the input matrix contain every data point in the DOS curve, the rows contain different alloy systems, and the value in the matrix is the DOS, or intensity, at the specified energy. The DOS is first normalized by dividing every value by the maximum DOS value for the alloy, and then the mean of all DOS spectra at each respective energy is calculated and subtracted from the normalized spectrum for each alloy. This processed (normalized and mean-centered) DOS for each alloy is added as a separate row in the input matrix

In PLS, the training data are converted to a data matrix with orthogonalized axes, which are based on capturing the maximum amount of information in fewer dimensions [26–30]. The relationships discovered in the training data can be applied to a test dataset based on a projection of the data onto a high-dimensional hyperplane within the orthogonalized axis system. Typical linear regression models do not properly account for the co-linearity between descriptors, and as a result the isolated impact of each descriptor on the property cannot be accurately known. However, by projecting the data onto a high-dimensional space defined by orthogonal axes comprised of linear combinations of the spectral parameters defining the DOS curves, the impact of each descriptor on the property can be identified independently of all other descriptors. PLS is used here to predict the relationship between spectral features and bulk modulus for different alloy chemistries. The prediction serves to create a connection between chemistry, electronic structure, and property. The PLS prediction requires two input matrices: a matrix containing descriptors related to the input conditions (the scores matrix) and a matrix containing the values to be predicted (bulk modulus), building a model between the input descriptors and the descriptor to be predicted.
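A minimal sketch of the preprocessing described above and in Fig. 12.1 might look as follows. The function name and the random placeholder data are ours; only the normalize-by-maximum and mean-centering steps are taken from the text.

```python
# Minimal sketch (assumed, not the authors' code) of the preprocessing in
# Fig. 12.1: normalize each DOS curve by its own maximum, then subtract the
# mean spectrum computed across all alloys (curves are assumed E_F-aligned).
import numpy as np

def build_input_matrix(dos_curves):
    """dos_curves: (n_alloys, n_energies) array of E_F-aligned DOS values.
    Returns the normalized, mean-centered matrix X and the mean spectrum."""
    dos_curves = np.asarray(dos_curves, dtype=float)
    normalized = dos_curves / dos_curves.max(axis=1, keepdims=True)  # max -> 1
    mean_spectrum = normalized.mean(axis=0)                          # per energy
    return normalized - mean_spectrum, mean_spectrum

# Placeholder data: 15 alloys x 1000 energy points standing in for DFT DOS
X, mean_spec = build_input_matrix(np.random.rand(15, 1000))
print(X.shape)  # (15, 1000): one processed row per alloy, as in (12.1)
```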
To ensure the accuracy of the QSPR modeling and to verify that we are not over-fitting the data, we apply cross validation to the predicted results. To this end, we compute both the root mean square error of calibration (RMSEC) and the root mean square error of cross validation (RMSECV). We perform a leave-one-out (LOO) cross validation and measure the accuracy of the model with and without the sample left out in the LOO approach. This step is repeated for the removal of each sample from the training data. That is, a model is built with each sample removed, thereby ensuring that the physics captured in the model development is sufficiently robust that it can be used on new materials. The RMSEC and RMSECV values are then used to define the final predictive model. To select the number of latent variables with a suitable combination of accuracy and robustness, we define a criterion for the selection of latent variables based on the ratio RMSECV(m)/RMSECV(m + 1), where m is equal to the number of latent variables. From this criterion, m is selected as the maximum number for which the ratio is below the threshold value of unity (a schematic version of this selection loop is sketched after Table 12.1). To ensure that we are not over-fitting the data, a minimal number of parameters (PC scores values) is included, so that the number of alloy chemistries is sufficiently larger than the number of parameters used as terms in the QSPR. The systems that were modeled via DFT and used in the analysis are listed in Table 12.1, with the bulk modulus (B) values also provided. The crystal structure type for each alloy is shown in parentheses.

Table 12.1 List of alloys modeled via DFT (crystal structure in parentheses) and used as input systems in this work. The modulus values, as calculated via DFT, are also listed

Alloy         B (GPa)
CuZn (B2)     113.8
CoTi (B2)     177.5
Ni3Al (L12)   177.2
NiAl (B2)     189.0
CoAl (B2)     177.5
Co3Al (L12)   123.9
NiTi (B2)     159.0
Be3Co (D03)   156.5
TiAl (L10)    112.5
Fe3Ni (L12)   141.0
FeNi3 (L12)   189.1
FeNi (L10)    183.5
FePd3 (L12)   192.3
Fe3Pd (L12)   155.9
FePd (L10)    179.1
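The latent-variable selection described before Table 12.1 can be sketched as follows. This is an assumed implementation using scikit-learn's PLSRegression and leave-one-out cross validation; the placeholder data and function names are ours, with only the RMSECV(m)/RMSECV(m + 1) < 1 criterion taken from the text.

```python
# Minimal sketch (assumed workflow, not the authors' code): leave-one-out
# RMSECV as a function of the number of PLS latent variables, and the ratio
# criterion RMSECV(m)/RMSECV(m+1) for choosing m.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def rmsecv(X, y, n_latent):
    pls = PLSRegression(n_components=n_latent)
    y_pred = cross_val_predict(pls, X, y, cv=LeaveOneOut())
    return float(np.sqrt(np.mean((y - y_pred.ravel()) ** 2)))

def select_latent_variables(X, y, max_latent=6, threshold=1.0):
    errors = [rmsecv(X, y, m) for m in range(1, max_latent + 1)]
    chosen = 1
    for m in range(1, max_latent):                 # errors[m-1] is RMSECV(m)
        if errors[m - 1] / errors[m] < threshold:  # RMSECV(m)/RMSECV(m+1)
            chosen = m                             # keep the largest such m
    return chosen, errors

# Placeholder data: 15 alloys x 6 PCA-derived parameters, with B as the target
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 6))
B = rng.normal(loc=160.0, scale=25.0, size=15)
m, errs = select_latent_variables(X, B)
print("chosen number of latent variables:", m)
```

With real calibration data, RMSEC would be computed analogously from the model fitted to all samples, and the two errors compared to confirm robustness.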
12.3 Informatics-Based Parameterization of the DOS Spectra

The analysis described here represents a general methodology, and can therefore adapt as additional systems are included. As additional systems are added, the spectral patterns, and subsequently the parameters, will change. This flexibility is one of the primary benefits of the approach, because it is robust enough to represent changes in the systems considered or a wider range of possible systems. The output from the PCA is then a set of spectral parameters for systems for which we know the bulk modulus, and spectral parameters for systems for which we do not. The parameters for the systems with known bulk modulus are input into the PLS approach, and a model linking the spectral patterns and bulk modulus is then developed. This model can then be used to predict the bulk modulus of the new systems as a function of their spectral parameters. The logic is demonstrated here for bulk modulus, but should be applicable to any electronic property.

The results of the PCA on the DOS spectra are provided in Figs. 12.2 (scores plot) and 12.3 (loadings spectra). In the scores plot, we are able to extract some trends correlating with the new axis system. The first is that PC2 captures the crystal structures of the alloys, with those having the L12 structure having positive PC2 values and those with negative PC2 values having a structure other than L12. PC3 captures subtleties in the DOS spectra correlated with chemistry, which we observe through Co- and Ni-containing alloys trending towards positive PC3 and those with Fe and Pd trending towards negative PC3. The physics captured by PC1 is harder to define, although it largely captures the relationship with d-electron valency, as those alloys containing elements without d-electrons (Al and Be) generally have lower PC1 values. Therefore, the PCA analysis is able to capture subtle variations in the DOS associated with crystal structure, chemistry, and valency.

Fig. 12.2 PCA scores mappings of the alloys based solely on a DOS input. These values correspond with matrix T in (12.1). From these mappings of the first three PCs, we identify trends corresponding with crystal structure, chemistry, and electron valency. Our information-based parameterization therefore captures variances related to these various factors, which will then be represented in the development of the QSPR. The axes of the plots are defined by the loadings plots of Fig. 12.3

The loadings plots mathematically define the axes of the scores plots, with each axis a sum of the values at each energy weighted according to the loadings values. Therefore, those features with larger loadings values more prominently define the axes. For example, in loadings spectrum 1, the DOS at the lowest energies has a negative correlation with the PC1 scores value, while increasing the DOS near EF increases the PC1 value, which is also relatively insensitive to changes at higher energy values, as seen by the loadings values being closer to zero. This loadings pattern fits with our interpretation of the PC1 axis, as those elements without d-orbitals will increase the DOS at lower energies and therefore decrease the PC1 scores value. Similar interpretation can be applied to the other PCs, as is shown for PC2 in Fig. 12.4. In this case, we identify the features in the DOS which promote the L12 structure.

Fig. 12.3 The three most significant spectral patterns for differentiating the DOS spectra. These spectra represent the first three rows of matrix P in (12.1). They define the axes of Fig. 12.2, and also define the physics associated with our parameterization, which will be connected to bulk modulus. Further, the features of the DOS can be correlated with material conditions, as is demonstrated in Fig. 12.4 for crystal structure

An issue when developing QSPRs is that we need the number of conditions (in this case unique chemistries and structures) to be well larger than the number of predictor variables (in this case the parameterization of the DOS spectra as represented through the PC scores values). Otherwise, the risk of over-fitting the model becomes high. To address this challenge, beyond employing the cross validation approach discussed in Sect. 12.2, we reduce the number of parameters included. The importance of each PC, as measured through variance, is listed in Table 12.2. We select six PCs to include, as this represents a dimensionality lower than the number of conditions (fifteen), while capturing greater than 90 % of the total variance in the DOS spectra.

Fig. 12.4 Correlating signatures in the DOS spectra with the L12 crystal structure. In Fig. 12.2, we identified PC2 as separating L12 from other structures, with L12 structures specifically having positive PC2 values. Increasing the DOS in the regions with positive loadings values increases the PC2 scores value, while increasing the DOS in the regions with negative PC2 loadings decreases the scores value. Therefore, we find that compounds with larger DOS values below and at the Fermi energy are more like L12, while those with higher DOS above EF are less correlated with the L12 structure
Table 12.2 Variance captured by each PC

PC    % Variance    % Total variance
1     62.6          62.6
2     11.2          73.8
3     6.4           80.1
4     4.6           84.8
5     4.3           89.0
6     2.8           91.8
7     2.1           93.9
8     1.9           95.8
9     1.3           97.1
10    0.8           98.0
11    0.8           98.8
12    0.5           99.3
13    0.4           99.7
14    0.2           99.9
15    0.1           100.0
16    0.0           100.0

In order to reduce the number of predictor variables used in developing our QSPR, we select only the first six PCs, which capture over 90 % of the information contained in the DOS spectra. We therefore include six parameters derived from the DOS in modeling the bulk modulus. The information represented by the other PCs is moved to the residual matrix (E in (12.1))

In (12.1), matrices T and P therefore have six dimensions each. The loadings spectra from the PCs after the first six are put into the residual matrix (E in (12.1)). As an example of the information in the residual matrix, the row corresponding to Ni3Al is shown in Fig. 12.5. This information is therefore further removed from the input spectrum shown in Fig. 12.1 for Ni3Al.

Fig. 12.5 Residual spectrum for Ni3Al. This spectrum represents the information captured in PC7 through PC16, which is not included in modeling the bulk modulus. By removing this residual signal, we have reduced noise and other information not contributing relevant information for modeling the material properties of Ni3Al. A comparable residual spectrum is calculated for every alloy system in the analysis

We have decomposed the DOS spectrum for each alloy into seven components (PC1 through PC6 and the residual spectrum). The components corresponding to relevant signal for Ni3Al are shown in Fig. 12.6, with each spectrum corresponding to a different PC. The sum of these components is then the row corresponding to Ni3Al in the matrix T·P^T in (12.1). This sum is also equal to the informatics input spectrum of Fig. 12.1 minus the residual spectrum of Fig. 12.5. The deconvolution into these six spectra is based on the processing of the DFT output, the PCA analysis, and the removal of the residual signal. These spectra are then used to determine the parameters used in extracting the modulus. Each parameter is calculated as the ratio of these patterns to the respective loadings spectrum. For example, T1·P1^T for Ni3Al divided by the PC1 loadings spectrum of Fig. 12.3 results in a value of −3.94. Similar logic is used for calculating the other five parameters for Ni3Al, and is also repeated for every alloy system.

Fig. 12.6 Ni3Al spectra from the first six PCs. The parameters used in the modeling are then these spectra divided by the respective loadings spectra for each PC. The scaling value is then the parameter for that PC. For example, the six parameters resulting from Ni3Al are −3.94, 3.43, 1.13, −3.16, −1.94 and 1.68. Parameters are computed in the same way for each alloy. The collection of these parameters is then the predictor matrix for extracting the bulk modulus
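One plausible reading of this parameterization step, sketched below with scikit-learn, is that each of the six parameters is simply the PCA score of the alloy on the corresponding loading, since dividing the component spectrum t_j p_j by the loading p_j returns t_j. The data and variable names are placeholders; this is our illustration, not the authors' code.

```python
# Minimal sketch (an assumed reading of the parameterization above): with
# orthonormal PCA loadings, the "ratio" of an alloy's j-th component spectrum
# (t_j * p_j) to the loadings spectrum p_j is the score t_j, i.e. the
# projection of the processed DOS row onto p_j.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(15, 1000)                  # placeholder processed DOS matrix
X -= X.mean(axis=0)                           # mean-centered, as in Fig. 12.1

pca = PCA(n_components=6, svd_solver="full")  # keep PC1-PC6; rest -> residual E
scores = pca.fit_transform(X)                 # matrix T: 15 alloys x 6 parameters
loadings = pca.components_                    # rows of P^T: 6 x 1000

# Six parameters for one alloy (row 0), obtained by projecting onto each p_j
alloy_parameters = X[0] @ loadings.T
print(np.allclose(alloy_parameters, scores[0]))   # True: projections = scores
```

In this reading, the 15 × 6 scores matrix T is exactly the predictor matrix referred to in the Fig. 12.6 caption.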
12.4 Identifying the Bulk Modulus Fingerprint

Based on the predictor matrix we developed as described in Sect. 12.3, we develop a QSPR for the bulk modulus using PLS. The PLS prediction is based on correlating the features of the DOS (as represented through T) with the bulk modulus. The output of the PLS model is then a coefficient matrix β and a constant C, such that the bulk modulus is defined as in (12.2), where the bulk modulus (B) of material i is a function of the product of the PLS coefficient and the scores value corresponding to each PC j. The correlation with the input mean-centered DOS curve, with the DOS input at every energy k, is then defined by (12.3).

B_i = Σ_{j=1}^{6} β_j T_{i,j} + C   (12.2)

B_i = Σ_{j=1}^{6} Σ_{k=1}^{1000} β_j X_{i,k} P_{j,k} + C   (12.3)

Based on (12.2) and (12.3), the terms of β were calculated as 1.09, −1.65, 1.27, −3.19, −2.77 and 2.48, in order of j from one to six. To test and ensure the robustness of the model, the cross-validation approach described in Sect. 12.2 was utilized. The result of the model, comparing the informatics-modeled B with that calculated via DFT, is shown in Fig. 12.7. The high accuracy of this approach shows that we are capturing the features in the DOS which control the bulk modulus. Based on a comparison of the magnitudes of the coefficients in β, PC1, which has the fewest features, impacts the bulk modulus the least. The other PCs, such as PC4, PC5 and PC6, which have higher weightings, also contain more features, as seen in Fig. 12.6. For Co3Al, the fourth and fifth patterns were most important (as determined by the value of the parameter times its coefficient) for determining the bulk modulus.

Fig. 12.7 The result of the hybrid approach for predicting bulk modulus from an input of alloy DOS spectra. The accuracy of the results demonstrates that bulk modulus is clearly represented within the DOS spectra, and that it can be quantitatively extracted via statistical learning

To extract the features which most impact the modulus, we utilize (12.4), which also highlights the hybrid approach, as the PLS- and PCA-derived components are summed together based on the component number, rather than using the two approaches individually. This converts the single modulus value into a spectrum with dimensionality equal to the number of unique energy intervals. The spectral values (B_{i,k}) are a measure of the contribution of the DOS at each energy to the bulk modulus. We therefore develop spectra which correlate the features of the DOS with the modulus, as the features with the largest intensity represent the signatures of the DOS which impact the modulus.

B_{i,k} = Σ_{j=1}^{6} β_j X_{i,k} P_{j,k}   (12.4)

Fig. 12.8 Identification of signatures of the DOS spectra for Ni-Al alloys in the L12 and B2 structures. The circled regions define the energies where the largest-magnitude features are. The corresponding features of the DOS spectra at those energies are then extracted. We notice that for both structures the bonding-antibonding transition is identified as a feature of the DOS corresponding to bulk modulus, and therefore we identify this transition as a signature of the Ni-Al alloys

The comparison between the weighting on the bulk modulus and the DOS spectra is possible as they have the same energy values. Therefore, we can trace the energy corresponding to the signature back to the original input spectra.
This is represented in Figs. 12.8 and 12.9, where we compared, respectively, the signature of Ni-Al alloys in different structures and L12 structures with different chemistries. In this way, we identify the signatures common to crystal structure and the features common to chemistry. The circles regions in these figures correspond to the highest intensity features in terms of contribution to the bulk modulus. The circled regions within the input DOS spectra are at the same energies and therefore identify the most important features of the DOS in terms of contribution to bulk modulus. In the case of changing chemistry, we identify similarities in terms of Ni-Al chemistries (structure-modulus relationship). We find that for both L12 and B2 structure, the bonding-anti-bonding transition is a prominent feature. Conversely, when comparing L12 structures but with different chemistries (chemistry-modulus relationships), we find that a doublet peak between larger peaks is identified in each case, with only the bonding-anti-bonding transition identified as a signature for Ni3 Al. We therefore have through this work identified two signatures of the DOS for Ni3 Al and used them to explain the differences in bulk moduli of alloys, as well as the differences in stability for alloys with similar electronic structures. We have also for Ni3 Al correlated one signature to structure and two signatures to chemistry. This result is summarized in Fig. 12.10. In the case of bonding-anti-bonding transition as 236 S.R. Broderick and K. Rajan Fig. 12.9 Identification of signatures of the DOS spectra of L12 structures for Ni3 Al and Co3 Al. The interpretation of this figure is the same as in Fig. 12.8. In this case, we identify the circled doublet peak in both, and therefore identify it as a signature of the L12 structure. This difference in features define the difference in stability for Co3 Al and Ni3 Al, which have similar electronic structures but very different stabilities in L12 structure Fig. 12.10 We identified two signatures of the DOS for Ni3 Al in the L12 structure which determine the bulk modulus. The first signature is the circled doublet peak which is due to the L12 structure, as described in Fig. 12.9. Further, the signature corresponding with the bonding-anti-bonding transition is due to the chemistry. The correlation with modulus, electronic signatures, crystal structure and chemistry provide a pathway for inverse design for chemical substitutions 12 Discovering Electronic Signatures for Phase Stability of Intermetallics … 237 a signature, this is not surprising as that is a factor in determining single element transition metal bulk modulus, as discussed in the introduction. However, the signature of the peaks correlating to L12 structure would not be identified otherwise. As discussed, engineering the intensity of these peaks leads to controlling the bulk modulus of the material. When combined with our prior work in connecting structure and chemistry to the DOS, we now have a template for multi-scale “inverse design” of new alloys. 12.5 Summary In this paper, we developed a hybrid informatics approach for extracting bulk modulus from the DOS spectra and identifying subtle features in the DOS spectra which dictate differences in stability of electronically similar alloys. 
By connecting property and DOS spectra, we can now develop "virtual" DOS which correspond to a target property, thereby representing an inverse design approach in which we start with the property and calculate the material, as opposed to the traditional approach. The approach developed here first extracted parameters based on the comparison of the DOS spectra with the signals corresponding to material characteristics. The modulus was then modeled based on the quantitative relationship between the spectral weightings and the property, thus developing electronic structure-crystal structure-property relationships. The natural extension of this work is predicting the influence of alloying additions on the DOS and the use of our approach as a means of searching for stability of multicomponent systems without performing large numbers of first principles calculations, as well as rapidly exploring the role of rare earth additions compared to non-rare-earth additions in terms of electronic structure.

Acknowledgments We acknowledge support from NSF grant no. DMR-13-07811 and Air Force Office of Scientific Research grant no. FA9550-12-1-0456. KR acknowledges support from the Erich Bloch Endowed Chair-University at Buffalo: The State University of New York.
Part III Combinatorial Materials Science with High-throughput Measurements and Analysis Chapter 13 Combinatorial Materials Science, and a Perspective on Challenges in Data Acquisition, Analysis and Presentation Robert C. Pullar Abstract Combinatorial Materials Science is the rapid synthesis and analysis of large numbers of compositions in parallel, created through many combinations of a relatively small number of starting materials. It is, therefore, essential that for a truly combinatorial approach both synthesis and measurement must be high-throughput, to handle the large number of samples required. Since the first serious attempts at combinatorial searches in Materials Science in the mid 1990s, the technique is still very much in its infancy, falling way behind the progress made in biomedical and organic combinatorial chemistry, despite attracting increasing interest from industry. The most investigated materials by combinatorial methods are catalysts and phosphors, and most work has been on libraries in deposited thin film form. This chapter will give a broad overview of the different synthetic strategies used, with a particular look at the difficulties of producing thick film or bulk ceramic/metal-oxide libraries. A vast number of characteristics can be quantified in combinatorial materials libraries, from compositional, crystal phase, structural and microstructural information, to functional properties including catalytic/photocatalytic, optical/luminescent, electrical/dielectric, piezoelectric/ferroelectric, magnetic, oxygen-conducting, watersplitting, mechanical, thermal/thermoelectric, magnetoelectric/optoelectric/magnetooptic/multiferroic, bioactive/biocompatible, etc. This chapter will cover the range of high-throughput measurements open in combinatorial Materials Science, and especially the challenges in presenting and displaying the large and complex amount of data obtained for functional materials libraries. To this end, the use of glyphs is looked at, glyphs being data points that also contain extra levels of information/data in graphic form. R.C. Pullar (B) Departamento de Engenharia de Materiais e Cerâmica/CICECO - Aveiro Institute of Materials, Universidade de Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal e-mail: rpullar@ua.pt © Springer International Publishing Switzerland 2016 T. Lookman et al. (eds.), Information Science for Materials Discovery and Design, Springer Series in Materials Science 225, DOI 10.1007/978-3-319-23871-5_13 241 242 R.C. Pullar 13.1 Combinatorial Materials Science—20 Years of Progress? Combinatorial materials science is the rapid synthesis and analysis of large numbers of compositions in parallel, created through many combinations of a small number of starting materials. It is, therefore, essential for a truly combinatorial approach that both synthesis and measurement must be high-throughput, to handle the large number of samples required. Combinatorial searching was initiated in the 1960s for the solid-phase synthesis of peptides by Merrifield [1] (Fig. 13.1), who later won a Nobel prize for this, but it took until the 1990s for industry to adopt this technique, which is now deemed essential in the pharmaceutical industry, where both sample preparation and analysis are carried out by robots. The first compositional gradients studied were those naturally occurring in codeposited thin films to construct alloy phase diagrams [2]. 
In 1970, Hanak proposed his 'multiple sample concept' in the Journal of Materials Science as a way around the traditional, slow, manual laboratory preparation procedures used to make samples for testing [3]. Robotic search methods for cuprate superconductor ceramics were explored by the GEC Hirst Laboratories (Wembley, UK) in the early 1990s [4], and a series of combinatorial searches in Materials Science were carried out in 1995 by Xiang, Schultz et al. [5], on a 128 sample combinatorial library of luminescent materials obtained by co-deposition of elements on a silicon substrate. This milestone paper was published in Science, and had a picture of the combinatorial library on the cover (Fig. 13.2). Since then, interest in combinatorial materials science searches has increased greatly over the last 20 years, to the extent that there are now regular conferences on this specific field [6, 7]. The existence of a whole journal dedicated to Combinatorial Chemistry since 1999 (now renamed ACS Combinatorial Science), and special issues of Measurement Science and Technology on combinatorial materials science (e.g. 2005, vol. 16, issue 1), show how the field is growing (Fig. 13.3). There have also been several high-profile review papers on combinatorial methods [8] and high-throughput analysis [9, 10].

Fig. 13.1 The first true combinatorial synthesis, created by R.B. Merrifield, for the high-throughput parallel synthesis of peptides [1]

Fig. 13.2 The cover of the Science issue in 1995 containing the paper by Xiang and Schultz, with a photograph of a section of the luminescent thin film combinatorial library [5]

Using data from Scopus, it can be seen that the number of combinatorial and high-throughput papers is steadily increasing each year, but that progress in Materials Science is clearly lagging behind severely (Fig. 13.4). Indeed, if all the publications on combinatorial and high-throughput topics are broken down into their Scopus subject areas (Fig. 13.4), it can be seen that a quarter are in Biochemistry, and another quarter in Medicine or Engineering, indicating the dominance of the biomedical sector in this field. Materials Science accounts for only 4 % of all combinatorial and high-throughput articles over this period, and the situation is not rapidly improving: looking at 2014 only, Materials Science is still in last place of all these categories, with only 6 %. Breaking down research into general combinatorial and high-throughput topics by country, it can be seen that the USA dominates hugely, producing over a third of all papers, but that a rapidly industrialising China is now in second place, ahead of the UK, Germany, Japan and France (Fig. 13.5). Looking at the institutions that have produced the most articles, all are in North America except for the University of Cambridge (UK) in 5th place, and the University of Tokyo in 14th place (Fig. 13.5).

Fig. 13.3 A selection of journals and special issues devoted to combinatorial synthesis

However, if this data is analysed only for articles related to Materials Science (Fig. 13.6), it paints a different picture, with Japan now in a clear second place to the USA, which dominates even more, and four Japanese institutions in the top ten, including first (Tokyo Institute of Technology), third (National Institute for Materials Science, Tsukuba) and fourth (Japan Science and Technology Agency) places.
In second position is the National Institute of Standards and Technology (NIST, USA), which has initiated a very large research programme into Combinatorial Materials Science. Industry is already heavily involved in the development of synthesis techniques, and in the development and automation of measurements, suitable for combinatorial searches; indeed, it should be noted in Fig. 13.5 that the biomedical company Pfizer is in 13th place. Major companies investing in such research also include Hitachi, General Electric, Kodak, Ciba, Hoechst, Bosch, Bayer, BF Goodrich, Siemens, Dow, Engelhard, DuPont, L'Oreal, ICI, PPG, Unilever, Procter & Gamble, Intel, Heraeus, Alcoa, Celanese, Rhodia, Shell, ExxonMobil, Volkswagen, Honeywell, Degussa, Akzo Nobel, Lucent Technologies (Bell Labs) and BASF [11], and Kurt J. Lesker Co. have developed commercial combinatorial PLD (Pulsed Laser Deposition) systems. However, combinatorial and high-throughput methods for materials science are still in their infancy. The main activity is in the USA and Japan, with the leading countries in the EU being Germany and the UK, reflecting the output of academic papers shown above. A search on Scopus revealed a total of 17,800 patents on combinatorial and high-throughput synthesis, but only 1000 of these were related to materials science.

Fig. 13.4 Yearly publications on combinatorial and high-throughput topics since 1995, and, below, publications on combinatorial or high-throughput topics per Subject Area over this period (data from Scopus). Searches were for (TITLE-ABS-KEY(combinatorial OR "high throughput")), (TITLE-ABS-KEY("high throughput")) and (TITLE-ABS-KEY(combinatorial OR "high throughput") AND TITLE-ABS-KEY("materials science" OR ceramic OR composite OR film OR sol-gel)), respectively, with results classified under (SUBJAREA, "MATH") excluded

Fig. 13.5 The countries and institutions that have published the most articles on general combinatorial and high-throughput topics between 1995–2014 (data from Scopus); the countries shown are, in order, the USA, China, UK, Germany, Japan, France, Canada, Italy, India, South Korea, Australia, Switzerland, Spain, Netherlands, Taiwan, Sweden and the rest of the world

Fig. 13.6 The countries and institutions that have published the most articles on combinatorial and high-throughput Materials Science between 1995–2014 (data from Scopus)

13.2 Combinatorial Materials Synthesis

Much current high-throughput combinatorial research is focused on biotechnology and biological systems [12]. However, here I shall only look at the state of the art in the Materials Science of metals, oxides and ceramics. To date, most such combinatorial high-throughput methods use thin films, deposited on the nanoscale by various methods.
If we break down the combinatorial and high-throughput Materials Science papers by type of material investigated, we can see that the vast majority are on thin films and/or nanoparticles and nanosynthesis, usually by deposition (Fig. 13.7), and that very few are on bulk or thick film ceramics, despite the fact that such materials can have completely different properties and applications from their thin film/nanoscale analogues. In fact, it can be seen that research into such materials peaked in 2008, while the other categories have continued to increase, demonstrating both the difficulty of, and the need for, developing combinatorial techniques for bulk, sintered ceramics.

Fig. 13.7 Yearly publications on combinatorial and high-throughput Materials Science since 1995, for the named topics (data from Scopus)

Several high-throughput thin-film synthesis techniques have been developed for exploring new compositions, as well as for optimising the process parameters of materials. Methods to prepare different types of combinatorial thin-film libraries include discrete sequentially masked depositions [13] or composition spread co-deposition [14] by molecular beam epitaxy (MBE), pulsed laser deposition (PLD), liquid source misted chemical deposition (LSMCD) [15], composition-gradient molecular layer epitaxy [16], ion beam sputtering deposition (IBSD) and chemical vapour deposition (CVD) [17]. All of these techniques tend to result in libraries with uneven thickness and stoichiometry, but they allow for easy mapping of the structural changes or functional properties. MBE, or combinatorial laser MBE (CLMBE), uses a mask pattern designed on a computer with a masked carrousel, evaporating several targets with a laser to deposit epitaxial layers on the substrate, with variations in relative stoichiometry across the library (Fig. 13.8). MBE can be more accurate than PLD, as it involves monolayer epitaxial growth, but this naturally makes it very slow, which is not ideal for high-throughput synthesis. Other deposition methods such as PLD, CVD and IBSD are quicker, and use similar masking effects or shutters (Fig. 13.9), but result in more variable compositions.

Fig. 13.8 Combinatorial laser molecular beam epitaxy (CLMBE) process: a Mask designed on computer; b compositional spread created on substrate one layer at a time using masks; c rapid high-throughput analysis possible, such as luminescence under UV light; d photograph of Tb1−x−y Scx Pry Ca4 O (BO3)3 library under 254 nm UV excitation; e emission intensity map of the same ternary thin film library of Tb1−x−y Scx Pry Ca4 O (BO3)3 [18]

Fig. 13.9 The pulsed laser deposition (PLD) combinatorial film deposition process

Combinatorial methods which can be applied to bulk or thick film ceramics usually use either high-throughput synthesis of powders [19] (Fig. 13.10) or ink-jet printing methods [20] (Fig. 13.11). Much of the work on combinatorial powder synthesis involves only the production of a library of powders with a compositional
spread, with no subsequent high-throughput processing, e.g. autopipetting of sols to produce small amounts of combinatorial powders [21], powder metallurgy using acoustic vibration valves to dispense powders [22], solution combustion synthesis of combinatorial libraries of photocatalytic perovskites in microwells [23], or continuous hydrothermal synthesis of combinatorial libraries of oxide powders from nanoparticle suspensions [24, 25] (Fig. 13.10). Some workers also incorporate a high-throughput processing method with synthesis, e.g. a combinatorial robot system for measuring, mixing and moulding liquid samples by automatic micropipette to produce a library for ceramics on a pallet [26], or robotic dosing and planetary ball milling of 40 different samples with parallel pressing of 5 samples at a time [27].

Fig. 13.10 Combinatorial powder library of doped TiO2 made by continuous hydrothermal synthesis [24], and SEM images of a 48 sample library of perovskite powders also produced via a hydrothermal batch process [19]

The ink-jet printing process creates a thick film library already laid out on a substrate ready for processing [28, 29], and this method has been particularly successful in the discovery of new phosphors [30, 31], such as the 121 sample library of the K(Sr1−x−y)PO4:Tb3+x Eu2+y UV phosphor system or the red Y2O3-based phosphor libraries created by Chan et al. [31, 32] (Fig. 13.11).

Fig. 13.11 Schematic diagram of the combinatorial inkjet printing process, and libraries thus created of red Y2−x−y Eux Biy O3 and blue K(Sr1−x−y)PO4:Tb3+x Eu2+y phosphors under UV light [31, 32]

The author, R. C. Pullar, was part of the Functional Oxides Discovery using Combinatorial Methods (FOXD) project, using the London University Search Instrument (LUSI) robot to make sintered combinatorial libraries of ceramic compositions. LUSI automatically created sintered bulk ceramic libraries by ink-jet printing multicomponent mixtures on substrates, robotically loading the libraries into a flatbed 4-zone furnace for firing (with up to 100 °C difference between each zone) and unloading the samples, and could also place them on a test bed for measurement [33]. The ceramics under investigation in the FOXD project were dielectrics, ferroelectrics [34–36] and ionic conductors [37], and libraries were made and characterised (Fig. 13.12).

Fig. 13.12 The LUSI robot [33], a printed and sintered Ba1−x Srx TiO3 library on a single 50 mm long substrate (each dot 1–2 mm wide), SEM images of the sintered library, EDS measurements showing the variation in composition across the library, and dielectric measurements showing the functional gradient in Curie temperature across the library [34, 35]

Other workers have also made and characterised libraries of thin film dielectric ceramics, such as the ternary oxide ZrOx-SnOy-TiOz [14], microwave dielectrics [38], high εr (50–80) HfO2-TiO2-Y2O3 dielectrics [39], sol-gel piezoelectrics [40], and a 64 sample LSMCD ferroelectric Bi3.75 Lax Ce0.25−x Ti3 O12 library [15]. Bulk ceramic piezoelectrics have also been studied via combinatorial methods [41].
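The layout of such compositional-spread libraries is easy to enumerate in software. The sketch below is purely illustrative and not taken from the FOXD/LUSI workflow: it simply lists the nominal compositions of a ternary library (such as a Bax Sry Caz TiO3 system) on a regular grid; the 10 % step size is an assumed example value.

```python
from itertools import product

def ternary_grid(step=0.1):
    """Nominal compositions (x, y, z) with x + y + z = 1 on a regular grid.

    'step' is a hypothetical compositional increment (10 % here), not a value
    taken from the chapter; real libraries may use other layouts.
    """
    n = round(1.0 / step)
    grid = []
    for i, j in product(range(n + 1), repeat=2):
        k = n - i - j
        if k >= 0:
            grid.append((i * step, j * step, k * step))
    return grid

compositions = ternary_grid(0.1)
print(len(compositions))   # 66 nominal compositions at 10 % steps
```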
Many other ceramics have been investigated, such as high-throughput analysis of semi-conductors [42], combinatorial physical vapour deposition of gold nanoparticles [43], sol-gel metal oxide nanoparticles [26], SOFC ceramics [24], catalysts [44], pigments [45], gas sensor materials [46], electrochemical electrodes [47] and hydrogen energy storage materials [48]. Perhaps the best known new material discovered via combinatorial methods is the Co-doped TiO2 dilute magnetic semiconductor [49], discovered by chance in a combinatorial search of 162 thin film photocatalyst candidate materials [16], and leading to an explosion of interest in such materials. In his review [10], Zhao lists 23 new materials successfully discovered by combinatorial high-throughput searches, and examples include Zr0.2 Sn0.2 Ti0.6 O2 dielectrics from libraries of 30 multicomponent systems [50], high εr microwave dielectrics [14, 39, 51], cobalt oxide magnetoresistance materials [52], hydrogen storage candidates [48], novel photocatalysts [23], and improved catalysts from a library of thousands of samples [44]. In their analysis of new generation capacitance materials for random access memory devices, replacing amorphous silica with optimised materials based on ZrO2-SnO2-TiO2, Koinuma and Takeuchi [8] suggest that 900 one-by-one sputtering preparations would have been required to fully explore the combinations of the ternary Zr-Sn-Ti oxide system prepared in the compositional spread by van Dover et al. [50].

Most combinatorial materials searches involve thin film techniques, and for some applications where the end product will be exclusively in thin film form, this does make sense. However, many materials are also required in bulk form, and the bulk properties can be quite different to those of thin films, where surface diffusion, strain effects from substrate-lattice mismatch, and surface and skin electrical effects dominate. For example, ferroelectric functions are highly dependent upon strain effects in thin films. Also, most thin films are epitaxial or single crystal, and hence have no grain boundaries, which can have a large effect on electrical, magnetic, dielectric, mechanical and transport properties. From the point of view of constructing large materials properties databases for data mining and prediction of novel compositions, it could be argued that bulk properties are much more relevant than those of thin films. Furthermore, for many applications bulk or thick film ceramics are required, e.g. multilayer chip capacitors, low temperature co-fired ceramics (LTCC), structural and engineering ceramics, refractories, clays, glazes and household ceramics, SOFC and ionic conductors, electromagnetic and radar absorbing materials (RAM), catalyst supports, substrates, etc. As discussed above, all current bulk ceramic combinatorial projects either just make a combinatorial library of powders through a high-throughput synthesis process (e.g. hydrothermal) [24], or they use a solution based process to deposit or print a library on a substrate (e.g. ink-jet printing) [29]. In the first case, there is no high-throughput processing, and each sample in the library must be individually prepared (e.g. pressed in a die) from the powder, usually by hand. In the second case, the solution chemistry, stability during a printing run, drying in a regular shape, and reaction with, or lack of adherence to, the substrate, become serious issues, especially in complex multi-component systems.
Reactions with, or lack of adherence to, substrates are also an issue for all ink-jet based combinatorial processes, as they need a substrate that can be both printed on and heated during sintering [28]. It was Dr. Pullar's experiences with ink-jet printing that led him to consider an alternative solution, which gave the benefits of a bulk or thick film ceramic library, but in a much simpler, novel adapted tape casting process. Using a minimum of solvent, the combinatorial components could be mixed with a commercial mixer tip, designed for mixing adhesives, polymers and dental cements. Unlike in a solution based process, much less volume is lost on drying, leading to a denser green body that should produce dense ceramics, and no segregation or precipitation effects should occur. The libraries can be made either on a substrate or on a release tape which can be removed before firing, avoiding substrate problems if necessary. As sintering is often the rate limiting step in combinatorial ceramics, a multiple zone furnace was used to simultaneously fire five libraries at different temperatures. This technique has been used to create sintered libraries of magnetoelectric SrFe12O19/BaTiO3 composite ceramics, in compositional steps of 10 %, in which the two phases, one magnetic and the other dielectric, did not react, maintaining their respective characters (Fig. 13.13) [53].

Fig. 13.13 A photograph of a sintered, bulk SrFe12O19/BaTiO3 library (with compositional ratios along the library from 9:1 to 1:9 for SrM:BT); a diagram of the parallel high-throughput firing process; SEM images of the microstructure of the library; EDS spectra of compositional variation along the library; magnetic measurements showing the functional variation in magnetisation (Ms) along the library [53]

13.3 High-Throughput Measurement and Analysis

The importance of combinatorial high-throughput materials science has been clearly shown above, and although it is still in its infancy, the fact that so many industries are investing in developing such techniques demonstrates their belief in its future significance. Once established, its impact on Materials Science will be enormous, as it has been on the pharmaceutical industry and, increasingly, on biomedicine and biochemistry. However, to be successful, combinatorial materials synthesis also requires high-throughput measurement. The diverse spectrum of functionalities in materials represents a significant challenge in high-throughput characterisation, and often involves the development of novel measurement methods [9]. Zhao's review paper [10] is a good overview of the techniques available for combinatorial high-throughput analysis, although it does concentrate on the micro- and nanoscale, which is by no means all that is of interest. It must be understood that the aim of characterisation in combinatorial science is a broad brush mapping or analysis of the sample to show trends and unexpected or complementary properties, not a precise measurement; that can come later on materials of interest. The properties of combinatorial libraries can be measured as a function of composition to give a functional gradient, which can also vary with processing conditions between identical libraries processed differently. Properties that can be investigated include:
• Composition and phase purity/solid solutions/lattice parameters/crystal structure
• Microstructure/density/porosity/grain boundaries and segregation
• Mechanical properties: hardness, elastic modulus, stress/strain, etc.
• Electrical properties: conductivity, superconductivity, ionic conduction, oxygen vacancies
• Dielectric properties: permittivity, ferroelectricity, piezoelectricity, capacitance, Curie points
• Magnetic properties: domains, magnetisation, hysteresis loops, Curie points
• Optical properties: electro-optics, magneto-optics, luminescence/fluorescence
• Thermal properties: thermal conductivity, thermoelectrics, thermal creep, thermal expansion
• Multiferroics: multiferroic and magnetoelectric coupling, responses to direct and indirect stimuli
• Chemical reactions: catalysis, selectivity, redox reactions, fuel cells, water splitting/H2 production
• Band gaps: photocatalysis, solar energy, semiconductivity, smart materials, etc.
• Biomaterials: antibacterials, human compatibility, biological markers, etc.

Many of these parameters can also vary with changes in measurement temperature, pH, wavelength of light, or applied electrical or magnetic fields, adding yet another layer of complexity, and creating yet more data. The most basic tools for characterising or mapping the composition and phases present are XRD (x-ray diffraction) and EDS (energy dispersive spectroscopy, also known as EDX, energy dispersive x-ray analysis). Use of Real Time Multiple Strip (RTMS) XRD detectors, such as the PANalytical PIXcel range or Shimadzu OneSight, has become essential for the rapid high-throughput structural characterisation of combinatorial libraries, by greatly speeding up the measurement time with little or no loss of resolution, meaning that scans that would normally take hours can instead be carried out in a few minutes (Fig. 13.14). As well as identifying the phases present, XRD also gives structural information, lattice parameters, etc., and the coefficient of thermal expansion (CTE) can be evaluated from changes in lattice parameters with temperature [10]. EDS is another very rapid technique, in which measurements take a few minutes, which can be used to identify the elements present in a scanning electron microscope (SEM) image (see Fig. 13.13), and can also map the distribution of those elements. While doing EDS, SEM images can also be rapidly taken, to study microstructure, porosity, sintering, liquid phases, phase/grain boundaries, etc., and Electron BackScatter Diffraction (EBSD) is also a useful tool for identifying crystal structures in reference to a library of known structures. Many points can be measured in a minute on polished samples, and it can also measure changes in orientation in anisotropic samples. Scanning probe techniques such as Atomic Force Microscopy (AFM), Piezoresponse Force Microscopy (PFM) and Magnetic Force Microscopy (MFM) are often collectively called Scanning Probe Microscopy (SPM, Fig. 13.15), and are ideal tools for high-throughput mapping of the functional gradients of combinatorial libraries, and can map a library in minutes.
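As a small worked example of the CTE evaluation mentioned above: the linear CTE can be estimated as (1/a0)(da/dT) from a straight-line fit of lattice parameter against temperature. The numbers below are hypothetical, not measurements from any of the libraries discussed here.

```python
import numpy as np

# Hypothetical lattice parameters a(T) (in angstroms) extracted from
# variable-temperature XRD scans of one library member.
T = np.array([300.0, 400.0, 500.0, 600.0, 700.0])       # K
a = np.array([3.9050, 3.9093, 3.9137, 3.9181, 3.9226])  # angstroms

# Linear fit a(T) ~ a0 + (da/dT)*T; the linear CTE is (1/a0)*(da/dT),
# with a0 taken at the lowest-temperature point as the reference.
slope, intercept = np.polyfit(T, a, 1)
alpha = slope / a[0]
print(f"CTE ~ {alpha:.2e} per K")   # of order 1e-5 per K for these numbers
```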
PFM can show piezoelectric grains and domains, and with measurements before and after poling, hysteresis loops and piezo coefficients (d33) can be measured. MFM can show magnetic domain structure, and with an external magnetic field it can show hysteresis loops and coercivity (Hc) values for each point measured, but it cannot be used to measure magnetic moments or saturation magnetisation (Ms) values, as it is measuring the remnant magnetisation. Nanoindentation can be carried out by the AFM tip to give mechanical properties such as hardness and elastic modulus. A related technique is Atomic Force Acoustic/ultrasonic Microscopy (AFAM), which vibrates the AFM cantilever in contact mode, the change in resonant frequencies giving information about stiffness and local elastic constants.

Fig. 13.14 RTMS XRD detectors, which work by simultaneously measuring over a range of angles, greatly speed up XRD measurements of combinatorial libraries. The measurements below, taken in only 2 min each by the author over a range of 20–70°, clearly show the change in structure from tetragonal BaTiO3 to orthorhombic CaTiO3 across a bulk ceramic Ba1−x Cax TiO3 library [36]

Fig. 13.15 Scanning Probe Microscopy (SPM) techniques useful for high-throughput combinatorial libraries: Top, Magnetic Force Microscopy (MFM) can measure magnetic properties and map magnetic domains, and Piezoresponse Force Microscopy (PFM) can measure piezoelectric hysteresis loops at different spots on a library, and map piezoelectric domains. Bottom, the Evanescent Microwave Probe (EMP), or Scanning Evanescent Microwave Microscope (SEMM), can map dielectric properties over a library, such as permittivity (εr) or dielectric loss (tan δ), as shown by the tan δ maps of a Ba1−x Srx TiO3 thin film library with varying growth temperature, and the εr and tan δ maps of a ternary Ba1−x−y Srx Cay TiO3 thin film library [54]

The Evanescent Microwave Probe (EMP), or Scanning Evanescent Microwave Microscope (SEMM), is a kind of SPM that measures the change in dielectric properties of a metal tip embedded in a microwave resonator just above, or in contact with, the surface [54]. Interaction with the sample changes the resonant frequency and dielectric loss of the resonator, and from this the electrical conductivity, permittivity (εr) and quality factor (Q) of the sample can be calculated and mapped in minutes; although accurate quantitative measurements are problematic, EMP has been used on combinatorial libraries [54] (Fig. 13.15). Electrical conductivity measurements are very important in combinatorial searches for insulators, superconductors, semiconductors and thermoelectrics, and also in dielectrics and ferroelectrics, along with permittivity and dielectric loss (tan δ, Q ≈ 1/tan δ). Bulk samples and thick films can be analysed by a simple capacitance method, if top and bottom electrodes are applied, to measure all of these values quickly, and over a range of temperatures with longer runs to give Curie points (Tc) and ferroelectric/relaxor behaviour.
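For the capacitance method, the conversion from measured capacitance to relative permittivity is just the parallel-plate relation εr = C·t/(ε0·A). The sketch below is a generic illustration under assumed values (electrode area, pellet thickness and the capacitance-temperature curve are all made up); the Curie point is taken simply as the temperature of the permittivity maximum, which is a common first approximation rather than the author's specific analysis.

```python
import numpy as np

EPS0 = 8.854e-12                    # F/m, vacuum permittivity

def relative_permittivity(C_farads, thickness_m, area_m2):
    """Parallel-plate estimate eps_r = C*t / (eps0*A) for an electroded pellet."""
    return C_farads * thickness_m / (EPS0 * area_m2)

# Hypothetical data for one library member: capacitance vs temperature,
# with a made-up peak near 390 K.
T = np.linspace(150, 450, 13)                              # K
C = 1e-12 * (20.0 + 40.0 * np.exp(-((T - 390.0) / 40.0)**2))  # F

# Assumed geometry: 1 mm thick dot, 2 mm diameter electrodes.
eps_r = relative_permittivity(C, thickness_m=1e-3, area_m2=np.pi * (1e-3)**2)
Tc_estimate = T[np.argmax(eps_r)]      # Curie point ~ permittivity maximum
print(f"eps_r range {eps_r.min():.0f}-{eps_r.max():.0f}, Tc ~ {Tc_estimate:.0f} K")
```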
The author has used such a method to simultaneously measure multiple points in bulk dielectric ceramic combinatorial libraries between 150 and 450 K, clearly showing Tc and phase transitions [35, 36]. An 8 mm², 64-electrode array for high-throughput impedance spectroscopy, suitable for dielectrics, semiconductors and electrocatalysts, has also been developed [55] and used to test for gas sensing abilities across a library, and εr and leakage current have also been mapped from capacitance-voltage (C-V) and current-voltage (I-V) measurements of ferroelectric PLD thin film libraries [39] (Fig. 13.16).

Fig. 13.16 Example of a multiple electrode array used for high-throughput electrical measurements of libraries, and the map of permittivity over a ternary Ti-Hf-Y oxide thin film library [39]

Fig. 13.17 General schematic diagram for the high-throughput optical measurement of combinatorial libraries, where light could be of various wavelengths (UV, visible, laser) and many kinds of spectrometer could be applied, and optical measurements of transmittance and band gap at UV wavelengths for a Zn1−x Mgx O thin film library

Optical techniques such as FTIR, Raman and UV-Vis spectroscopy, colourimetry, cathodoluminescence (CL) and photoluminescence (PL) are clearly suitable for high-throughput analysis and mapping (Fig. 13.17), needing short measurement times and measuring only the spot in the beam at one time. The first two techniques can give information on chemical bonding, phonon modes (and dielectric loss), polarisation, 2D spectra with changes over time/environment/temperature, and fingerprints of molecules and structures. The others have been used in combinatorial searches for pigments, phosphors, diodes and display materials. CCD cameras have also been used to measure the output/absorption of combinatorial libraries [10, 56], and spatially resolved infrared imaging has been used as a high-throughput hydrogen storage candidate screening technique [48].

Fig. 13.18 Left SMOKE map of the Co-Fe-Ni ternary system, showing the magnetic hysteresis loop extracted from just one of the pixels/data points [10]. Right scanning SQUID images of La1−x Cax MnO3 taken at 7 K, showing the magnetic domains and the transition from strong to weak magnetization [57]

Magnetic techniques other than MFM that are of great interest, and need more development, are Scanning Magneto-Optical Kerr Effect (SMOKE) probes and scanning SQUID microscopy (Fig. 13.18). SMOKE can measure Ms if an external magnetic field is applied, and a hysteresis loop and coercivity data can be extracted from every pixel on the combinatorial map [10], but it is difficult to measure oxides of low electrical conductivity. Scanning SQUIDs have also been used to map combinatorial thin films [57], and a recent development is one that can operate at room temperature, although they cannot use an external field, and therefore only measure remnant magnetisation and are not currently capable of quantitative measurements. Thermal conductivity can be mapped via the change in thermoreflectance of a sample, heated by a femto-second pulsed laser, which has been coated with an Al film to absorb the 770 nm Ti:sapphire laser [10]. Mass spectroscopy (MS) has been used a lot in high-throughput catalytic analysis [44, 58], and a robot system measuring with an electrode array to form 16 electrochemical cells has been used in combinatorial searches for new electrode materials [47].
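Extracting loop parameters from every pixel of a SMOKE or MFM-style map is a simple post-processing step. The sketch below is a generic illustration on a hypothetical single-pixel loop branch, not code from any of the cited instruments; Ms is taken as the maximum |M| (which assumes the branch reaches saturation) and the remanence and coercive field come from linear interpolation.

```python
import numpy as np

def loop_parameters(H, M):
    """Simple figures of merit from one ascending branch of an M(H) loop.

    Returns (Ms, Mr, Hc): saturation magnetisation taken as max |M|,
    remanence M(H = 0), and the magnitude of the field at which M crosses zero.
    """
    Ms = np.max(np.abs(M))
    Mr = np.interp(0.0, H, M)                  # M at zero applied field
    if np.any(M < 0) and np.any(M > 0):
        Hc = abs(np.interp(0.0, M, H))         # field where M crosses zero
    else:
        Hc = np.nan
    return Ms, Mr, Hc

# Hypothetical single-pixel loop branch (arbitrary units), offset so Hc ~ 0.1.
H = np.linspace(-1.0, 1.0, 201)
M = np.tanh(8.0 * (H + 0.1))
print(loop_parameters(H, M))
```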
Catalysis is probably the field where the most progress has been made in combinatorial Materials Science, driven by industry and the relative ease of high-throughput measurement, and such synthesis and measurement systems are quite well established now, with automatic data extraction and analysis [58] (Fig. 13.19). IR-sensography has been developed, where an IR-camera acts as an external optical detector system for sensor libraries, detecting small temperature changes due to physisorption or chemisorption [59], and high-throughput impedance screening has been used on gas-sensing materials in variable atmospheres and temperatures [55]. There has been a lot of development on the high-throughput search for electrochemical and gas sensor materials, including a 16 sample SnO2-based gas sensor library for testing as an electronic nose, and a complete high-throughput assembly consisting of a 64 sample reactor for the sensor libraries (Fig. 13.20), with IR-cameras, a switching multimeter for dc-resistance and impedance measurements, a test gas supply array for different test gases, and software for control of experimental flow, data recording, data evaluation, data mining and a database [46, 60]. Ambitious projects like this are where the eventual future of combinatorial materials science lies: in fully integrated, automated high-throughput synthesis and measurement systems, with artificial intelligence (AI) driven control, analysis and data mining [61].

Fig. 13.19 High-throughput set up for discovery of catalysts, where the same probe is used to both deposit the samples and measure the library after processing. a A schematic diagram of the setup and a photograph of the 207 sample library produced, b measured screening results for catalysis across the library and c visualisation of the results on the layout of the sample [58]

Fig. 13.20 Schematic diagram of the set up for the combinatorial synthesis and high-throughput measurement of 64 sample gas sensor libraries, the multi-electrode array used for the library, and measured results showing the Argand plots of impedance, and sensitivity to 25 ppm of H2 (S, height of bar) with measuring temperature (each progressive bar, at 250, 300, 350 and 400 °C) for all 54 samples in the library [60]

13.4 Data Analysis and Presentation

If we are going to generate large amounts of data, we also need to be able to analyse it, understand it, and interpret it in a way which is comprehensible. In an ideal combinatorial system, the flow would be as shown in Fig. 13.21, with synthesis, processing and measurement of libraries all carried out by a single robot, which would then feed the data to a database, which could be data mined and used to predict the next likely candidate systems to be investigated based on all results so far, feeding back to the next step in the synthesis process. In reality, combinatorial Materials Science is still a long way from such an automated feedback process, although much progress is being made on data mining using various forms of statistical analysis, AI and evolutionary software [62–65].

Fig. 13.21 The ideal combinatorial synthesis, processing, measurement and analysis set-up (flow: choice of starting materials → synthesis of compositional step/gradient libraries → high-throughput sintering/processing → automated measurement of libraries → results to database and data mining → AI search and prediction neural network; all carried out by a single robot)

Other authors will deal with this topic, but also of great importance is how we can interpret and present such a multitude of results and data to a human audience, in such a way that it can be easily comprehended. The number of degrees of freedom in a simple binary component system is staggering: in A1−x Bx we have x compositions, which could be processed for various temperatures, periods or pressures/atmospheres, and we want to show the evolution/existence of crystalline phases, perhaps details on microstructure, and functional properties, ideally all in a single image or graphic. Clearly this is quite a challenge, and with ternary A1−x−y Bx Cy systems it becomes even more so. Triangular phase diagram maps can be used to show variation in composition with position, and variation in a property with a change in colour or contrast (Fig. 13.22). However, problems arise when we want to see the effects on various properties, or a plot of data, for each data point. A very interesting overview of the various possibilities for high-throughput analysis, and examples of unusual ways to display those results, is given in the review by Potyrailo et al. [66].

Fig. 13.22 Triangular phase diagram map of a ternary library where a change in colour/contrast depicts variation in a property

A solution to this is the use of glyphs, a glyph being a single data point that contains extra data in graphic form. A simple way to achieve this is to have each point on a binary or ternary plot as a different colour, size, shape or transparency, where the variation in these characteristics represents variations in functional or structural properties, for example the plot by the author showing the ternary Bax Sry Caz TiO3 system in Fig. 13.23. In this plot the position represents the composition, the colour shows the position of the main XRD peak, telling us whether we have orthorhombic or tetragonal phases and solid solutions, and the size of the point shows the measured permittivity, from 147 for the smallest to 3573 for the largest. Further changes in features such as point shape or transparency/fill could be used to indicate other properties in the same plot.

Fig. 13.23 Plot by the author of the ternary Bax Sry Caz TiO3 system, using glyphs to show composition (position), position of the main XRD peak (colour) and permittivity (size of the point)
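A basic glyph map of this kind is straightforward to produce with standard plotting tools. The sketch below is not the author's plotting code and uses randomly generated placeholder data; it only illustrates the idea of encoding composition as position, XRD peak position as colour and permittivity as point size, in the style of Fig. 13.23.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical library data: (Ba, Sr, Ca) fractions, main XRD peak position
# (degrees 2-theta) and measured permittivity for each library member.
rng = np.random.default_rng(1)
n = 30
ba = rng.random(n)
sr = rng.random(n) * (1 - ba)
ca = 1 - ba - sr
peak_2theta = 31.0 + 1.5 * ca + 0.2 * rng.random(n)   # made-up trend
permittivity = 150 + 3400 * ba * sr                    # made-up values

# Project the ternary composition onto 2-D triangle coordinates
# (vertices: Ba at (0,0), Sr at (1,0), Ca at the apex).
x = sr + 0.5 * ca
y = (np.sqrt(3) / 2) * ca

# One glyph per sample: position = composition, colour = XRD peak position,
# size = permittivity (scaled so the largest points stay readable).
sizes = 20 + 200 * permittivity / permittivity.max()
sc = plt.scatter(x, y, c=peak_2theta, s=sizes, cmap="viridis", alpha=0.8)
plt.colorbar(sc, label="main XRD peak / degrees 2-theta")
plt.gca().set_aspect("equal")
plt.axis("off")
plt.savefig("glyph_map.png", dpi=200)
```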
Another example is shown in Fig. 13.24, from the author's paper [53], of a binary magnetoelectric composite SrM-BT combinatorial library. The black line shows the real change in composition, with the red area representing SrM and green BT, the plotted spheres show the magnetisation (the centre of each sphere being the data point), and the relative volume of the 3D spheres represents the permittivity (the larger permittivity values are too big to be represented as areas of 2D circles). It can easily be seen that the evolution in composition is not completely linear along the library (or else the red/green dividing line would be a straight diagonal), that these non-linear variations in composition are reflected in the magnetisation values, and that as the magnetisation of the composite decreases the permittivity increases. The second image in Fig. 13.24 shows seven degrees of data in a single plot: the quaternary Co-Te-Mn-Cr oxide catalyst system composition is depicted by the 3D pyramid, and the activity of each sample in the library towards one of three different molecules is shown by changes in colour, size and transparency of the data point [66, 67].

Fig. 13.24 Various uses of glyphs in displaying complex series of combinatorial data. From top left a chart showing relative proportions of two phases (red and green, black line), magnetisation (position on left axis) and permittivity (volume of 3D sphere) [53]; a four component catalyst library, with composition shown by position in the 3D pyramid, and catalytic activity with three molecules shown by colour, size and transparency of the data point [66, 67]; a chart showing magnetisation (on left axis), supposed % of SrM in the composite (bottom axis) and actual compositional proportion by a pie chart glyph for the data points [53]; triangular map of composition, with pie charts and the use of 7 different colours for the data point glyphs to show the composition of phases at each point [68]

Another approach is to use pie chart glyphs for each point, as also shown in Fig. 13.24. One example is from the author's paper [53], showing the same magnetoelectric composite library, but this time with the points representing the magnetisation on the y axis and the supposed % of SrM on the x axis, and a pie chart showing the actual measured relative proportions of the two ceramic phases in that sample of the composite library, in an easy-to-comprehend manner. It can again be seen that non-linear behaviour in the magnetisation through the library matches discontinuities in the quantity of the SrM magnetic phase. The final image in Fig. 13.24 shows a structural phase diagram produced using the weights of various x-ray diffraction patterns, with the position of each point showing the composition in a ternary Fe-Pd-Ga material, and the glyph pie charts showing the relative proportions in that composition of seven possible phases found throughout the library, indicated by different coloured sections in the pie charts [68]. Data point glyphs can also be used to contain actual images, such as plots, SEM images or photographs. Two examples are shown in Fig. 13.25. The first example shows a compositional triangle in a ternary substituted ferroelectric BiFeO3 library, but at each data point is a small image glyph of the measured ferroelectric polarisation hysteresis loop [69]. The general shape of each loop can also be
In this case, the plot contains both in-plane (IP) and out-of-plane (OP) measurements. This last paper was part of a combinatorial search for replacements for the increasingly expensive rare earth magnets. Clearly, many other kinds of glyph could be used as a data point. Furthermore, if plots are in online or electronic form, they can be interactive, with enlarged plots, more details, or even several different properties plots and images given when the 266 R.C. Pullar Fig. 13.25 Top, pseudoternary compositional map of ferroelectric BiFeO3 , co-doped with (Bi, Sm) and (Fe, Sc), with the ferroelectric hysteresis loop measured at each point shown as a glyph. The six loops highlighted in the red rectangle as shown in enlarged form below, demonstrating the fully quantitative nature of these data [69]. Bottom, compositional map of a Fe-Co-Mo alloy, with magnetic hysteresis loop glyphs at each data point (a), which can be enlarged to give fully quantitative magnetic data (b) [63] 13 Combinatorial Materials Science, and a Perspective on Challenges . . . 267 relevant glyph is clicked upon, touched or activated. This opens up a whole new area of interactive combinatorial data display and analysis, and exciting new way to handle and explore the large amount of data generated in high-throughput searches. Acknowledgments The author would firstly like to thank the FCT (Fundação para a Ciência e a Tecnologia in Portugal), and the FCT Ciência 2008 program and grant SFRH/BPD/97115/2013 are acknowledged for funding the author during the writing and publication of this chapter. The author would also like to thank the publishers and copy write holders of all figures from previous sources used in this chapter, which have been referenced in the relevant figure caption. References 1. R.B. Merrifield, Solid phase peptide synthesis. I. The synthesis of a tetrapeptide. J. Am. Chem. Soc. 85, 2149–2153 (1963) 2. K. Kenedy, T. Stefansky, G. Davy, V.F. Zacky, E.R. Parker, Rapid mapping for determining ternary-alloy phase diagrams. J. Appl. Phys. 36, 10–3808 (1965) 3. J.J. Hanak, The multiple sample concept in materials research; synthesis, compositional analysis and testing of entire multi-component systems. J. Mater. Sci. 5, 964–971 (1970) 4. S.R. Hall, M.T.R. Harrison, The search for new superconductors. Chem. Br. 30, 739–742 (1994) 5. X.-D. Xiang, X. Sun, G. Briceno, Y. Lou, K.-A. Wang, H. Chang, W.G. Wallace-Freedman, S.-W. Chen, P.G. Schultz, A combinatorial approach to materials discovery. Science 268, 1738– 1740 (1995) 6. Proceedings of the first Japan-US Workshop on Combinatorial Materials Science and Technology. Appl. Surf. Sci. 189, 175–371 (2002) 7. Proceedings of the Second Japan-US Workshop on Combinatorial Materials Science and Technology. Appl. Surf. Sci. 223, 1–267 (2004) 8. H. Koinuma, I. Tekeuchi, Combinatorial solid-state chemistry of inorganic materials. Nat. Mater. 3, 429–438 (2004) 9. R.A. Potyrailo, I. Takeuchi, Role of high throughput characterization tools in combinatorial materials science. Meas. Sci. Tech. 16, 1–4 (2005) 10. J.-C. Zhao, Combinatorial approaches as effective tools in the study of phase diagrams and composition-structure relationships. Prog. Mater. Sci. 51, 557–631 (2006) 11. J. Ouellette, Combinatorial materials synthesis. Ind. Phys. 4, 24–27 (1998) 12. E.W. McFarland, W.H. Weinberg, Combinatorial approaches to materials discovery. Trends Biotechnol. 17, 107–115 (1999) 13. Y. Matsumoto, M. Murakami, Z. Jin, A. Ohtomo, M. Lippmaa, M. Kawasaki, H. 
Chapter 14
High Throughput Combinatorial Experimentation + Informatics = Combinatorial Science

Santosh K. Suram, Meyer Z. Pesenson and John M. Gregoire

Abstract Many present, emerging and future technologies rely upon the development of high-performance functional materials. For a given application, the performance of materials containing one or two elements from the periodic table has been evaluated using traditional techniques, and additional materials complexity is required to continue the development of advanced materials, for example through the incorporation of several elements into a single material. The combinatorial aspect of combining several elements yields vast composition spaces that can be effectively explored with high throughput techniques. State of the art high throughput experiments produce data which are multivariate, high-dimensional, and consist of wide ranges of spatial and temporal scales. We present an example of such data in the area of water splitting electrocatalysis and describe recent progress in two areas of interpreting such vast, complex datasets. We discuss a genetic programming technique for automated identification of composition-property trends, which is important for understanding the data and crucial in identifying representative compositions for further investigation. By incorporating such an algorithm in a high throughput experimental pipeline, the automated down-selection of samples can empower a highly efficient tiered screening platform. We also discuss some fundamental mathematics of composition spaces, where compositional variables are non-Euclidean due to the constant-sum constraint. We describe the native simplex space spanned by composition variables and provide illustrative examples of statistics and interpolation within this space. Through further development of machine learning algorithms and their prudent implementation in the simplex space, the data informatics community will establish methods that derive the most knowledge from high throughput materials science data.

S.K. Suram · M.Z. Pesenson · J.M. Gregoire
Joint Center for Artificial Photosynthesis, California Institute of Technology, Pasadena, CA 91125, USA
e-mail: gregoire@caltech.edu
14.1 Tailoring Material Function Through Material Complexity: The Utility of High Throughput and Combinatorial Methods

Many technological industries, ranging from manufacturing to renewable energy, rely on the discovery of new high-performance solid state materials. A common approach to the discovery of advanced materials is through increasing chemical complexity, for example through the incorporation of several elements into a single material. This long-standing approach in materials research traditionally involves the synthesis and evaluation of one composition at a time. Most of the single-element and binary-composition spaces were effectively investigated in the 20th century by this low-throughput method, and the frontier has thus been pushed to higher order ternary, quaternary, etc. composition spaces. Due to the vast number of possible sets of elements and compositions in a given composition space, systematic experimental investigation of these high-order composition spaces requires sophisticated tools for high throughput synthesis and evaluation of new compositions. Recent advancements in experimental methods for the rapid synthesis of material libraries and rapid measurement of material properties are yielding vast ensembles of complex data [14, 17, 19, 46, 49, 60, 66]. A tenet of materials science is the development of composition-property relationships, and the automated identification of relationships within high throughput datasets requires the development of new informatics tools. In this chapter we discuss a high throughput experimental pipeline which motivates the development of specific informatics tools. In particular, we note the importance of tiered screening, wherein a high throughput pipeline contains a series of experimental measurements that operate at disparate sample throughput. To avoid bottlenecks, a sample down-selection method must be implemented. The informatics challenge arises in the automated identification of a subset of samples for lower throughput measurements such that the selected subset retains maximal "information content," or maximal ability to establish composition-property relationships with the incomplete dataset. With high throughput datasets in hand, analysis of compositional trends requires prudent practices for the statistical analysis of compositional data. We review unique attributes of compositional data and, through illustrative examples, show that informatics and statistical algorithms for compositional data must account for the non-Euclidean nature of compositional variables.

14.2 Materials Datasets as an Instance of Big Data

High throughput materials science requires handling enormous amounts of complex data produced by modern high-throughput experimental technologies. Many modern techniques produce data beyond what can be readily processed, and even fields with well-established data archives and methodologies, such as genomics, are facing new and mounting challenges in data management and exploration.

Table 14.1 The number of unique compositions in a discrete composition library is shown for several values of the number of components n and composition step δ. The number of δ steps ("num. steps") between 0 and 100 % is also listed.

num. steps   10       20       30        40         50
δ            10 %     5 %      3.33 %    2.5 %      2 %
n = 2        11       21       31        41         51
n = 3        66       231      496       861        1,326
n = 4        286      1,771    5,456     12,341     23,426
n = 5        1,001    10,626   46,376    135,751    316,251
n = 6        3,003    53,130   324,632   1,221,759  3,478,761
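The entries of Table 14.1 are consistent with the standard stars-and-bars count: with s = 100 %/δ composition steps, an n-component grid contains C(s + n − 1, n − 1) unique compositions. A minimal Python sketch that reproduces the table (the function name is illustrative) is:

```python
from math import comb

def num_compositions(n: int, steps: int) -> int:
    """Unique n-component compositions on a grid with `steps` intervals
    between 0 and 100% (step size delta = 100%/steps): C(steps + n - 1, n - 1)."""
    return comb(steps + n - 1, n - 1)

# Reproduce Table 14.1: rows n = 2..6, columns steps = 10, 20, 30, 40, 50
for n in range(2, 7):
    print(n, [num_compositions(n, s) for s in (10, 20, 30, 40, 50)])
# e.g. n = 6 with a 3.33% step (30 intervals) already gives 324,632 compositions,
# illustrating why high throughput methods are needed for high-order spaces.
```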
steps”) between 0 and 100 % is also listed bytes, these modern data sets are multivariate, high-dimensional and consist of wide ranges of spatial and temporal scales. All of this severely restricts the capability of traditional approaches to modeling, analysis, and visualization of data. The data not only consist of vectors in Euclidean spaces, but may also include new types of data (e.g. tensor fields), or any of those data types defined not just on a Euclidean space, but on manifolds or graphs as we discuss for compositional variables. To conceptualize the extent of data in explorations of high order composition spaces, one can count the number of discrete composition samples required to cover the composition space with a fixed composition interval. A composition space of n components contains n − 1 degrees of freedom due to closure, the requirement that the individual concentrations sum to 1. Table 14.1 provides the number of unique compositions in an n-component composition space with composition interval δ, for several illustrative values of these parameters. For unexplored composition spaces, using a fine composition step is desirable to mitigate the possibility of missing a new high performance material, and to explore high order composition spaces, high throughput techniques are required. The systematic exploration of the various combinations of the components is an example of combinatorial materials science, and given the vast number of possible combinations and the speed at which they are probed experimentally, we refer to these investigations as high throughput (HiTp). For each composition, experimental investigations may include a scalar measurement of performance (so-called “FOM”, Figure of Merit), multi-dimensional data such as images and spectra, or any combination thereof. In particular, the extent of data may vary for each composition in a given library, and the disparate dimensionality of data increases the data complexity and the challenges for informatics algorithm development. Data mining and machine learning embrace many sophisticated data structures. One of the main difficulties in data mining is caused by dependencies between multiple variables/parameters. Identifying a set of independent variables/parameters can be seen as a particular case of a general approach called data dimensionality reduction. When data points are close to a hyperplane in a Euclidean space, methods such as principal component analysis (PCA) and correspondence analysis (CA) are widely used for dimension reduction. These methods allow one to extract major dependen- 274 S.K. Suram et al. cies between physical variables. In case when data lie on a non-Euclidean space, more sophisticated methods of analysis are required. Complex data sets cannot be adequately understood without detecting various scales that might be present in the signal. However, traditional multiresolution analysis (MRA) tools based on wavelets are restricted mostly to one-dimensional or two-dimensional signals. Thus, in order to accurately extract information from modern data sets, the development of multiscale analysis applicable to functions defined on manifolds and graphs is of great importance. Extending multiresolution analysis from Euclidean to curved spaces and networks presents a significant challenge to applied mathematics. This is an emerging field, which is still being developed. 
Wavelet-type bases and frames consisting of nearly exponentially localized band-limited functions are imperative for computational harmonic analysis and its applications in statistics, approximation theory, and so on. For the two-dimensional sphere and the group of its rotations, frames have already found a number of important applications in statistics and crystallography [6, 18, 58]. An adaptive multiscale approach to data analysis based on synchronization was suggested in [59]. The approach is nonlinear and data driven in the sense that it does not rely on an a priori chosen basis, and it can be extended to automatically determine the scale for complex signals defined on graphs/manifolds (regarding analysis on compact manifolds, see also the remarks in section "Composition Spread and Distances" in this chapter). Overall, MRA is a necessary, indispensable approach to efficient representation and analysis of the complex information (signals, images, etc.) produced in high throughput and combinatorial experiments. Traditional statistical methods may lead to erroneous dependencies and incorrect inferences when applied to modern complex data, as we demonstrate in this chapter. But even if data consist of usual vectors in a Euclidean space, there are still many open issues. One of them is related to so-called null-hypothesis significance testing (NHST). It has lately been recognized that it is necessary to move from NHST to more instructive effect sizes and confidence intervals (CIs), and to apply meta-analysis [7, 15, 30]. Although CIs are more informative than the NHST approach since, in some form, they quantify the uncertainty, their meaning is often misunderstood. In fact, CIs are intimately connected with NHST, and both are superseded by Bayesian techniques [42, 43, 73]. The complexity of the data calls for the application and development of adequate techniques, which are more powerful than the conventional ones and tailored to specific types of experimental data. Statistical methods are often considered simply a toolbox and are consequently utilized superficially in data analysis. To make full use of the deluge of complex data, researchers must transcend the notion of toolbox statistics and engage in the independent applied science of statistics and informatics. Combinatorial science is data-driven: its main premise is that the discovery and optimization of materials can be made efficient if directed by statistical inference based on the experimental data. In other words, combinatorial materials science cannot be truly realized without modern statistics and, more generally, informatics. Informatics here refers to the management of the complete data lifecycle: the storage, integration, and compression of data as well as the quantification of uncertainty and the mining/analysis of data via statistical learning and data mining techniques. Moreover, experimental techniques for the generation of data are often developed independently of the development of analysis techniques for those data. Statistical analysis should not be subsequent, but should rather be a part of the experimental design [12]. In this chapter we describe the development of both experimental techniques and analytical methods, and while these developments cannot take place strictly simultaneously, we note the importance of iterative development of both sides of the high throughput methods.
14.3 High Throughput Experimental Pipelines: The Example of Solar Fuels Materials Discovery

A high throughput experimental pipeline comprises a network of experimental methods that are interlinked in a process workflow to enable a complete cycle of high throughput experiments [26]. A high-level summary of a high throughput pipeline for the discovery of solar fuels materials is shown in Fig. 14.1. The pipeline contains three primary sectors: the synthesis of material libraries, the screening of materials via measurements of material performance [29, 31, 32, 34, 39, 51, 74, 75], and the characterization of materials [28]. The materials screening portion is split according to the two primary types of functional materials for solar fuels technology, and the screening of light absorber and of electrocatalyst materials each involves unique experiments. Several data-related aspects of the pipeline that are not shown are data management, data analysis, and design of experiments. The informatics-based aspect that is shown relates to the active down-selection of samples. That is, to create a throughput-matched series of screening experiments, a higher throughput coarse screening method is coupled to a lower throughput fine screening method through the judicious selection of a subset of the samples.

Fig. 14.1 (top) Sectors of the accelerated discovery pipeline, with the screening sector split for the two general material functions of light absorption and electrocatalysis [26]. (bottom) Tiered screening experiments are shown for evaluating electrocatalyst libraries, where sample down-selection occurs between subsequent screening experiments

The three electrocatalyst screening experiments listed in Fig. 14.1 have been described in recent publications, with the higher throughput method being the parallel imaging of O2 bubbles produced by electrocatalysis of the oxygen evolution reaction (OER) [75]. The two other experiments are serial experiments performed by a scanning drop cell (SDC) device to quantify OER electrocatalytic activity [29]. These experiments include the collection of a cyclic voltammogram at a rapid sweep rate and then a longer measurement of catalyst overpotential at a fixed current density, with the experiment duration being sufficiently long to demonstrate that any anodic current could not be dominated by a sample corrosion process. While the throughput of each technique depends on the choice of experimental parameters, for the screening of material libraries with approximately 1800 composition samples on a library plate, the throughput of each stage is approximately 180, 10 and 2 samples per minute, respectively. While some throughput matching can be achieved through duplication of instruments for performing the lower-throughput techniques, practical throughput matching is attained through sample down-selection at each juncture. The transition from the screening portion of the pipeline to materials characterization often involves another substantial down-selection of samples. While the development of HiTp materials characterization [27, 35, 40] and related analysis techniques [44, 45] is an active field of research, the characterization throughput is often lower than that of the final screening in a tiered screening pipeline. For a given composition region, a systematic variation in a materials characterization attribute may correspond to a variation in the performance metric.
By partitioning a composition space into regions which exhibit systematic trends in performance, samples can be selected for detailed characterization to capture the attribute-property relationships both within and among the composition regions. Implementing this strategy in a down-selection algorithm is a primary goal of informatics for high throughput pipelines.

14.4 An Illustrative Dataset: Ni-Fe-Co-Ce Oxide Electrocatalysts for the Oxygen Evolution Reaction

In the following sections, we present the challenges and initial progress in two areas of informatics related to high throughput materials discovery: the automated down-selection of samples in a tiered screening pipeline, and the statistical analysis of compositional variables as a critical aspect of identifying composition-property relationships. Both of these discussions will use simple, synthetic datasets as illustrative examples. In addition, examples will be provided using an experimental dataset from the high throughput mapping of OER catalyst activity over a pseudo-quaternary composition space of metal oxides containing all possible combinations of Ni, Fe, Co and Ce with 3.33 at.% intervals. For details on materials synthesis and experimental methods, we refer the reader to previous reports [31, 34]. Here we provide a map of a primary figure of merit for OER electrocatalysts for solar fuels applications, the overpotential required to provide 10 mA cm−2 geometric catalytic current density. The results are summarized in Fig. 14.2, with Fig. 14.2a showing an example map of the FOM for an array of samples on a library plate, which are mapped onto composition space in Fig. 14.2b. The composition mapping of the pseudo-quaternary spread is performed as a stack of pseudo-ternary triangles with increasing Ce concentration. The common FOM color scale is shown in Fig. 14.2c.

Fig. 14.2 A FOM for solar fuels applications, the overpotential for delivering an OER geometric current density of 10 mA cm−2, is measured on composition libraries (a) and mapped to composition space (b) using a false color scale (c). The (Ni-Fe-Co-Ce)Oz composition space is shown as a stack of Ni-Co-Ce composition plots with increasing Ce concentration [26]

This dataset is most representative of the third tier of electrocatalyst screening described above, although it is a full dataset for which we can perform down-selection informatics to choose a sample subset for additional screening or characterization experiments. We can also analyze FOM trends with (sub-)compositional variables, as will be illustrated in the final section.

14.5 Automating Sample Down-Selection for Maximal Information Retention: Clustering by Composition-Property Relationships

As described above, HiTp experimentation typically involves the coarse, rapid measurement of a FOM or property of interest for each sample in a material library. Appropriate down-selection methods are essential to ensure the generation of information-rich experimental data that lead to knowledge and discovery. While a combinatorial material library may include variation of a number of process parameters such as synthesis temperature or processing parameters [11, 13], we continue this discussion in the context of composition libraries.
For demonstration purposes, a synthetic dataset with four distinct composition regions that are governed by different composition-property relationships is shown in Fig. 14.3a, and the resulting down-selection using the top 'z' percentile of performing compositions is shown in Fig. 14.3b.

Fig. 14.3 (a) A ternary composition space (with 5 at.% step) is partitioned into 4 property fields (left), and a synthetic composition-property plot is obtained by applying distinct polynomial functions to the compositions of each property field (right). (b) Down-selection of compositions by selecting the top 'z' percentile of compositions based on their property value. The down-selected compositions, colored red, are very sensitive to the choice of 'z,' which is usually fixed based on throughput matching of successive experiments. The property field boundaries are overlaid for comparison. (c) Clustering of the ternary composition library using a Euclidean distance metric on the property space (left) and composition-property space (right). Clustering using only the property yields clusters with compositions scattered over the library, while adding the compositions to the clustering metric yields clusters that are mostly connected in composition space but do not match the original property fields, whose boundaries are overlaid for comparison

It is evident that down-selection based on the top 'z' performers is highly sensitive to the value of z, which is typically imposed by the throughput capabilities of the HTE workflow, and, more importantly, is incapable of capturing the composition-property relationships, thus necessitating more sophisticated partitioning/clustering techniques.
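For reference, percentile-based down-selection of the kind shown in Fig. 14.3b amounts to a simple threshold on the measured FOM. A minimal sketch follows (the function name and the synthetic overpotential values are illustrative assumptions, not the pipeline's implementation); it makes explicit that the selected subset is dictated entirely by the throughput-imposed quota z and carries no composition-property information:

```python
import numpy as np

def top_z_downselect(fom, z, lower_is_better=True):
    """Select the top z percent of samples by figure of merit (FOM).
    For an overpotential-like FOM, lower values are better."""
    cutoff = np.percentile(fom, z if lower_is_better else 100 - z)
    mask = fom <= cutoff if lower_is_better else fom >= cutoff
    return np.flatnonzero(mask)

# Synthetic overpotentials (V) for a ~1800-sample library plate
fom = np.random.default_rng(2).normal(0.40, 0.05, size=1800)
for z in (1, 5, 10):
    print(f"z = {z}% -> {top_z_downselect(fom, z).size} samples retained")
```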
Traditional clustering techniques, such as k-means clustering on either the composition-property space or the property space alone (Fig. 14.3c), depend on spatial statistics and are likewise incapable of capturing composition-property relationships. In this context, we discuss the role of evolutionary statistical methods and information theory concepts in identifying several composition-property relationships and generating information-rich experimental databases.

14.5.1 Down-Selection for Maximal Information Content

The above discussion on tiered pipeline screening introduces the importance of down-selection for maximal information content in the context of a high-throughput workflow. From a materials discovery point of view, maximal-information down-selection in a HiTp pipeline allows the generation of information-rich experimental databases that allow us to extract knowledge pertinent to composition-(micro)structure-experimental parameters-property relationships. This knowledge is typically inaccessible via the other main facets of materials design based discovery, namely first principles computations and materials informatics as applied to existing databases. Thus, the generation of information-rich experimental databases provides a unique opportunity to exploit the capabilities of HiTp to simultaneously perform exploratory and knowledge-based searches for new materials. These experimental databases can be used as an input to data mining methods to extract empirical relationships; they also form an important resource for developing sophisticated first principles based models that are applicable to higher order composition spaces and in-operando conditions.

Distance- and density-based clustering approaches, which are ubiquitous in the clustering literature and have been successfully applied to materials science problems where spatial statistics are relevant, are inapplicable for partitioning the composition space to maximize information content. Alternately, information theory based metrics provide access to the higher order statistics [22, 36, 37] necessary for clustering/classification in complex data structures. Specifically, the Shannon entropy criterion has been successfully applied as a supervised classification algorithm for unravelling crystal chemistry design rules [41] and the discovery of materials [4]. The selection, crossover and mutation based evolutionary operations of genetic programming enable complex data relationships to be captured as genetic trees, resulting in its application for supervised classification of complex data [52]. Other evolutionary techniques such as genetic algorithms [5] and particle-swarm optimization [72] have also been used for clustering data. However, they use cluster variance-based objective functions and thus are unable to capture non-hyperspherically shaped clusters, which are typical of phase/property fields in materials science. While several data mining algorithms have been applied to (a) capture the function relating the input and output variables [9, 68], (b) cluster data based on input variables [63], and (c) classify complex data structures in supervised classification [47], these approaches are insufficient to cluster data based on the (dis)similarity in the function relating input and output variables. For this purpose, an approach that is capable of capturing and classifying several underlying composition-property relationships is required. Mathematically, this is achieved by identifying clusters that maximize the divergence among the composition-property relationships described by them. Genetic programming is a well-accepted and robust methodology for capturing functional relationships, whereas divergence is measured using information theory based concepts. In the following sections, the concepts of multi-tree genetic programming as applied to a materials discovery problem using an information theory based objective function are introduced and refined. We utilize genetic programming trees to represent the functions that map compositions and HiTp property measurements to memberships in a fixed number of clusters. The clustering is defined over the composition space such that the optimized trees cluster the compositions based on the functional relationships between composition and measured property. This method of clustering allows selection of representative compositions from each cluster for further investigation and characterization, resulting in an information-rich experimental materials genome with respect to composition-characterization attribute-property relationships.

14.5.2 Information-Theoretic Approach

In an information-theoretic approach, clustering the composition space such that the similarity of composition-property relationships among different clusters is minimized, while the similarity of composition-property relationships within a given cluster is maximized, can be represented as minimizing the cross, "between-cluster" information potential while maximizing the self, "within-cluster" information potential.
An attractive metric to achieve this for a two-class system is the Cauchy-Schwarz divergence [38, 64], expressed as

D_{cs}(p_1, p_2) = -\ln \frac{\int p_1(x)\, p_2(x)\, dx}{\sqrt{\int p_1^2(x)\, dx \int p_2^2(x)\, dx}},   (14.1)

where p_k(x) is the probability distribution of x in class C_k and x is the (multidimensional) composition coordinate. In the case of discrete data, the probability distribution functions can be estimated using a Parzen window [38] with a Gaussian kernel:

p(x) = \frac{1}{n} \sum_{i=1}^{n} G(x - x_i, \sigma^2), \quad \text{where} \quad G(x - x_i, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{(x - x_i)^2}{2\sigma^2}\right).   (14.2)

The kernel width σ is an a priori specified parameter, n is the number of observations, and d is the dimension of the dataset. Using (14.2), Jenssen et al. [37] show that the divergence function of (14.1) can be estimated as

D_{cs}(p_1, p_2) \approx -\ln \frac{\sum_{x_i \in C_1} \sum_{x_j \in C_2} G_{ij, 2\sigma^2}}{\sqrt{\sum_{x_i, x_{i'} \in C_1} G_{ii', 2\sigma^2} \; \sum_{x_j, x_{j'} \in C_2} G_{jj', 2\sigma^2}}},   (14.3)

where G_{ij, \sigma^2} = G(x_i - x_j, \sigma^2). The fact that every composition in a composition library belongs to exactly one property field is imposed using a membership value {}^{i}m_k for data point i in cluster k:

{}^{i}m_{k'} = 1 \text{ for } k' = k \quad \text{and} \quad {}^{i}m_{k'} = 0 \text{ for } k' \neq k,   (14.4)

and {}^{i}\mathbf{m} is defined as the vector of membership values for data point i over the set of clusters. Using these membership notations, Boric and Estévez [8] extend the Cauchy-Schwarz divergence function to a c-cluster problem (c ≥ 2) as

D_{cs}(p_1, p_2, \ldots, p_c) \approx -\ln \frac{\frac{1}{2} \sum_{i,j=1}^{n} \left(1 - {}^{i}\mathbf{m}^{T}\, {}^{j}\mathbf{m}\right) G_{ij, 2\sigma^2}}{\sqrt{\prod_{k=1}^{c} \sum_{i,j=1}^{n} {}^{i}m_k \, {}^{j}m_k \, G_{ij, 2\sigma^2}}}.   (14.5)

In this objective function, the denominator scales as a power of the number of clusters c, whereas the numerator varies comparatively slowly with c, resulting in a denominator-dominated objective function as the number of clusters increases. Therefore, we introduce a modified form of the Cauchy-Schwarz divergence function such that the numerator and denominator remain invariant to the number of clusters:

D_{cs}(p_1, p_2, \ldots, p_c) \approx -\ln \frac{\frac{c}{2(c-1)} \sum_{i,j=1}^{n} \left(1 - {}^{i}\mathbf{m}^{T}\, {}^{j}\mathbf{m}\right) G_{ij, 2\sigma^2}}{\sqrt[c]{\prod_{k=1}^{c} \sum_{i,j=1}^{n} {}^{i}m_k \, {}^{j}m_k \, G_{ij, 2\sigma^2}}}.   (14.6)

To introduce the modified Cauchy-Schwarz divergence function (14.6) into an optimization algorithm, a continuous membership function is required, because the binary membership defined in (14.4) does not provide a divergence function that varies continuously with alterations in the membership of a given data point in a given cluster. Further, to accurately cluster property fields, the membership values should be based on composition-property relationships. To facilitate this, continuous membership values in the range [0, 1] are introduced by defining a membership function m_k(xf) for each cluster such that {}^{i}m_k = m_k(xf_i). The domain of the probability distribution functions for Parzen window estimation is the composition space, which enables compositional connectedness in the clusters, whereas the domain of the membership functions is the combined composition and property space, with coordinate denoted xf, which enables the membership functions to represent composition-property relationships. Additionally, by constraining the membership values to sum to one, they can be regarded as a set of posterior probabilities:

m_k(xf) = P(C_k \,|\, xf), \qquad \sum_{k=1}^{c} m_k(xf) = 1.   (14.7)
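To make the objective concrete, the following numpy sketch evaluates the kernel matrix of (14.2) and the multi-cluster objective in the form given in (14.6). The function names are illustrative, and the c/(2(c − 1)) prefactor and c-th root follow the reconstruction above rather than a verified transcription of [8]; treat this as a sketch under those assumptions.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Pairwise Parzen kernel values G_{ij, 2*sigma^2} for composition coordinates X (n x d)."""
    d = X.shape[1]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    var = 2.0 * sigma ** 2
    return np.exp(-sq / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)

def modified_cs_divergence(G, M):
    """Objective of (14.6): M is an (n x c) membership matrix with rows summing to 1."""
    n, c = M.shape
    cross = M @ M.T                                   # (i m)^T (j m) for all sample pairs
    numerator = (c / (2.0 * (c - 1))) * np.sum((1.0 - cross) * G)
    within = np.array([M[:, k][:, None] * M[:, k][None, :] * G for k in range(c)]).sum(axis=(1, 2))
    denominator = np.prod(within) ** (1.0 / c)        # c-th root of the product over clusters
    return -np.log(numerator / denominator)

# Example: 50 random ternary compositions with random soft memberships over 3 clusters
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(3), size=50)
M = rng.dirichlet(np.ones(3), size=50)
print(modified_cs_divergence(gaussian_gram(X, sigma=0.17), M))
```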
14.5.3 Genetic Programming Based Clustering

Genetic trees are computer programs capable of learning complex relationships present in the data. In a c-class dataset, there are c functional relationships between composition and property that need to be learnt or distinguished from each other. Thus, we utilize a multi-tree genetic programming (MT-GP) framework developed by Muni et al. [52] and Boric and Estévez [8], such that each tree learns the functional relationship between composition and property for one of the classes in the data. In this representation, each tree T_k defines a scalar function T_k(xf) on the composition-property space that is used to generate the membership values m_k(xf), as described below. This reduces the functional-relationship-based clustering problem to the optimal identification of composition-property relationships by MT-GP such that the resulting membership values maximize the Cauchy-Schwarz divergence function (14.6). The algorithm is based upon the construct illustrated in Fig. 14.4, where each cluster is represented by a hierarchical tree of root, leaf and terminal nodes in the MT-GP chromosome. The leaf nodes and the root nodes are chosen from the set of operators {+, −, ×, ÷}. The terminal nodes are numerical, and their domain includes the composition and property parameter space as well as random integer constants in [0, 10]. For the tree representing a cluster k, the sequence of operators that terminate with numeric values comprises a nested algebraic function T_k(xf).

Fig. 14.4 A schematic of a multi-tree chromosome in an MT-GP approach for 3 clusters and maximum depth 3. Abbreviations used are TN: terminal node, LN: leaf node, RN: root node

Initialization, mutation, selection, crossover and termination proceed using standard genetic programming techniques and are discussed elsewhere [69], although crossover in an MT-GP approach differs from crossover in traditional genetic programming. A crossover between any two selected parent chromosomes with c trees can occur in \binom{c}{2} ways, because the kth tree in chromosome i does not have to cross over with the kth tree in chromosome j, given that the genetic tree-property field mapping is not necessarily the same for all chromosomes. For each pair of multi-tree chromosomes selected as parents (using a probability p_cross, here set to 1), pairs of trees are randomly selected, with one tree from each of the parent chromosomes contributing to the pair, such that every tree in the parent chromosomes is present in exactly one pair. To balance between the exploratory and exploitative capabilities of genetic programming, we define a base probability (p_treecross) and a probability multiplier (p_cm) such that the probability for crossover of the kth randomly selected pair of trees for a given pair of parent chromosomes is p_treecross × (p_cm)^{k−1}. Values of p_treecross in the range 0.6–0.8 and p_treecross × (p_cm)^{k−1} in the range 0.8–1.0 are found to be reasonable estimates to ensure robust convergence. However, further research is required to identify optimal values of these parameters using various case studies.

14.5.4 Calculating Membership

Boric and Estévez [8] related the output of the trees T_k(xf) to membership values m_k(xf) using a sigmoid transformation followed by normalization:

\tilde{T}_k(xf) = \frac{1}{1 + e^{-T_k(xf)}}, \qquad m_k(xf) = \frac{\tilde{T}_k(xf)}{\sum_{k'=1}^{c} \tilde{T}_{k'}(xf)}.   (14.8)
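A minimal numpy sketch of this membership calculation is given below; the function name is illustrative, and the tree outputs are arbitrary stand-ins for evaluated MT-GP trees.

```python
import numpy as np

def memberships_sigmoid(tree_outputs):
    """Map raw MT-GP tree outputs T_k(xf) (shape: n_samples x c) to memberships
    per (14.8): a sigmoid squashing followed by normalization across the c trees."""
    squashed = 1.0 / (1.0 + np.exp(-tree_outputs))
    return squashed / squashed.sum(axis=1, keepdims=True)

# Toy usage: three trees evaluated at four composition-property points
T = np.array([[ 2.0, -1.0,  0.5],
              [ 0.1,  0.1,  0.1],
              [-3.0,  4.0,  1.0],
              [ 1.0,  1.0, -2.0]])
M = memberships_sigmoid(T)
labels = M.argmax(axis=1)   # hard cluster assignment by maximum membership
```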
The scalar outputs from different trees could be of varying magnitudes depending on the distinct composition-property function they map, and thus could result in membership values that are skewed towards a particular function. To avoid this, relative memberships within each class are obtained by first normalizing the output of the trees T_k(xf) with respect to the minimum and maximum values of T_k(xf) for a given k, and then normalizing the relative memberships such that the m_k(xf) represent posterior probabilities:

\hat{T}_k(xf) = \frac{T_k(xf) - T_k^{\min}}{T_k^{\max} - T_k^{\min}}, \qquad m_k(xf) = \frac{\hat{T}_k(xf)}{\sum_{k'=1}^{c} \hat{T}_{k'}(xf)}.   (14.9)

The most representative class label set k(x) is computed using

k(x) = \arg\max_{k} \, m_k(xf).   (14.10)

To ensure that variations in each feature vector are given equal importance, the composition vectors and the property vectors need to be converted to unit standard deviation prior to the MT-GP analysis.

14.5.5 Application to a Synthetic Library

Figure 14.5 shows the optimal membership set obtained after clustering the dataset shown in Fig. 14.3, assuming the presence of four clusters and using a Gaussian kernel width σ = 0.17 at.%. Figure 14.5 also shows the clustering of compositions based on their maximum membership class k(x). Given that the number of property fields in the synthetic dataset and the number of clusters used in the MT-GP algorithm are equal, the association of a synthetic property field and a calculated cluster is easily made by evaluating the maximum intersection of the composition points. The clusters in Fig. 14.5 are colored corresponding to the association of property fields in Fig. 14.3, and comparison between these composition maps reveals 14 misclassified samples, approximately 8 % of the data points. The misclassified samples lie on the boundaries between different property fields, where the continuous membership parameters show partial membership in each of the neighboring fields. That is, the MT-GP algorithm produces the correct property fields with the boundaries blurred by 1 or 2 composition intervals.

Fig. 14.5 (left) Maps of the membership of each composition in the four optimized MT-GP trees. (right) The four clusters obtained by taking the maximum membership for each composition, with the property field boundaries from Fig. 14.3 overlaid for comparison. The 14 misclassified compositions are marked by red borders

14.5.6 Experimental Dataset

To demonstrate structure-property relationship clustering on experimental data, we use the (Ni-Fe-Co-Ce)Ox catalyst performance dataset from Fig. 14.2. The 5429 FOM values and corresponding 4-component compositions are used as the input for the MT-GP algorithm with 4 trees, each with maximum depth 4, and σ = 0.17. We choose 4 clusters (4 trees) to demonstrate the capability of our algorithm to capture important composition-FOM relationships. One of the essential genetic operators is division, which allows the capture of complex composition-property relationships. However, this adds special constraints for the treatment of compositions along ternary faces, binary lines, and unary end points, which have at least one composition component equal to zero. To avoid division by zero, we shift all the compositions by 0.01 at.%. Using maximum membership to define representative clusters, the stacked-ternary representation of the 4 optimal clusters obtained from MT-GP is shown in Fig. 14.6.

Fig. 14.6 Mapping of the most representative cluster onto quaternary compositions in a (Ni-Fe-Co-Ce)Ox library

In any experimental dataset of composition-property information, there is no known optimal solution for composition clusters.
For the dataset of Fig. 14.2, two unique, highly active catalyst composition regions have been identified and classified through additional electrochemical characterization [33]. The recently discovered catalyst composition region contains little to no Fe and approximately 50 % Ce, and is identified as the α cluster. Traditional mixed-transition-metal oxides with at least approximately 50 % Ni comprise the low-Ce region of highly active catalysts, which is identified as the χ cluster. The MT-GP algorithm provides information for two other clusters with lower catalytic activity. Given that the FOM explored is convoluted by experimental noise and has limited dynamic range, the excellent clustering results suggest that the MT-GP algorithm can be successfully deployed for automated down-selection routines. While further research is necessary to develop a non-parametric MT-GP based clustering algorithm, the approach presented establishes a protocol for identifying distinct, complex composition-property relationship fields from combinatorial materials science data and presents a significant step towards developing information-rich experimental materials genomes. In addition, the compositional connectedness of clusters is encouraged by Euclidean-metric-based Gaussian kernels. While Gaussian kernels capture clusters effectively for the test cases demonstrated, compositional data are defined on the simplex, as described below, requiring additional development of clustering and down-selection methods.

14.6 The Simplex Sample Space and Statistical Analysis of Compositional Data

A primary objective of combinatorial materials science is to unravel the composition dependence of materials properties. Probably no other field has so much of its data intrinsically expressed as percentages (compositional data) as do chemistry and combinatorial materials science. Since the percentages sum to a constant, the composition sample space is not the usual Euclidean space. Indeed, the constant sum constraint does not allow the components of a composition to vary from −∞ to ∞, and a composition of N elements is confined to a restricted part of the Euclidean space called the simplex,

S^N = \{x : x_k \geq 0, \; \sum_{k=1}^{N} x_k = 1\}   [1, 2, 53].

Conventional statistical analysis of such data does not incorporate the inherent relationships between the elements, even though they are crucial for the physics and chemistry of materials. Moreover, conventional processing of compositional data introduces artifacts such as spurious correlation, while compositional statistical methods enable more accurate extraction of composition-property relationships. The relationships between compositions and their properties are intrinsically multivariate, and compositional data require special methods of processing.

Fig. 14.7 Demonstration of subcompositional incoherence using the dataset of Fig. 14.2. For each Fe concentration, the lowest overpotential value from the set of samples with that Fe concentration is shown under two subcompositional representations of the quinary oxides: (black) quantification of Fe, Co, Ce and Ni, and (red) quantification of Fe, Co and Ce

In this section we present the importance of compositional data analysis (CDA) for materials science. The concepts and importance of closure and sub-compositional incoherence are discussed, and to demonstrate the consequences of these concepts for the interpretation of experimental data, we begin with an analysis of composition trends within the data of Fig. 14.2.
An illustrative, practical compositional analysis is to evaluate the relationship between Fe concentration and electrocatalytic activity. To generate Fig. 14.7, we consider discrete Fe concentration intervals of 3.33 at.% and, for the set of samples with a given Fe concentration, extract the lowest overpotential value. The corresponding compositional trend indicates how good a catalyst can be with a given Fe concentration. An important realization about the discussion of this composition library is that the above figures have only considered the composition of the four cations, as the oxygen stoichiometry is unknown. That is, the samples have been treated as quaternary subcompositions of the quinary parent compositions. Figure 14.7 shows the composition trend calculated using the Fe-Co-Ce-Ni subcomposition space and the analogous trend using Fe-Co-Ce compositions, which may result from an experiment wherein the Ni concentrations are unknown. The striking differences between the overpotential trends using these two subcomposition spaces highlight an inherent complexity of compositional data. Using illustrative synthetic datasets and mathematical descriptions of CDA, we demonstrate that Euclidean-based correlation structure should not be used to interpret associations among measured elemental concentrations. In particular, we demonstrate induced correlations and subcompositional incoherence of the Pearson correlation coefficient. These effects are caused by the constant sum constraint that restricts the sampling space to a simplex instead of the usual Euclidean space. Since statistical measures such as the mean, standard deviation, etc., are defined for the Euclidean space, traditional correlation studies, multivariate analysis, and hypothesis testing may lead to erroneous dependencies and incorrect inferences when applied to compositional data. These issues demonstrate that, prior to applying the usual statistical methods, the data should be transformed to remove the constant sum constraint. Logratio transforms remove the data-sum constraint by mapping the components of the compositions into a Euclidean space, thus enabling one to apply classical statistical methods. Moreover, a metric vector space structure can be introduced in the simplex (via the simplicial metric based on logratios), thus enabling meaningful statistical analysis of compositional data. We apply logratio analysis to the interpolation of simulated composition data. Comparison of a consistent compositional interpolation based on balances with the traditional linear approach reveals discrepancies between their results that are crucial for correct statistical analysis of composition-property relationships. Altogether these results demonstrate the importance of using physically/chemically adequate and mathematically consistent approaches to compositional data, particularly in high-order composition spaces.

14.6.1 The Closure Effects—Induced Correlation

The traditional way to describe the pattern of variability of data is through estimates of the raw mean, covariance, and correlation matrices. Individual components of compositional data are not free to vary independently: if the proportion of one component decreases, the proportion of one or more other components must increase, thus leading to an artificial correlation that is, in fact, caused by the constant sum constraint. Indeed, the closure, or in other words the constant sum constraint, affects the correlation between variables.
Consider, for example, a set of N-part compositions that can be treated as an M × N matrix W, where N is the number of elements in the composition and M is the number of measurements, or samples, with the component sum \sum_{k=1}^{N} w_{ik} = 1 for i = 1, \ldots, M. Let Y_k = (w_{ik}), i = 1, \ldots, M, denote the kth column of the matrix W. Since \mathrm{cov}(Y_k, \sum_{j=1}^{N} Y_j) = 0, we have

\sum_{j \neq k} \mathrm{cov}(Y_k, Y_j) = -\mathrm{var}(Y_k),   (14.11)

so the sum of the covariances of any variable with the remaining variables is negative. Thus each variable must be negatively correlated with at least one other variable and, in general, there is a strong bias toward negative correlation between variables of (relatively) large variance. One of the critical consequences of closure for materials science is that the usual correlation analysis can produce misleading associations between elemental concentrations. This is especially consequential since visualization of results by composition-structure-property correlation maps is so important in materials science [10, 76].

Fig. 14.8 (a) A set of 100 compositions generated from normal distributions of element quantities with normalization into the quaternary (N = 4), ternary (N = 3) and binary (N = 2) composition spaces. (b) The correlation of the concentration of element 1 with each other element is shown, with the magnitudes demonstrating induced correlation, and their variation with respect to N showing sub-compositional incoherence [61]

14.6.2 Illustrative Example

As an example, consider a set of M materials, each containing N elements, for which we would like to ascertain whether there is correlation of the concentration of element 1 with the other elements. As shown in Fig. 14.8a, a synthetic dataset is created by generating random quantities of the N = 4 elements from normal distributions. Due to the randomness, the element-pairwise correlation over the M = 400 materials is negligible when considering the quantities of the elements, which are non-normalized data. Measurements of the (normalized) composition of each material produce the M × N closed dataset w_{ik}. Using this simulated data, the Pearson correlation coefficient {}^{4}C_{k,l} of the concentration vectors Y_k and Y_l (elements k and l) can be calculated, where the superscript 4 indicates the dimension of the composition space (N = 4). Consider an extension of this example in which the concentration of the 4th element cannot be measured, so instead composition measurements are made in the N = 3 space and correlations {}^{3}C_{k,l} are calculated; a similar exercise can be performed for N = 2. The values of {}^{N}C_{k,l} plotted in Fig. 14.8b demonstrate some limitations of the usual statistics. Indeed, the correlation coefficients are skewed towards negative values due to the normalization-induced correlation, as indicated by (14.11). In fact, for the N = 2 case, the correlation coefficient is −1 because, due to the normalization, x_{i,2} = 1 − x_{i,1}. In other words, the correlation structure of a composition cannot be used to interpret correlations among the measured elemental concentrations and vice versa. It should be mentioned that other distance-based statistics like means, variances and standard deviations, as well as tasks such as clustering and multidimensional scaling, have similar limitations.
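The induced correlation and its dependence on the order of the subcomposition can be reproduced in a few lines of numpy; the sketch below is illustrative (the sample size of 400 and the normal raw quantities follow the example above, while the distribution parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 400
# Independent, positive "raw" element quantities: uncorrelated by construction
raw = rng.normal(loc=10.0, scale=1.0, size=(M, 4))

def close(q):
    """Closure: normalize each row to unit sum (compositions on the simplex)."""
    return q / q.sum(axis=1, keepdims=True)

for N in (4, 3, 2):
    comp = close(raw[:, :N])          # keep the first N elements, then renormalize
    corr = np.corrcoef(comp, rowvar=False)
    print(f"N = {N}: corr(element 1, others) =", np.round(corr[0, 1:], 3))
# The raw quantities are uncorrelated, yet the closed data show negative correlations
# whose values change with N: induced correlation and subcompositional incoherence.
```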
14.6.3 Sub-Compositional Coherence

An n-part composition (x_1, x_2, \ldots, x_n) with \sum_{i=1}^{n} x_i = 1 is called a subcomposition of an m-part composition (x_1, x_2, \ldots, x_m) with \sum_{i=1}^{m} x_i = 1 if m > n and (x_1, \ldots, x_n) is a subset of the elements (x_1, \ldots, x_m). A consequence of the constant sum constraint for compositional data is that sub-compositions may not reflect the variations present in the parent data, and as a result the covariance of elements may change substantially between different subsets of the parent data set. Every composition is a sub- or a parent composition depending on the objective of an experiment or the goal of data analysis. An experimentalist or a data analyst may not be able to take into account all elements (some elements may not be accessible), or may disregard some of the available elements if they are not pertinent to the objective. The following principle of sub-compositional coherence is an important concept of compositional analysis: any compositional data analysis should be done in a way that produces the same results in a sub-composition, regardless of whether we analyze only that sub-composition or a parent composition. Subcompositional incoherence of the Pearson correlation coefficient is demonstrated in Fig. 14.8, where, for a given pair of elements, {}^{N}C_{k,l} varies with the order N of the composition space. These effects of closure on the statistical analysis of compositional data, induced correlations and subcompositional incoherence, make traditional statistical methods invalid, and the artificial correlations obtained by applying such techniques may lead to false scientific discoveries and incorrect predictions. Moreover, methods that are based on a correlation matrix of observations, such as factor analysis, principal component analysis (PCA), cluster analysis, and kriging interpolation, to name just a few, would lead to inaccurate, warped results. Thus correlation analysis, and multivariate statistical analysis in general, of compositional data require special techniques in order to avoid producing false results.

14.6.4 Principled Analysis of Compositional Data

The fundamental building block of statistical analysis is the probabilistic model. A well-defined sample space is one of the basic elements in a probabilistic model, and as noted above, the composition sample space is a simplex [2]. All standard statistical methods assume that the sample space is the entire Euclidean space, while compositional data clearly do not satisfy this assumption. In order to deal with the closure effects described in the previous section, an approach based on a family of transformations, the so-called logratio transformations, has been introduced [1]. These transformations, based on logarithms of ratios of compositions, map the components of the compositions onto a Euclidean space, thus enabling one to apply classical statistical methods. In what follows, we briefly describe a few key concepts of such analysis [2]. The so-called alr (additive logratio) transform is defined for a given N-element composition x as an (N − 1)-element vector z with the components

z = \mathrm{alr}(x) = \left(\ln(x_1/x_N), \ldots, \ln(x_{N-1}/x_N)\right),   (14.12)

where one of the composition components is chosen as the common divisor. This logratio transform is invertible since there is a one-to-one correspondence between any N-part composition x and its logratio vector z. This means that any statement about the components of a composition can be expressed in terms of logratios and vice versa.
By defining the sum s_i = \sum_{j \neq i} \exp(z_j - z_i) = \sum_{j \neq i} x_j / x_i, the transformation from logratio to composition coordinates is given by

x_i = 1/(s_i + 1).   (14.13)

Because alr depends on the choice of x_N, this transform is not employed in our calculations, and a more suitable transform is discussed below. Later we utilize (14.12) and (14.13) only to illustrate the results of spatial compositional interpolation. To build a vector space structure on the simplex, the following operations were introduced by Aitchison. The closure operation \mathcal{C} is defined as

x = \mathcal{C}[u_1, \ldots, u_N] = \left(\frac{u_1}{u_1 + \cdots + u_N}, \ldots, \frac{u_N}{u_1 + \cdots + u_N}\right), \quad u_i \geq 0, \; x \in S^N \subset \mathbb{R}^{N-1},

where the u_i represent the raw data such as element quantities. Perturbation ⊕ is the equivalent of addition in Euclidean space and is defined as

w = x \oplus y = \mathcal{C}[x_1 y_1, x_2 y_2, \ldots, x_N y_N], \quad w, x, y \in S^N.   (14.14)

Powering is the equivalent of multiplying a vector by a scalar and is defined as

w = a \odot x = \mathcal{C}[x_1^a, x_2^a, \ldots, x_N^a], \quad x \in S^N, \; a \in \mathbb{R}.

The Aitchison inner product replaces the Euclidean inner product and is defined as

\langle x, y \rangle_A = \frac{1}{N} \sum_{i=1}^{N} \sum_{j>i} \ln(x_i/x_j)\, \ln(y_i/y_j), \quad x, y \in S^N.   (14.15)

Thus the norm of a vector, or its simplicial length, is \|x\|_A = \sqrt{\langle x, x \rangle_A}. This enables one to compute distances between compositional vectors, projections of compositional vectors, etc. The Aitchison distance is defined as

d_A(x, y) = \left\{\frac{1}{N} \sum_{i=1}^{N} \sum_{j>i} \left[\ln(x_i/x_j) - \ln(y_i/y_j)\right]^2\right\}^{1/2}, \quad x, y \in S^N.   (14.16)

Establishing a metric vector space structure in the simplex and utilizing orthonormal bases facilitates the application of complex statistical methods to the analysis of compositional data. The so-called isometric logratio (ilr) transform has important conceptual advantages and enables one to use balances, a particular form of ilr coordinates in an orthonormal basis. A balance is defined as

b_{pq} = \sqrt{\frac{pq}{p+q}}\; \ln\!\left(\frac{g(x_p)}{g(x_q)}\right),   (14.17)

where g(·) is the geometric mean of its argument, x_p is the group with p parts, and x_q is the group with q parts, the groups being obtained by sequential binary partition (see [53] and references therein). However, there is no obvious "optimal" basis, and the compositional biplot approach should be used to find one [2]. For an analysis to be subcompositionally coherent, it suffices to define variables using ratios of the composition values. The quantities x_1/x_2 and \ln(x_1/x_2) are invariant under changes of the composition order as they quantify the relative magnitudes of elemental concentrations rather than their absolute values, though the interpretation of the results in terms of the original variables is not always trivial. To study the correlation structure of compositions, Aitchison introduced a variation matrix T = \{\tau_{ij}\} of dimensions N × N with the elements

\tau_{ij} = \mathrm{var}[\ln(Y_i / Y_j)].   (14.18)

When the \tau_{ij} are large, there is no proportionality between the corresponding elements. If, however, the elements i and j are exactly proportional, then \tau_{ij} = 0. The scale of these variations can be determined by introducing the total variance as a normalized sum of the variances of all logratios,

V_{tot} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{N} \tau_{ij}.   (14.18a)

The variation matrix T (14.18, 14.18a) is instrumental in the analysis of associations between elemental concentrations in compositions. Such analysis will be discussed in greater detail in our forthcoming paper dedicated to covariance structures of screening libraries [62].
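The Aitchison operations and statistics above are straightforward to compute; the sketch below is a minimal numpy implementation (function names are illustrative, and the 1/N normalization follows (14.16) as written, since other references use slightly different conventions):

```python
import numpy as np

def closure(u):
    """Closure C[u]: rescale non-negative parts to unit sum."""
    u = np.asarray(u, dtype=float)
    return u / u.sum(axis=-1, keepdims=True)

def perturb(x, y):
    """Perturbation x (+) y, the simplex analogue of vector addition (14.14)."""
    return closure(np.asarray(x, float) * np.asarray(y, float))

def power(a, x):
    """Powering a (.) x, the simplex analogue of scalar multiplication."""
    return closure(np.asarray(x, float) ** a)

def aitchison_distance(x, y):
    """Aitchison distance (14.16), built from all pairwise logratios."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = x.size
    i, j = np.triu_indices(N, k=1)
    diff = np.log(x[i] / x[j]) - np.log(y[i] / y[j])
    return np.sqrt(np.sum(diff ** 2) / N)

def variation_matrix(W):
    """Variation matrix tau_ij = var[ln(Y_i/Y_j)] for closed data W (M x N), cf. (14.18)."""
    logW = np.log(np.asarray(W, float))
    N = logW.shape[1]
    return np.array([[np.var(logW[:, i] - logW[:, j]) for j in range(N)] for i in range(N)])

# Example usage on two ternary compositions
x = closure([0.2, 0.3, 0.5])
y = closure([0.4, 0.4, 0.2])
print(aitchison_distance(x, y), perturb(x, y), power(2.0, x))
```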
In what follows we apply balances (14.17) to spatial interpolation of compositional data.

14.6.5 Composition Spread and Distances

The evaluation of a composition spread is crucial in elucidating the composition dependence of materials properties. As a global measure of spread, one can use the metric variance (also known as the total variance or generalized variance), which is the average squared distance from the data points to their center [70, 71]. There are various measures of spread for compositional data, and they are all based on the distance defined in (14.16), which is very different from the Euclidean distance. As is always the case with non-Euclidean geometries, there is more than meets the eye, so to get a better feel for the geometry of the simplex, let us consider an illustrative example.

Fig. 14.9 Distances and straight lines in the composition space (see the text for details): a 'straight' lines in a simplex; b log-ratio distance matrix of Aitchison distances dA:

        c1     c2     c3     c4     c5     c6     c7
c1   0.000  1.165  2.297  5.488  4.802  5.302  1.931
c2          0.000  1.132  4.323  3.727  5.457  1.837
c3                 0.000  3.191  2.754  5.831  2.373
c4                        0.000  1.895  7.727  5.078
c5                               0.000  5.927  3.786
c6                                      0.000  3.620
c7                                             0.000

Since there is a vector space structure within the simplex, one can define geometric elements such as straight lines. Figure 14.9a displays seven compositions and the 'straight' lines between them: red square, c1 = (0.333, 0.333, 0.333), the center point; black star, c2 = (0.446, 0.446, 0.107); magenta star, c3 = (0.485, 0.485, 0.029); blue star, c4 = (0.499, 0.499, 0.001); green star, c5 = (0.091, 0.908, 0.001); red star, c6 = (0.001, 0.972, 0.027); cyan star, c7 = (0.067, 0.837, 0.097). Figure 14.9b shows the corresponding logratio distance matrix. It is instructive to note that the largest distance is dA(c5, c6) = 5.93, and that the following holds:

dA(c4, c5) = 1.90 < dA(c1, c3) = 2.30 < dA(c3, c4) = 3.19

In real data, zero components and missing values are often present. Moreover, concentrations below the instrument detection limit (DL) are routinely encountered in experiments. Usually such nondetects and missing values are erroneously replaced by zeros. Since the statistical analysis of compositional data is based on logratios, it cannot be applied to data with zero components. One approach to this problem is to transform N-element compositional data onto the surface of an (N − 1)-dimensional hypersphere [48], thus bringing the well-developed methods of directional data analysis to compositional data and allowing one to deal with zero components. After applying such transforms, the multiscale methods developed for general compact manifolds, and for a sphere in particular, will be especially useful for multiscale analysis of functions (such as FOMs) defined on composition space [6, 18, 20, 21, 50, 54–59]. Another recent approach is based on the finding that logratio analysis (LRA) is in fact intimately connected with correspondence analysis (CA) [23–25]. There exists a family of methods parameterized by a power transformation of the original compositional data: when this power is equal to 1 the resulting method is exactly CA, and when this power tends to zero the limiting method is exactly LRA. In between we have a continuum of interesting special cases, for example the square root and double square root transformations, but the main point is that these two apparently unrelated and competing methods are really members of a wider common family [24, 25].
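Returning to the illustrative example of Fig. 14.9, the distance matrix can be approximated directly from the listed three-part compositions using (14.16). The base R sketch below does this; because entries of Fig. 14.9b involving the very small 0.001 parts are sensitive to the rounding of the printed compositions, only approximate agreement with the figure should be expected for those pairs.

```r
# The seven example compositions of Fig. 14.9, as printed in the text.
comps <- rbind(c1 = c(0.333, 0.333, 0.333),
               c2 = c(0.446, 0.446, 0.107),
               c3 = c(0.485, 0.485, 0.029),
               c4 = c(0.499, 0.499, 0.001),
               c5 = c(0.091, 0.908, 0.001),
               c6 = c(0.001, 0.972, 0.027),
               c7 = c(0.067, 0.837, 0.097))

# Aitchison distance (14.16) between two compositions.
aitchison_dist <- function(x, y) {
  N <- length(x); s <- 0
  for (i in 1:(N - 1)) for (j in (i + 1):N)
    s <- s + (log(x[i] / x[j]) - log(y[i] / y[j]))^2
  sqrt(s / N)
}

# Full pairwise distance matrix, analogous to Fig. 14.9b.
D <- outer(1:nrow(comps), 1:nrow(comps),
           Vectorize(function(a, b) aitchison_dist(comps[a, ], comps[b, ])))
dimnames(D) <- list(rownames(comps), rownames(comps))
round(D, 3)   # e.g. D["c1", "c2"] is about 1.17, cf. the 1.165 entry
```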
14.6.6 Interpolation of Compositional Data: Composition Profiles from Sputtering

In this section we use spatial interpolation of composition measurements, a standard operation in combinatorial research, to demonstrate how the behavior of logratio variables differs from that of raw compositional variables. To create a synthetic dataset, we employ a common combinatorial synthesis technique, multi-source co-sputtering of a composition spread thin film. Combinatorial sputtering is commonly used for the synthesis of binary, ternary, and quaternary thin film libraries; see [61] for more details. We will assume that accurate, noise-free measurements of the 4 compositional variables are made on a set of 25 substrate positions chosen as a 5 × 5 square grid with 25 mm spacing. An appropriate spatial interpolation method should at least guarantee that the non-negativity and constant-sum constraints are satisfied. In fact, among the conventional unconstrained interpolation techniques, linear interpolation satisfies these requirements. However, the usual straightforward approaches, even if they satisfy the constraints, interpolate each component xi independently, thus ignoring the inner relationships between the compositional elements. Since our end goal is to enable the analysis of the covariance structure of compositions without the artifacts of induced correlation and sub-compositional incoherence, an approach that yields accurate values of the logratios employed by the simplicial distance (14.16) and the compositional covariance matrix T (14.18) is required. To achieve this, we utilize a broadly applicable and highly versatile technique based on kriging. The kriging-based interpolation was computed with the R language and environment for statistical computing [65] by applying the R package "compositions" [71]. The method exploits codependences in the composition and takes into account the spatial covariance structure by modeling the set of variograms for all possible pairwise balances (14.17). It accounts for various effects and parameters, including the nugget effect and the choice of exponential and spherical variograms, whose parameters we chose to be 62.5 and 162.0, respectively. Since this interpolation technique is specialized for compositional data, we refer to it as "compositional interpolation" and represent the result of compositional interpolation of the 25 zi measurements as CompInterp(zi). To attain an analogous result using traditional linear interpolation, xi and xN can be interpolated independently, followed by calculating zi via (14.12), resulting in a spatial map of zi referred to as LinInterp(zi). The results of the compositional and linear interpolations and their comparisons with the "perfect" data calculated from the model compositions are shown in Fig. 14.10. By definition, both interpolation methods produce exact values at each of the 25 locations in the sampling grid.

Fig. 14.10 Interpolation of a 4-element composition. z1: logratio of compositions x1 and x4, with the 25 sampling points marked by "×"; LinInterp(z1): logratio of the linearly interpolated x1 and x4; CompInterp(z1): logratio of the compositionally interpolated z1; the difference between the model data and its linear interpolation; and the difference between the model data and its compositional interpolation [61]

The performance of a given interpolation is thus assessed by evaluating the absolute magnitude and pattern of the interpolation error in the regions between the sampling points of the grid.
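The compositional interpolation just described relies on variogram-based kriging of all pairwise balances through the R package "compositions"; reproducing that machinery is beyond a short example. The base R sketch below only illustrates the general pipeline it rests on: transform each sampled composition to logratio coordinates (alr here, for simplicity, rather than the balances actually used), interpolate those coordinates spatially, and map the result back to the simplex by closure. A simple inverse-distance weighting stands in for kriging, and the grid, spacing, and smooth test profile are invented for illustration only.

```r
# Synthetic 5 x 5 sampling grid (positions in mm) with a smooth 3-part
# composition profile; any smooth, strictly positive functions would do.
pts  <- expand.grid(x = seq(0, 100, by = 25), y = seq(0, 100, by = 25))
raw  <- cbind(1 + pts$x / 50, 1 + pts$y / 50, 2 - pts$x * pts$y / 10000)
comp <- raw / rowSums(raw)                                   # closure

alr     <- function(x) log(x[-length(x)] / x[length(x)])     # (14.12)
alr_inv <- function(z) { u <- c(exp(z), 1); u / sum(u) }     # (14.13)

Z <- t(apply(comp, 1, alr))                                  # logratio coords

# Inverse-distance-weighted interpolation of one scalar field: a crude
# stand-in for the variogram/kriging model used in the chapter.
idw <- function(px, py, values, qx, qy, eps = 1e-9) {
  w <- 1 / ((px - qx)^2 + (py - qy)^2 + eps)
  sum(w * values) / sum(w)
}

# Compositional interpolation: interpolate the logratio coordinates,
# then close the result back onto the simplex.
comp_interp <- function(qx, qy) {
  z <- apply(Z, 2, function(col) idw(pts$x, pts$y, col, qx, qy))
  alr_inv(z)
}

# Component-wise interpolation of the raw parts, for comparison
# (analogous in spirit to LinInterp).
lin_interp <- function(qx, qy) {
  x <- apply(comp, 2, function(col) idw(pts$x, pts$y, col, qx, qy))
  x / sum(x)
}

comp_interp(37.5, 62.5)
lin_interp(37.5, 62.5)
```

Unlike kriging, inverse-distance weighting ignores the spatial covariance structure that the chapter's method is designed to exploit; the sketch is only meant to show where the logratio transform and the closure operation enter the workflow.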
Compared to linear interpolation, the compositional interpolation provides more accurate results, and its discrepancy varies smoothly over the entire interpolation region. It is important to note that the artificial 'patchiness' of the linear interpolation would distort the simplicial distances (14.16) between different compositions. Such distortions would lead to artificial associations in the analysis of the correlation structure of compositions and, more generally, to erroneous results in all calculations that involve distances, e.g. the mean and standard deviation. Kriging assumes that the observed values are a realization of a stochastic process, so the quantitative advantages of compositional interpolation based on kriging should become more pronounced as the variation of the composition variables increases. It is worth noting that there are other interpolation methods that preserve the non-negativity and constant-sum constraints, such as the local sample mean, inverse distance interpolation, and triangulation (since the weights they use range from 0 to 1 and sum to unity). However, unlike the approach utilized here (Tolosana-Delgado and van den Boogaart [53, 71]), those methods do not take into account the spatial covariance structure, which may be critical for statistical analysis. As combinatorial materials science continues to expand into high-order composition spaces, the prudent application of statistical methods developed specifically for CDA will be required to enable accurate data mining.

14.7 Summary and Conclusions

A central premise of high throughput combinatorial science is that systematic measurements of material libraries can reveal relationships among material composition, structure, performance, and other properties. To facilitate accurate and effective extraction of information from the large, complex data sets created by high throughput experiments, materials scientists must engage in close interdisciplinary communication and collaboration with researchers in other disciplines such as statistics, computer science, applied mathematics, and artificial intelligence. One step in fostering such collaboration would be to follow the example of top journals in other fields and have statisticians as members of the editorial boards of journals concerned with high throughput and combinatorial materials science. This would ensure that the adequacy of the statistical analysis used in papers is properly evaluated and, more importantly, would enable journals to formulate statistics guidelines for contributors [3, 30, 67]. Moreover, it is important to emphasize that statisticians should not only be consulted after data have already been generated, but rather should be involved in the design of experiments. It is only through the prudent incorporation of informatics in high throughput workflows that combinatorial materials science can be fully realized. This chapter introduced high throughput experimental pipelines and example data to illustrate two areas of informatics that are central to combinatorial materials science.
The high throughput strategy of tiered screening and the resulting complex datasets demonstrate the need for new statistical techniques that enable the generation of information-rich databases and provide accurate assessment of composition-property relationships. While further research is required to assimilate and advance these informatics techniques, foundational work in these research areas is presented.

Acknowledgments The authors would like to thank Prof. Alfred Ludwig for stimulating discussions. This work is performed by the Joint Center for Artificial Photosynthesis, a DOE Energy Innovation Hub, supported through the Office of Science of the U.S. Department of Energy under Award Number DE-SC000499.

References

1. J. Aitchison, The statistical analysis of compositional data (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 44, 139–177 (1982)
2. J. Aitchison, The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability (Chapman & Hall, London, 1986) (2nd edn. with additional material, The Blackburn Press, 2003)
3. D. Altman et al., Statistical guidelines for contributors to medical journals. BMJ 286, 1489–1493 (1983)
4. P.V. Balachandran, S.R. Broderick, K. Rajan, Identifying the inorganic gene for high-temperature piezoelectric perovskites through statistical learning. Proc. R. Soc. Math. Phys. Eng. Sci. 467, 2271–2290 (2011). doi:10.1098/rspa.2010.0543
5. S. Bandyopadhyay, U. Maulik, Nonparametric genetic clustering: comparison of validity indices. IEEE Trans. Syst. Man Cybern. Part C (Applications and Reviews) 31, 120–125 (2001). doi:10.1109/5326.923275
6. S. Bernstein, I. Pesenson, Crystallographic and geodesic Radon transforms on SO(3): motivation, generalization, discretization, in Geometric Analysis and Integral Geometry, Contemporary Mathematics, vol. 598 (2013) (a volume dedicated to the 85th birthday of S. Helgason)
7. M. Borenstein, L. Hedges, J. Higgins, H. Rothstein, Introduction to Meta-Analysis (Wiley, New York, 2009)
8. N. Boric, P.A. Estévez, Genetic programming-based clustering using an information theoretic fitness measure, in Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2007), pp. 31–38 (2007)
9. S.R. Broderick, K. Rajan, Eigenvalue decomposition of spectral features in density of states curves. EPL (Europhysics Letters) 95, 57005 (2011). doi:10.1209/0295-5075/95/57005
10. P.J.S. Buenconsejo, A. Ludwig, Composition-structure-function diagrams of Ti-Ni-Au thin film shape memory alloys. ACS Comb. Sci. 16, 678–685 (2014)
11. C.M. Caskey, R.M. Richards, D.S. Ginley, A. Zakutayev, Thin film synthesis and properties of copper nitride, a metastable semiconductor. Mater. Horiz. 1, 424 (2014). doi:10.1039/c4mh00049h
12. J.N. Cawse, Experimental Design for Combinatorial and High Throughput Materials Development (Wiley, New York, 2002)
13. T. Chikyow, P. Ahmet, K. Nakajima, T. Koida, M. Takakura, M. Yoshimoto, H. Koinuma, A combinatorial approach in oxide/semiconductor interface research for future electronic devices. Appl. Surf. Sci. 189, 284–291 (2002). doi:10.1016/S0169-4332(01)01004-2
14. Committee on the Analysis of Massive Data, Frontiers in Massive Data Analysis (The National Academies Press, Washington, 2013)
15. G. Cummings, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, London, 2012)
16. L. Cwiklik, B. Jagoda-Cwiklik, M. Frankowicz, Influence of the spacing between metal particles on the kinetics of reaction with spillover on the supported metal catalyst. Appl. Surf. Sci. 252(3), 778–783 (2005). doi:10.1016/j.apsusc.2005.02.107
17. Data-Enabled Science in the Mathematical and Physical Sciences, a workshop funded by the National Science Foundation (2010), https://www.nsf.gov/mps/dms/documents/DataEnabledScience.pdf
18. C. Durastanti, Y. Fantaye, F. Hansen, D. Marinucci, I. Pesenson, A simple proposal for radial 3D needlets. Phys. Rev. D (Accepted) (2015)
19. J. Fan, F. Han, H. Liu, Challenges in big data. Natl. Sci. Rev. 1, 1–22 (2014)
20. D. Geller, D. Marinucci, Spin wavelets on the sphere. J. Fourier Anal. Appl. 16, 840–884 (2010)
21. D. Geller, I. Pesenson, Bandlimited localized Parseval frames and Besov spaces on compact homogeneous manifolds. J. Geom. Anal. 21(2), 334–371 (2011)
22. E. Gokcay, J.C. Principe, Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 24, 158–171 (2002). doi:10.1109/34.982897
23. M.J. Greenacre, Correspondence Analysis in Practice (Chapman & Hall, London, 2007)
24. M.J. Greenacre, Log-ratio analysis is a limiting case of correspondence analysis. Math. Geosci. 42, 129–134 (2010)
25. M.J. Greenacre, Measuring subcompositional incoherence. Math. Geosci. 43, 681–693 (2011)
26. J. Gregoire, J. Haber, S. Mitrovic, C. Xiang, S. Suram, P. Newhouse, E. Soedarmadji, M. Marcin, K. Kan, D. Guevarra, Enabling solar fuels technology with high throughput experimentation, paper presented at the MRS Proceedings (2014)
27. J.M. Gregoire, D. Dale, A. Kazimirov, F.J. DiSalvo, R.B. van Dover, High energy x-ray diffraction/x-ray fluorescence spectroscopy for high-throughput analysis of composition spread thin films. Rev. Sci. Instrum. 80, 123905 (2009). doi:10.1063/1.3274179
28. J.M. Gregoire, D.G. Van Campen, C.E. Miller, R. Jones, S.K. Suram, A. Mehta, High throughput synchrotron X-ray diffraction for combinatorial phase mapping. J. Synchrotron Radiat. 21(6), 1262–1268 (2014)
29. J.M. Gregoire, C.X. Xiang, X.N. Liu, M. Marcin, J. Jin, Scanning droplet cell for high throughput electrochemical and photoelectrochemical measurements. Rev. Sci. Instrum. 84(2) (2013). doi:10.1063/1.4790419
30. Guidelines for Using Confidence Intervals for Public Health Assessment, Washington State Department of Health (2012)
31. J.A. Haber, Y. Cai, S. Jung, C. Xiang, S. Mitrovic, J. Jin, A.T. Bell, J.M. Gregoire, Discovering Ce-rich oxygen evolution catalysts, from high throughput screening to water electrolysis. Energy Environ. Sci. 7(2), 682 (2014a). doi:10.1039/c3ee43683g
32. J.A. Haber, D. Guevarra, S. Jung, J. Jin, J.M. Gregoire, Discovery of new oxygen evolution reaction electrocatalysts by combinatorial investigation of the Ni–La–Co–Ce oxide composition space. ChemElectroChem 1613–1617 (2014). doi:10.1002/celc.201402149
33. A. Shinde, R.J. Jones, D. Guevarra, S. Mitrovic, N. Becerra-Stasiewicz, J.A. Haber, J. Jin, J.M. Gregoire, High-throughput screening for acid-stable oxygen evolution electrocatalysts in the (Mn–Co–Ta–Sb)Ox composition space. Electrocatalysis 6(2), 229–236 (2015)
34. J.A. Haber, C. Xiang, D. Guevarra, S. Jung, J. Jin, J.M. Gregoire, High throughput mapping of electrochemical properties of (Ni-Fe-Co-Ce)Ox oxygen evolution catalysts. ChemElectroChem 1(3), 524–528 (2014)
35. J.R. Hattrick-Simpers, W.S. Hurst, S.S. Srinivasan, J.E. Maslar, Optical cell for combinatorial in situ Raman spectroscopic measurements of hydrogen storage materials at high pressures and temperatures. Rev. Sci. Instrum. 82, 033103 (2011). doi:10.1063/1.3558693
36. E. Jaynes, Information theory and statistical mechanics. Phys. Rev. 106, 620–630 (1957). doi:10.1103/PhysRev.106.620
37. R. Jenssen, D. Erdogmus, K. Hild, J.C. Principe, T. Eltoft, Optimizing the Cauchy-Schwarz PDF distance for information theoretic, non-parametric clustering, in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 34–35 (2005)
38. R. Jenssen, J.C. Principe, D. Erdogmus, T. Eltoft, The Cauchy-Schwarz divergence and Parzen windowing: connections to graph theory and Mercer kernels. J. Franklin Inst. 343, 614–629 (2006). doi:10.1016/j.jfranklin.2006.03.018
39. R.J. Jones, D. Guevarra, A.S. Shinde, C. Xiang, J.A. Haber, J. Jin, J.M. Gregoire, Parallel electrochemical treatment system. ACS Comb. Sci. 17(2), 71–75 (2015)
40. D. Kan, C.J. Long, C. Steinmetz, S.E. Lofland, I. Takeuchi, Combinatorial search of structural transitions: systematic investigation of morphotropic phase boundaries in chemically substituted BiFeO3. J. Mater. Res. 27, 2691–2704 (2012). doi:10.1557/jmr.2012.314
41. C.S. Kong, W. Luo, S. Arapan, P. Villars, S. Iwata, R. Ahuja, K. Rajan, Information-theoretic approach for the discovery of design rules for crystal chemistry. J. Chem. Inf. Model. 52, 1812–1820 (2012). doi:10.1021/ci200628z
42. J. Kruschke, Bayesian estimation supersedes the t-test. J. Exp. Psychol. Gen. (2012)
43. J. Kruschke, Doing Bayesian Data Analysis, 2nd edn. (Academic Press, Waltham, 2014)
44. A.G. Kusne, T. Gao, A. Mehta, L. Ke, M.C. Nguyen, K.-M. Ho, V. Antropov, C.-Z. Wang, M.J. Kramer, C. Long, I. Takeuchi, On-the-fly machine-learning for high-throughput experiments: search for rare-earth-free permanent magnets. Sci. Rep. 4, 6367 (2014). doi:10.1038/srep06367
45. R. Lebras, T. Damoulas, J.M. Gregoire, A. Sabharwal, C.P. Gomes, R.B. van Dover, Constraint reasoning and kernel clustering for pattern decomposition with scaling, in Proceedings of the 17th International Conference on Principles and Practice of Constraint Programming, pp. 508–522 (2011)
46. J. Leek, R. Scharpf, H. Bravo, Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. 1, 733–739 (2010)
47. H. Li, Y. Liang, Q. Xu, Support vector machines and its applications in chemistry. Chemom. Intell. Lab. Syst. 95, 188–198 (2009). doi:10.1016/j.chemolab.2008.10.007
48. K.V. Mardia, P.E. Jupp, Directional Statistics, 2nd edn. (Wiley, New York, 2000), p. 160
49. W.F. Maier, K. Stowe, S. Sieg, Combinatorial and high-throughput materials science. Angew. Chem. Int. Ed. 46, 6016–6067 (2007)
50. D. Marinucci, G. Peccati, Random Fields on the Sphere. London Mathematical Society Lecture Note Series (2011)
51. S. Mitrovic, E. Soedarmadji, P.F. Newhouse, S. Suram, J.A. Haber, J. Jin, J.M. Gregoire, Colorimetric screening for high-throughput discovery of light absorbers. ACS Comb. Sci.
52. D.P. Muni, N.R. Pal, J. Das, A novel approach to design classifiers using genetic programming. IEEE Trans. Evol. Comput. 8, 183–196 (2004). doi:10.1109/TEVC.2004.825567
53. V. Pawlowsky-Glahn, A. Buccianti (eds.), Compositional Data Analysis: Theory and Applications (Wiley, New York, 2011)
54. I. Pesenson, Sampling of Paley-Wiener functions on stratified groups. J. Fourier Anal. Appl. 4(3), 271–281 (1998)
55. I. Pesenson, Paley-Wiener approximations and multiscale approximations in Sobolev and Besov spaces on manifolds. J. Geom. Anal. 19(2), 390–419 (2009)
56. I. Pesenson, A sampling theorem on homogeneous manifolds. Trans. Am. Math. Soc. 352(9), 4257–4269 (2000)
57. I. Pesenson, Splines and wavelets on geophysically relevant manifolds, in Springer Handbook of Geomathematics (Springer, Berlin, 2015), pp. 1–32
58. I. Pesenson, Multiresolution analysis on compact Riemannian manifolds, in Multiscale Analysis and Nonlinear Dynamics: From Genes to the Brain, ed. by M. Pesenson (Wiley-VCH, Weinheim, 2013), pp. 65–82
59. M.Z. Pesenson, I.Z. Pesenson, Adaptive multiresolution analysis based on synchronization. Phys. Rev. E 84, 045202(R) (2011)
60. M.Z. Pesenson, Multiscale Analysis—Modeling, Data, Networks, and Nonlinear Dynamics, in Multiscale Analysis and Nonlinear Dynamics, Wiley Reviews of Nonlinear Dynamics and Complexity, ed. by M.Z. Pesenson (Wiley-VCH, Weinheim, 2013), pp. 1–19
61. M.Z. Pesenson, S. Suram, J.M. Gregoire, Statistical analysis and interpolation of compositional data in materials science. ACS Comb. Sci. 17(2), 130–136 (2015)
62. M.Z. Pesenson, S. Suram, J. Haber, D. Guevara, P. Newhouse, E. Soedarmadji, J.M. Gregoire, Correlation structure of high throughput composition screening libraries (in preparation) (2015)
63. R. Potyrailo, V.M. Mirsky, Combinatorial Methods for Chemical and Biological Sensors (Springer Science & Business Media, Berlin, 2009), p. 125
64. J. Principe, D. Xu, J. Fisher, Information theoretic learning, in Unsupervised Adaptive Filtering, vol. 1 (Wiley, New York, 2000)
65. R Development Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2004)
66. K. Rajan, Combinatorial materials sciences: experimental strategies for accelerated knowledge discovery. Ann. Rev. Mater. Res. 38, 299–322 (2008)
67. H. Roediger, What's New at Psychological Science: An Interview with the Editor in Chief (2013). http://www.psychologicalscience.org/index.php/publications/observer/2013/november-13/whats-new-at-psychological-science.html
68. X. Shi, J. Luo, N.P. Njoki, Y. Lin, T.-H. Lin, D. Mott, S. Lu, C.-J. Zhong, Combinatorial assessment of the activity-composition correlation for several alloy nanoparticle catalysts. Ind. Eng. Chem. Res. 47, 4675–4682 (2008). doi:10.1021/ie800308h
69. S.K. Suram, J.A. Haber, J. Jin, J. Gregoire, Generating information rich high-throughput experimental materials genomes using functional clustering via multi-tree genetic programming and information theory. ACS Comb. Sci. 17(4), 224–233 (2015)
70. R. Tolosana-Delgado, K. van den Boogaart, V. Pawlowsky-Glahn, Geostatistics for compositions, in Compositional Data Analysis: Theory and Applications, ed. by V. Pawlowsky-Glahn, A. Buccianti (Wiley, Chichester, 2011), pp. 73–86
71. K. van den Boogaart, R. Tolosana-Delgado, Analyzing Compositional Data with R, Use R! Series (Springer, Berlin, 2013)
72. D.W. van der Merwe, A.P. Engelbrecht, Data clustering using particle swarm optimization. 2003 Congr. Evol. Comput. 1, 215–220 (2003). doi:10.1109/CEC.2003.1299577
73. R. Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy, vol. 2 (Springer, New York, 2010)
74. C. Xiang, J. Haber, M. Marcin, S. Mitrovic, J. Jin, J.M. Gregoire, Mapping quantum yield for (Fe-Zn-Sn-Ti)Ox photoabsorbers using a high throughput photoelectrochemical screening system. ACS Comb. Sci. 16(3), 120–127 (2014a). doi:10.1021/co400081w
75. C. Xiang, S.K. Suram, J.A. Haber, D.W. Guevarra, J. Jin, J.M. Gregoire, A high throughput bubble screening method for combinatorial discovery of electrocatalysts for water splitting. ACS Comb. Sci. 16(2), 47–52 (2014b)
76. R. Zarnetta, P.J.S. Buenconsejo, A. Savan, S. Thienhaus, A. Ludwig, High-throughput study of martensitic transformations in the complete Ti–Ni–Cu system. Intermetallics 26, 98–109 (2012)