LNAI 10086 Jinyan Li · Xue Li Shuliang Wang · Jianxin Li Quan Z. Sheng (Eds.) Advanced Data Mining and Applications 12th International Conference, ADMA 2016 Gold Coast, QLD, Australia, December 12–15, 2016 Proceedings 123 Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science LNAI Series Editors Randy Goebel University of Alberta, Edmonton, Canada Yuzuru Tanaka Hokkaido University, Sapporo, Japan Wolfgang Wahlster DFKI and Saarland University, Saarbrücken, Germany LNAI Founding Series Editor Joerg Siekmann DFKI and Saarland University, Saarbrücken, Germany 10086 More information about this series at http://www.springer.com/series/1244 Jinyan Li Xue Li Shuliang Wang Jianxin Li Quan Z. Sheng (Eds.) • • Advanced Data Mining and Applications 12th International Conference, ADMA 2016 Gold Coast, QLD, Australia, December 12–15, 2016 Proceedings 123 Editors Jinyan Li University of Technology Sydney Ultimo, NSW Australia Jianxin Li University of Western Australia Crawley, WA Australia Xue Li University of Queensland Brisbane, QLD Australia Quan Z. Sheng University of Adelaide Adelaide, SA Australia Shuliang Wang Beijing Institute of Technology Beijing China ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Artificial Intelligence ISBN 978-3-319-49585-9 ISBN 978-3-319-49586-6 (eBook) DOI 10.1007/978-3-319-49586-6 Library of Congress Control Number: 2016957378 LNCS Sublibrary: SL7 – Artificial Intelligence © Springer International Publishing AG 2016 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. Printed on acid-free paper This Springer imprint is published by Springer Nature The registered company is Springer International Publishing AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Preface The year 2016 marked the 12th anniversary of the International Conference on Advanced Data Mining and Applications (ADMA). This conference series has been organized and held in China since 2005. ADMA 2016 was the first time the conference was held outside China, taking place on Australia’s Gold Coast during December 12–15, 2016. ADMA aims at bringing together both young researchers and senior experts from the world, and also at providing a leading international forum for the dissemination of original data mining findings and practical data mining experience. Over the years, ADMA has grown to become a flagship conference in the field of data mining and applications. 
This volume comprises 57 thoroughly reviewed and carefully revised full-length papers that were all presented at the conference. The conference Program Committee received 105 high-quality manuscripts. Each of them was assigned to at least three Program Committee members for comments. All papers were rigorously reviewed and had at least three reviews. Some borderline papers were even reviewed by five members of the committee if the first three reviews were not adequate for making a rejection or acceptance decision. Several research teams showed strong interest in ADMA 2016, and submitted multiple high-quality papers of different topics. It was difficult for the committee chairs to exclude some of these papers so as to reserve some space for other teams to present their latest research achievements. Although many papers were worthy of publication, only 18 spotlight research papers (acceptance rate of 17%, 15 pages each) and 39 regular research papers (13 pages each) could be accepted for presentation at the conference and publication in this volume. The selected papers covered a wide variety of important topics in the area of data mining, including parallel and distributed data mining algorithms, mining on data streams, graph mining, spatial data mining, multimedia data mining, Web mining, the Internet of Things, health informatics, and biomedical data mining. The conference program of ADMA 2016 was also complemented by four outstanding keynotes given by Yufei Tao, Pablo Moscato, Scott Johnston, and Gao Cong, as well as several research demonstrations. We would like to particularly thank the keynote speakers for presenting algorithms and ideas to solve data mining problems in the frontier areas of the field. We thank the Program Committee members for their time and invaluable comments. The schedule of the reviewing process was extremely tight. The members’ tremendous efforts to complete the review reports before the deadline are greatly appreciated. We read all of the review reports and found some of them to be truly excellent, as good as what is usually found in a survey—critical, digestive, and detailed. These comments were very helpful for us in selecting the papers. Thank you all and may the papers collected in this volume inspire the readers and open new windows of research. We would like to express our gratitude to all individuals, institutions, and sponsors that supported ADMA 2016. This high-quality program would not have been possible without the expertise and dedication of our Program Committee members. We are VI Preface grateful for the guidance of the general chairs (Michael Sheng and Osmar Zaiane), the tireless efforts of the demonstration chairs (Ji Zhang and Mingkui Tan), the publicity chair (Xiujuan Xu), the publication chair (Jianxin Li), the award chair (Zhifeng Bao), the Web chair (Wenjie Ruan), the special issue chair (Yongrui Qin), and the local organization chair (Weitong Chen). We would also like to acknowledge the support of the members of the conference Steering Committee. All of them helped make ADMA 2016 a success. Finally, we would like to thank all researchers, practitioners, and students who contributed with their work and participated in the conference. We hope that you find the papers in the proceedings interesting and stimulating. 
December 2016 Jinyan Li Xue Li Shuliang Wang Organization General Co-chairs Michael Sheng Osmar Zaiane University of Adelaide, Australia University of Alberta, Canada Program Co-chairs Jinyan Li Xue Li Shuliang Wang University of Technology Sydney, Australia University of Queensland, Australia Beijing Institute of Technology, China Demo Co-chairs Mingkui Tan Ji Zhang University of Adelaide, Australia University of Southern Queensland, Australia Proceedings Chair Jianxin Li University of Western Australia, Australia Awards Committee Chair Zhifeng Bao RMIT, Australia Publicity Chair Xiujuan Xu Dalian University of Technology, China Special Issue Chair Yongrui Qin University of Huddersfield, UK Sponsorship Co-chairs Shichao Zhang Xianxiu Si Guangxi Normal University, China China Academy of Telecom Research, China Local Chair Weitong Chen University of Queensland, Australia VIII Organization Web Chair Wenjie Ruan University of Adelaide, Australia Steering Committee Jie Cao Xue Li Michael Sheng Jie Tang Kyu-Young Whang Min Yao Osmar Zaiane Chengqi Zhang Shichao Zhang Shuliang Wang Nanjing University of Finance and Economics, China University of Queensland, Australia (Chair) University of Adelaide, Australia Tsinghua University, China Korea Advanced Institute of Science and Technology, Korea Zhejiang University, China University of Alberta, Canada University of Technology Sydney, Australia Guangxi Normal University, China Beijing Institute of Technology, China Program Committee Ayan Acharya Djamal Benslimane Jin Chen Liang Chen Chenwei Deng Xiangjun Dong Xiu Susie Fang Stefano Ferilli Philippe Fournier-Viger Xiaoying Gao Sung Ho Ha Jing He Qing He Wei Hu Xiaohua Tony Hu Guangyan Huang Akihiro Inokuchi Daisuke Kawahara Gang Li Haiquan Li Jianxin Li Jinyan Li Xue Li Zhixin Li Bo Liu University of Texas at Austin, USA Lyon 1 University, France Michigan State University, USA RMIT University, Australia Beijing Institute of Technology, China Qilu University of Technology, China University of Adelaide, Australia University of Bari, Italy Harbin Institute of Technology Shenzhen Graduate School, China Victoria University of Wellington, New Zealand Kyungpook National University, South Korea Victoria University, Australia Institute of Computing Technology, CAS, China Nanjing University, China Drexel University, USA Deakin University, Australia Kwansei Gakuin University, Japan Kyoto University, Japan Deakin University, Australia University of Arizona, USA University of Western Australia University of Technology Sydney, Australia University of Queensland, Australia Guangxi Normal University, China Guangdong University of Technology, China Organization Guohua Liu Yubao Liu Xudong Luo Jiangang Ma Marco Maggini Toshiro Minami Yasuhiko Morimoto Pablo Moscato Shirui Pan Tao Peng Shweta Purawat Tieyun Qian Kai Qin Yongrui Qin Wenjie Ruan Michele Ruta Arun Kumar Sangaiah Dharmendra Sharma Ali Shemshadi Michael Sheng Guojie Song Eiji Uchino Li Wan Hongzhi Wang Qinsi Wang Shuliang Wang Xianzhi Wang Zhihui Wang Jia Wu Zhiang Wu Feng Xia Zhipeng Xie Guandong Xu Xiujuan Xu Bing Xue Zijiang Yang Lina Yao Min Yao Dayong Ye Chao Yu Fusheng Yu Qi Yu Osmar R. 
Zaïane Chunxia Zhang Yanshan University, China Sun Yat-Sen University, China Sun Yat-Sen University, China Victoria University, Australia University of Siena, Italy Kyushu Institute of Information Sciences and Kyushu University Library, Japan Hiroshima University, Japan University of Newcastle, Australia University of Technology Sydney, Australia University of Illinois at Urbana-Champaign, USA San Diego Supercomputer Center, UCSD, USA Wuhan University, China RMIT University, Australia University of Huddersfield, UK University of Adelaide, Australia Politecnico di Bari, Italy VIT University, India University of Canberra, Australia University of Adelaide, Australia University of Adelaide, Australia Peking University, Beijing, China Yamaguchi University, Japan Google, USA Harbin Institute of Technology, China Carnegie Mellon University, USA Beijing Institute of Technology, China University of Adelaide, Australia Fudan University, China University of Technology, Sydney, Australia Nanjing University of Finance and Economics, China Dalian University of Technology, China Fudan University, China University of Technology Sydney, Australia Dalian University of Technology, China Victoria University of Wellington, New Zealand York University, UK University of New South Wales, Australia Zhejiang University, China University of Wollongong, Australia University of Wollongong, Australia Beijing Normal University, China Rochester Institute of Technology, USA University of Alberta, Canada Beijing Institute of Technology, China IX X Organization Guang Lan Zhang Mengjie Zhang Shichao Zhang Wei Emma Zhang Xin Zhao Nenggan Zheng Yong Zheng Mingyang Zhong Hill Zhu Xiaofeng Zhu Yi Zhuang Boston University, USA Victoria University of Wellington, New Zealand Guangxi Normal University, China University of Adelaide, Australia University of Queensland, Australia Zhejiang University, China Illinois Institute of Technology, USA University of Queensland, Australia Florida Atlantic University, USA Guangxi Normal University, China Zhejiang Gongshang University, China Additional Reviewers Al-Sahaf, Harith Bai, Xiaomei Bekele, Teshome Megersa Boyu, Zhang Chen, Qi Dam, Thu-Lan Fan, Wei Gao, Qian Gong, Saisai Gueniche, Ted Hanane, Amirat Keskin, Derin Li, Xin Liu, Shaowu Lu, Guangquan Nguyen, Bach Nguyen, Thuong Peng, Jiajie Qian, Xinqi Quang Huy, Duong Rahim, Azizur Siers, Michael Sun, Zequn Vu, Huy Quan Wan, Yao Wang, Biying Wang, Dongjing Wang, Hongtao Wang, Qingxiang Wang, Wei Wu, Cheng Wei Wu, Runze Xu, Peipei Yang, Shuiqiao Yu, Shuo Zhang, Boyu Zhang, Yongshan Zhao, Qi Contents Spotlight Research Papers Effective Monotone Knowledge Integration in Kernel Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher Bartley, Wei Liu, and Mark Reynolds 3 Textual Cues for Online Depression in Community and Personal Settings . . . Thin Nguyen, Svetha Venkatesh, and Dinh Phung 19 Confidence-Weighted Bipartite Ranking. . . . . . . . . . . . . . . . . . . . . . . . . . . Majdi Khalid, Indrakshi Ray, and Hamidreza Chitsaz 35 Mining Distinguishing Customer Focus Sets for Online Shopping Decision Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lu Liu, Lei Duan, Hao Yang, Jyrki Nummenmaa, Guozhu Dong, and Pan Qin Community Detection in Networks with Less Significant Community Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Ba-Dung Le, Hung Nguyen, and Hong Shen 50 65 Prediction-Based, Prioritized Market-Share Insight Extraction . . . . . . . . . . . . Renato Keshet, Alina Maor, and George Kour 81 Interrelationships of Service Orchestrations. . . . . . . . . . . . . . . . . . . . . . . . . Victor W. Chu, Raymond K. Wong, Fang Chen, and Chi-Hung Chi 95 Outlier Detection on Mixed-Type Data: An Energy-Based Approach . . . . . . . Kien Do, Truyen Tran, Dinh Phung, and Svetha Venkatesh 111 Low-Rank Feature Reduction and Sample Selection for Multi-output Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shichao Zhang, Lifeng Yang, Yonggang Li, Yan Luo, and Xiaofeng Zhu Biologically Inspired Pattern Recognition for E-nose Sensors . . . . . . . . . . . . Sanad Al-Maskari, Wenping Guo, and Xiaoming Zhao Addressing Class Imbalance and Cost Sensitivity in Software Defect Prediction by Combining Domain Costs and Balancing Costs . . . . . . . . . . . . Michael J. Siers and Md Zahidul Islam 126 142 156 XII Contents Unsupervised Hypergraph Feature Selection with Low-Rank and Self-Representation Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei He, Xiaofeng Zhu, Yonggang Li, Rongyao Hu, Yonghua Zhu, and Shichao Zhang 172 Improving Cytogenetic Search with GPUs Using Different String Matching Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chantana Chantrapornchai and Chidchanok Choksuchat 188 CEIoT: A Framework for Interlinking Smart Things in the Internet of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ali Shemshadi, Quan Z. Sheng, Yongrui Qin, and Ali Alzubaidi 203 Adopting Hybrid Descriptors to Recognise Leaf Images for Automatic Plant Specie Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ali A. Al-kharaz, Xiaohui Tao, Ji Zhang, and Raid Lafta 219 Efficient Mining of Pan-Correlation Patterns from Time Course Data . . . . . . Qian Liu, Jinyan Li, Limsoon Wong, and Kotagiri Ramamohanarao Recognizing Daily Living Activity Using Embedded Sensors in Smartphones: A Data-Driven Approach . . . . . . . . . . . . . . . . . . . . . . . . . Wenjie Ruan, Leon Chea, Quan Z. Sheng, and Lina Yao Dynamic Reverse Furthest Neighbor Querying Algorithm of Moving Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bohan Li, Chao Zhang, Weitong Chen, Yingbao Yang, Shaohong Feng, Qiqian Zhang, Weiwei Yuan, and Dongjing Li 234 250 266 Research Papers Relative Neighborhood Graphs Uncover the Dynamics of Social Media Engagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Natalie Jane de Vries, Ahmed Shamsul Arefin, Luke Mathieson, Benjamin Lucas, and Pablo Moscato 283 An Ensemble Approach for Better Truth Discovery . . . . . . . . . . . . . . . . . . . Xiu Susie Fang, Quan Z. Sheng, and Xianzhi Wang 298 Single Classifier Selection for Ensemble Learning . . . . . . . . . . . . . . . . . . . . Guangtao Wang, Xiaomei Yang, and Xiaoyan Zhu 312 Community Detection in Dynamic Attributed Graphs . . . . . . . . . . . . . . . . . Gonzalo A. Bello, Steve Harenberg, Abhishek Agrawal, and Nagiza F. Samatova 329 Contents XIII Secure Computation of Skyline Query in MapReduce . . . . . . . . . . . . . . . . . Asif Zaman, Md. 
Anisuzzaman Siddique, Annisa, and Yasuhiko Morimoto 345 Recommending Features of Mobile Applications for Developer . . . . . . . . . . Hong Yu, Yahong Lian, Shuotao Yang, Linlin Tian, and Xiaowei Zhao 361 Adaptive Multi-objective Swarm Crossover Optimization for Imbalanced Data Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinyan Li, Simon Fong, Meng Yuan, and Raymond K. Wong 374 Causality-Guided Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mandar S. Chaudhary, Doel L. Gonzalez II, Gonzalo A. Bello, Michael P. Angus, Dhara Desai, Steve Harenberg, P. Murali Doraiswamy, Fredrick H. M. Semazzi, Vipin Kumar, and Nagiza F. Samatova, for the Alzheimer’s Disease Neuroimaging Initiative 391 Temporal Interaction Biased Community Detection in Social Networks . . . . . Noha Alduaiji, Jianxin Li, Amitava Datta, Xiaolu Lu, and Wei Liu 406 Extracting Key Challenges in Achieving Sobriety Through Shared Subspace Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Haripriya Harikumar, Thin Nguyen, Santu Rana, Sunil Gupta, Ramachandra Kaimal, and Svetha Venkatesh 420 Unified Weighted Label Propagation Algorithm Using Connection Factor . . . Xin Wang, Songlei Jian, Kai Lu, and Xiaoping Wang 434 MetricRec: Metric Learning for Cold-Start Recommendations . . . . . . . . . . . . Furong Peng, Xuan Lu, Jianfeng Lu, Richard Yi-Da Xu, Cheng Luo, Chao Ma, and Jingyu Yang 445 Time Series Forecasting on Engineering Systems Using Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dongxu Shao, Tianyou Zhang, Kamal Mannar, and Yue Han 459 EDAHT: An Expertise Degree Analysis Model for Mass Comments in the E-Commerce System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiang Zhong, You Xiong, Weili Guo, and Jingyi Xie 472 A Scalable Document-Based Architecture for Text Analysis . . . . . . . . . . . . . Ciprian-Octavian Truică, Jérôme Darmont, and Julien Velcin DAPPFC: Density-Based Affinity Propagation for Parameter Free Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hanning Yuan, Shuliang Wang, Yang Yu, and Ming Zhong 481 495 XIV Contents Effective Traffic Flow Forecasting Using Taxi and Weather Data . . . . . . . . . Xiujuan Xu, Benzhe Su, Xiaowei Zhao, Zhenzhen Xu, and Quan Z. Sheng Understanding Behavioral Differences Between Short and Long-Term Drinking Abstainers from Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . Haripriya Harikumar, Thin Nguyen, Sunil Gupta, Santu Rana, Ramachandra Kaimal, and Svetha Venkatesh Discovering Trip Hot Routes Using Large Scale Taxi Trajectory Data . . . . . . Linjiang Zheng, Qisen Feng, Weining Liu, and Xin Zhao Discovering Spatially Contiguous Clusters in Multivariate Geostatistical Data Through Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Francky Fouedjio On Improving Random Forest for Hard-to-Classify Records . . . . . . . . . . . . . Md Nasim Adnan and Md Zahidul Islam Scholarly Output Graph: A Graphical Article-Level Metric Indicating the Impact of a Scholar’s Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu Liu, Dan Lin, Jing Li, and Shimin Shan Distributed Lazy Association Classification Algorithm Based on Spark . . . . . 
Xueming Li, Chaoyang Zhang, Guangwei Chen, Xiaoteng Sun, Qi Zhang, and Haomin Yang Event Evolution Model Based on Random Walk Model with Hot Topic Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chunzi Wu, Bin Wu, and Bai Wang 507 520 534 547 558 567 580 591 Knowledge-Guided Maximal Clique Enumeration . . . . . . . . . . . . . . . . . . . . Steve Harenberg, Ramona G. Seay, Gonzalo A. Bello, Rada Y. Chirkova, P. Murali Doraiswamy, and Nagiza F. Samatova 604 Got a Complaint?- Keep Calm and Tweet It! . . . . . . . . . . . . . . . . . . . . . . . Nitish Mittal, Swati Agarwal, and Ashish Sureka 619 Query Classification by Leveraging Explicit Concept Information . . . . . . . . . Fang Wang, Ze Yang, Zhoujun Li, and Jianshe Zhou 636 Stabilizing Linear Prediction Models Using Autoencoder . . . . . . . . . . . . . . . Shivapratap Gopakumar, Truyen Tran, Dinh Phung, and Svetha Venkatesh 651 Mining Source Code Topics Through Topic Model and Words Embedding. . . Wei Emma Zhang, Quan Z. Sheng, Ermyas Abebe, M. Ali Babar, and Andi Zhou 664 Contents XV IPC Multi-label Classification Based on the Field Functionality of Patent Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sora Lim and YongJin Kwon 677 Unsupervised Component-Wise EM Learning for Finite Mixtures of Skew t-distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sharon X. Lee and Geoffrey J. McLachlan 692 Supervised Feature Selection by Robust Sparse Reduced-Rank Regression . . . Rongyao Hu, Xiaofeng Zhu, Wei He, Jilian Zhang, and Shichao Zhang PUEPro: A Computational Pipeline for Prediction of Urine Excretory Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Wang, Wei Du, Yanchun Liang, Xin Chen, Chi Zhang, Wei Pang, and Ying Xu 700 714 Partitioning Clustering Based on Support Vector Ranking . . . . . . . . . . . . . . Qing Peng, Yan Wang, Ge Ou, Yuan Tian, Lan Huang, and Wei Pang 726 Global Recursive Based Node Importance Evaluation . . . . . . . . . . . . . . . . . Lu Zhao, Li Xiong, and Shan Xue 738 Extreme User and Political Rumor Detection on Twitter . . . . . . . . . . . . . . . Cheng Chang, Yihong Zhang, Claudia Szabo, and Quan Z. Sheng 751 Deriving Public Sector Workforce Insights: A Case Study Using Australian Public Sector Employment Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shameek Ghosh, Yi Zheng, Thorsten Lammers, Ying Ying Chen, Carolyn Fitzmaurice, Scott Johnston, and Jinyan Li Real-Time Stream Mining Electric Power Consumption Data Using Hoeffding Tree with Shadow Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simon Fong, Meng Yuen, Raymond K. Wong, Wei Song, and Kyungeun Cho Real-Time Investigation of Flight Delays Based on the Internet of Things Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abdulwahab Aljubairy, Ali Shemshadi, and Quan Z. Sheng 764 775 788 Demo Papers IRS-HD: An Intelligent Personalized Recommender System for Heart Disease Patients in a Tele-Health Environment . . . . . . . . . . . . . . . . . . . . . . Raid Lafta, Ji Zhang, Xiaohui Tao, Yan Li, and Vincent S. Tseng 803 XVI Contents Sentiment Analysis for Depression Detection on Social Networks . . . . . . . . . Xiaohui Tao, Xujuan Zhou, Ji Zhang, and Jianming Yong 807 Traffic Flow Visualization Using Taxi GPS Data . . . . . . . . . . 
. . . . . . . . . . Xiujuan Xu, Zhenzhen Xu, and Xiaowei Zhao 811 Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815 Spotlight Research Papers Effective Monotone Knowledge Integration in Kernel Support Vector Machines Christopher Bartley(B) , Wei Liu, and Mark Reynolds School of Computer Science, University of Western Australia, Perth, WA 6009, Australia christopher.bartley@research.uwa.edu.au http://web.csse.uwa.edu.au Abstract. In many machine learning applications there exists prior knowledge that the response variable should be increasing (or decreasing) in one or more of the features. This is the knowledge of ‘monotone’ relationships. This paper presents two new techniques for incorporating monotone knowledge into non-linear kernel support vector machine classifiers. Incorporating monotone knowledge is useful because it can improve predictive performance, and satisfy user requirements. While this is relatively straight forward for linear margin classifiers, for kernel SVM it is more challenging to achieve efficiently. We apply the new techniques to real datasets and investigate the impact of monotonicity and sample size on predictive accuracy. The results show that the proposed techniques can significantly improve accuracy when the unconstrained model is not already fully monotone, which often occurs at smaller sample sizes. In contrast, existing techniques demonstrate a significantly lower capacity to increase monotonicity or achieve the resulting accuracy improvements. 1 Introduction Prior domain knowledge has many forms, such as knowledge of class invariance in regions of the input space [15] (e.g. if tumor > 4 ∧ lymphnodes > 5 =⇒ Recurrence), class invariance under transformations of the input [12] (used mostly for image processing) and shape knowledge such as convexity [24] and monotonicity [2,13]. We focus in this paper on monotone knowledge. A monotone relationship between X and Y means that an increase in X should not lead to a decrease in Y. For example, a house with three bedrooms should not be cheaper than one with two bedrooms (all other factors constant). As the base algorithm we use the flexible and powerful kernel Support Vector Machine (SVM). Although this is an older algorithm dating back to 1995 [3], its power and flexibility was recently (somewhat surprisingly) reasserted by the Radial Basis Function (RBF) kernel variant achieving second place in a comprehensive test of 179 classifiers on 121 datasets [8], despite competing against many newer algorithms. Thus we use the Radial Basis Function kernel in this c Springer International Publishing AG 2016  J. Li et al. (Eds.): ADMA 2016, LNAI 10086, pp. 3–18, 2016. DOI: 10.1007/978-3-319-49586-6 1 4 C. Bartley et al. paper, although our technique is applicable to other Mercer kernels such as sigmoid or polynomial. By extending kernel SVM we seek to improve on one of the best non-parametric classifiers. While a monotone SVM algorithm (MC-SVM) was proposed in 2014 by Li and Chen in [2,13], this paper proposes a new framework for how the monotone knowledge is integrated, called partially monotone SVM (PM-SVM). Our experiments suggest Li and Chen’s techniques yield only limited increases in monotonicity, and as a result the accuracy improvements are also limited. In contrast our techniques substantially increase monotonicity, and hence also result in substantial accuracy increases. 
We hope to stimulate interest in this area and have provided a MATLAB implementation of the algorithm on GitHub (https://github.com/chriswbartley/PMSVM). This paper is organised as follows. Section 2 outlines partially monotone classification and SVM. Section 3 is our core contribution of a measure for partial monotonicity and new techniques for integrating it into SVM. Section 4 describes the experiments and Sect. 5 reports the results, concluding in Sect. 6.

2 Background

2.1 Partially Monotone Classification

Monotone prior knowledge is easy to obtain from experts or domain knowledge for many types of problems, and is informative without being overly prescriptive. Monotonicity between input variable x_j and output y may be defined as: for an increase in variable x_j, variable y should not decrease (all other variables held constant). Monotonicity can apply to problems where the model output is ordered, such as regression and ordinal classification. This paper considers ordinal classification, where the task is to assign objects to two or more classes when there is an order assigned to the classes, but the distances between classes are irrelevant. Examples of ordinal classification include credit ratings (AAA, AA, A, BBB, ...), where a better rating may be considered 'higher'. Similarly a cancer diagnosis of No/Yes may be considered ordered (in terms of the risk of being cancerous). The classification literature on monotonicity [2,5–7,13,16,17,19–21,23] ubiquitously defines monotonicity in the context of the dominance relation ⪰ (a partial ordering), which is defined as:

x ⪰ x′ ⟺ x_j ≥ x′_j, ∀ j = 1..m, x, x′ ∈ ℝ^m   (1)

Thus a function f : ℝ^m → ℝ is defined as monotone if and only if:

x ⪰ x′ ⟹ f(x) ≥ f(x′), ∀ x, x′ ∈ ℝ^m   (2)

This definition describes monotonicity in all features, and if some features are not necessarily monotone, they are sometimes removed prior to analysis (e.g. [6,10]). For this paper we cater for the more general situation of 'partial' monotonicity in some features. We define partial monotonicity similarly to Kamp et al. [9], making the ceteris paribus assumption (all other features being equal):

Definition 1 (Partial Dominance). Given monotone features C ⊆ {1, ..., n}, the partial order ⪯_C over x, x′ ∈ ℝ^n is:

x ⪯_C x′ ⟺ x_j ≤ x′_j ∀ j ∈ C, and x_j = x′_j ∀ j ∈ {1, ..., n}\C   (3)

Definition 2 (Partially Monotone Function). Function f : ℝ^n → ℝ is monotonic in C ⊆ {1, ..., n} if

x ⪯_C x′ ⟹ f(x) ≤ f(x′), ∀ x, x′ ∈ ℝ^n   (4)

This definition may also be considered as a multi-variate and deterministic form of the probabilistic definitions for First Order Stochastic Dominance (FSD) Monotonicity as proposed by Wellman as a qualitative influence (S+) [25]. In this context the classifier, as a deterministic function, has the degenerate probability distribution and cdf(f(x)) ⪰_FSD cdf(f(x′)) ⟺ f(x) ≥ f(x′).
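Definitions 1 and 2 are straightforward to state operationally. The following sketch (illustrative only, written in Python rather than the authors' MATLAB package; the function names are ours) checks the partial dominance relation of Definition 1 and flags a violation of Definition 2 for a pair of feature vectors:

```python
import numpy as np

def partially_dominates(x, x_prime, mono_features):
    """Definition 1: x <=_C x' iff x' is >= x on every monotone feature in C
    and equal to x on every other feature (the ceteris paribus assumption)."""
    x, x_prime = np.asarray(x, float), np.asarray(x_prime, float)
    other = np.setdiff1d(np.arange(x.size), mono_features)
    return (np.all(x[mono_features] <= x_prime[mono_features])
            and np.all(x[other] == x_prime[other]))

def violates_partial_monotonicity(f, x, x_prime, mono_features):
    """Definition 2 is violated when x <=_C x' holds but f(x) > f(x')."""
    return partially_dominates(x, x_prime, mono_features) and f(x) > f(x_prime)
```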
2.2 Monotone Support Vector Machines

The existing monotone SVM techniques are of the 'constrained optimisation' type and use hard discrete constraints on the solution space. The first such SVM was by Pelckmans et al. in 2005 [18] and incorporated inequality constraints on the SVM primal problem. However it was limited to ordered datasets, with all data-points monotone increasing in all features. In 2014 Chen and Li [2] reframed the Pelckmans et al. approach to allow arbitrary constraints, which meant the data points no longer needed to be monotone ascending. Li and Chen also extended the technique to Fuzzy MC-SVM [13]. The constrained SVM used in MC-SVM is as follows. The SVM classifier is f(x) = sign[wᵀψ(x) + b], where ψ : X → H is a transformation into a potentially higher or even infinite dimension space. The optimal model is then:

min_{w,e} (1/2) wᵀw + B Σ_{k=1}^{N} e_k   (5)
subject to y_k (wᵀψ(x_k) + b) ≥ 1 − e_k, k = 1, 2, ..., N
e_k ≥ 0, k = 1, 2, ..., N

Observing that wᵀψ(x̃) ≥ wᵀψ(x) ⟹ f(x̃) ≥ f(x), the constrained SVM takes a set of data point pairs M_C = {(x_i, x̃_i), i = 1..M} and augments the SVM problem with the constraints:

wᵀψ(x̃_i) ≥ wᵀψ(x_i), i = 1, 2, ..., M   (6)

The Lagrangian can then be constructed to include the constraint inequalities using the Karush-Kuhn-Tucker conditions [11] and introducing multipliers β_i in addition to the usual α_k (and ν_k):

L(w, b; α, β, ν) = (1/2) wᵀw + B Σ_{k=1}^{N} e_k − Σ_{k=1}^{N} α_k (y_k [wᵀψ(x_k) + b] − 1 + e_k) − Σ_{i=1}^{M} β_i (wᵀψ(x̃_i) − wᵀψ(x_i)) − Σ_{k=1}^{N} ν_k e_k,   (7)
α_k, ν_k ≥ 0 ∀ k = 1, ..., N, β_i ≥ 0 ∀ i = 1, ..., M

The stationarity condition (setting partial derivatives to zero) reveals that w = Σ_{k=1}^{N} α_k y_k ψ(x_k) + Σ_{i=1}^{M} β_i (ψ(x̃_i) − ψ(x_i)) and ν_k = B − α_k, and the solution for α and β is a quadratic problem (see [2] for full working):

max_{α,β} −(1/2) [αᵀ, βᵀ] G [α; β] + 1ᵀα   (8)
subject to Σ_{k=1}^{N} α_k y_k = 0,
0 ≤ α_k ≤ B, k = 1, 2, ..., N,
β_i ≥ 0, i = 1, 2, ..., M

where G = [G¹¹ G¹²; G²¹ G²²] with
G¹¹_{k,l} = y_k y_l ψ(x_k)ᵀψ(x_l), k, l = 1, ..., N
G¹²_{k,i} = G²¹_{i,k} = y_k (ψ(x̃_i) − ψ(x_i))ᵀψ(x_k), k = 1, ..., N, i = 1, ..., M
G²²_{i,j} = (ψ(x̃_i) − ψ(x_i))ᵀ(ψ(x̃_j) − ψ(x_j)), i, j = 1, ..., M

The unknown mapping ψ is typically not solved directly but instead restricted to come from a dot product space described by a kernel function K(x, x′). As originally shown by Nachman Aronszajn in 1950, positive definite kernel functions K : X × X → ℝ, possessing both symmetry and having positive definite Gram matrices for all possible x_1, ..., x_m ∈ X, must necessarily have a corresponding space H (of dimension n_H) and mapping ψ : X → H where K(x, x′) = ⟨ψ(x), ψ(x′)⟩_H = ψ(x)ᵀψ(x′). As a result, if the solution can be represented in only dot product terms ψ(x)ᵀψ(x′), these can be replaced by K(x, x′), as a result of the 'kernel trick'. The kernel used in this paper is the RBF (Gaussian) kernel (with parameter γ) K_rbf(x, x′) = exp(−‖x − x′‖²/γ). Other common kernels include polynomial, sigmoid and linear. The quadratic problem (8) can then be solved by standard quadratic programming solvers such as MATLAB or R's quadprog. Problem convexity depends on G: if it is strictly positive definite, the solution is global and unique, and if positive semi-definite, the solution is only global. If it is indefinite, it may have no solution. Experimentally the addition of the constraints does sometimes cause G to be indefinite, with one or more negative eigenvalues. In this case, as per [13], we used Tikhonov regularisation [22] to ensure G is positive definite, by adding σI to G, where σ = 2|min(eigenvalues)|.
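Assembling the matrix G of (8) requires only kernel evaluations between the training points and the constraint pairs. The sketch below (illustrative only; it is not the authors' MATLAB implementation, and the helper names are ours) builds the three blocks for the RBF kernel and applies the Tikhonov shift described above; the result can then be passed, together with the linear term and constraints of (8), to any standard QP solver such as quadprog.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K(x, x') = exp(-||x - x'||^2 / gamma) for all row pairs of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / gamma)

def build_qp_matrix(X, y, X_lo, X_hi, gamma):
    """Block matrix G of Eq. (8).

    X, y       : N training points and their +/-1 labels
    X_lo, X_hi : the M constraint pairs (x_i, x~_i), encoding f(x_lo) <= f(x_hi)
    """
    K_xx = rbf_kernel(X, X, gamma)                  # psi(x_k).psi(x_l)
    K_hi = rbf_kernel(X_hi, X, gamma)               # psi(x~_i).psi(x_k)
    K_lo = rbf_kernel(X_lo, X, gamma)               # psi(x_i).psi(x_k)
    G11 = np.outer(y, y) * K_xx
    G12 = (y[None, :] * (K_hi - K_lo)).T            # N x M block
    G22 = (rbf_kernel(X_hi, X_hi, gamma) - rbf_kernel(X_hi, X_lo, gamma)
           - rbf_kernel(X_lo, X_hi, gamma) + rbf_kernel(X_lo, X_lo, gamma))
    G = np.block([[G11, G12], [G12.T, G22]])
    # Tikhonov shift when the added constraints make G indefinite:
    # add sigma*I with sigma = 2*|smallest (negative) eigenvalue|.
    lam_min = np.linalg.eigvalsh(G).min()
    if lam_min < 0:
        G += 2.0 * abs(lam_min) * np.eye(G.shape[0])
    return G
```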
In addition to this constrained SVM algorithm, MC-SVM requires a method for generating the constraint set, and two such algorithms are proposed in [2,13]. Both these techniques are 'conjunctive' in that each pair typically varies in all constrained features. In this paper we refer to these as:
1. Randomised Conjunctive (CJ1) [2]: M constraints are created by repeating the following M times: (a) select a random subset of m training data-points (m was unspecified and was set to 5 for our experiments); (b) set the constrained features of x_i to the minimum values in the subset, and x̃_i to the maximum values. (It was unclear what was used for the unconstrained features. We used the values of one of the m data-points selected at random.)
2. Uniform Partition Conjunctive (CJ2) [13]: This technique generates M constraints by partitioning each constrained feature into M equal partitions between its minimum and maximum. (The values used for unconstrained features were unclear, so to influence the densest part of the input space we set these to the data-point closest (in Euclidean distance) to the centroid as determined by a 1-cluster normalised k-means analysis.)
We evaluate these MC-SVM techniques in Sect. 4. For reasons discussed in Sect. 3.1, we found that these techniques only increased monotonicity weakly, and as a result accuracy improvements were also low.

3 Partially Monotone Support Vector Machines

3.1 PM-SVM Technique

To address the limitations of the conjunctively constrained SVM by Li and Chen we propose the following procedure for PM-SVM:
1. Identify likely monotone features C from domain expertise or common sense;
2. Create constraints for the identified monotone features C, using one of the constraint generation techniques proposed in Algorithms 1 and 2;
3. Estimate optimal hyper-parameters (e.g. for the SVM box constraint B and RBF kernel γ), for example by grid search and cross-validation;
4. Solve the monotonicity constrained C-SVM as per [2], using the training data and constraint set.
PM-SVM differs from the implementation in [2] by proposing a new constraint generation technique designed to more efficiently achieve monotonicity. We start by observing that the inherent problem with attempting to discretise dominance relations throughout ℝ^m in p monotone features is the sheer dimensionality of the space. The dimensionality is (m − p) + p + p = (m + p), to allow for all monotone combinations of the p monotone feature values throughout the (m − p)-dimensional non-monotone feature space. The existing approaches CJ1 and CJ2 (Sect. 2.2) attempt to cover this space, using conjunctive constraints where all p monotone features increase between the first and second data point in each constraint. This is a critical issue, because the addition of M constraints to the SVM optimisation extends the matrix inversion by M, and thus has worst case complexity O((N + M)^3), where N is the number of data-points, and so increasing M can quickly make the problem practically infeasible. However, the actual dimensionality of the space to be covered is much lower. In Definition 2, x ⪯_C x′ may be usefully considered to be made up of two cases: (a) strictly univariate changes (when exactly one feature c_i ∈ C increases between x and x′), and (b) strictly conjunctive changes (when more than one feature {c_i, c_j, ...} ⊆ C increases). Then it is straightforward to show that univariate monotonicity in all c_i ∈ C is equivalent to the conjunctive monotonicity in any {c_i, ...} ⊆ C, as per Theorem 1:
Theorem 1 (Equivalence of Univariate & Conjunctive Monotonicity). Given monotone features C ⊆ {1, ..., m} and function f : ℝ^m → ℝ, the condition for conjunctive partial monotonicity in (4) is equivalent to:

x ⪯_{c_i} x′ ⟹ f(x) ≤ f(x′), ∀ x, x′ ∈ ℝ^m, c_i ∈ C   (9)

Proof: (i) (4) ⟹ (9): this follows since each c ∈ C is simply a special case of (3), where x_j = x′_j for j ∈ (C\c). In other words, univariate monotonicity in each monotone feature is simply a subset of conjunctive monotonicity. (ii) (9) ⟹ (4): Let function f : ℝ^m → ℝ comply with (9) for univariate monotonicity for all c ∈ C. For convenience and without loss of generality let C = {1, 2, ..., c_max}, c_max ≤ m. Let x = (x_1, x_2, ..., x_{c_max}, ..., x_{m−1}, x_m) and x′ = (x_1 + δ_1, x_2 + δ_2, ..., x_{c_max} + δ_{c_max}, ..., x_{m−1}, x_m), where δ_c ≥ 0. It is apparent that x ⪯_C x′, ∀ x, x′ ∈ ℝ^m, thus encompassing (4). We will now show that f(x′) ≥ f(x) using only (9):

f(x′) = f((x_1 + δ_1, x_2 + δ_2, ..., x_{c_max} + δ_{c_max}, ..., x_{m−1}, x_m))
= f((x_1, x_2 + δ_2, ..., x_{c_max} + δ_{c_max}, ..., x_{m−1}, x_m) + (δ_1, 0, ..., 0, ..., 0, 0))
≥ f((x_1, x_2 + δ_2, ..., x_{c_max} + δ_{c_max}, ..., x_{m−1}, x_m))
= f((x_1, x_2, ..., x_{c_max} + δ_{c_max}, ..., x_{m−1}, x_m) + (0, δ_2, ..., 0, ..., 0, 0))
≥ f((x_1, x_2, ..., x_{c_max} + δ_{c_max}, ..., x_{m−1}, x_m))
...
≥ f((x_1, x_2, ..., x_{c_max}, ..., x_{m−1}, x_m)) = f(x)

Thus compliance with conjunctive monotonicity (4) in monotone features C for any x, x′ ∈ ℝ^m, x ⪯_C x′, is demonstrated for any function f : ℝ^m → ℝ complying with univariate monotonicity (9).

Thus it is sufficient to aim for univariate monotonicity in each monotone feature, with dimension (m − 1) + 1 + 1 = (m + 1). This means that any strictly conjunctive constraints (where more than one feature increases) are redundant in the presence of univariate constraints. Hence we propose techniques for univariate constraint generation. MATLAB implementations are available on GitHub. The first technique (Algorithm 1) simply selects T random training points {x_i | i = t_1..t_T} for each constrained feature c_j and creates two constraints per point, a 'lower' one between (x_{i,1}, ..., min_i(x_{i,c_j}), ..., x_{i,m}) and x_i, and a 'higher' one between x_i and (x_{i,1}, ..., max_i(x_{i,c_j}), ..., x_{i,m}). We could simply set T to the number of training points for maximum coverage, but for the experimental data sets we found that monotonicity was adequate with T = 25.

Algorithm 1. Univariate Random Constraint Generation (UNR)
Input: Training data {x_i = (x_{i,1}, ..., x_{i,m}) | i = 1..N}, monotone features C = {c_j | j = 1..p}, number of base points per monotone feature T, T < N.
Output: Set of M constraint pairs {(x_k, x̃_k) | k = 1..M}, M ≤ 2pT
1: Initialise set of constraints τ = {}
2: for each c_j ∈ C do
3: Calculate minimum and maximum values for c_j (min_i(x_{i,c_j}), max_i(x_{i,c_j}))
4: Choose T points at random from training data {(x_i, y_i) | i = t_1..t_T}
5: Append T lower constraints to τ: {((x_{i,1}, ..., min_i(x_{i,c_j}), ..., x_{i,m}), x_i) | i = t_1..t_T}
6: Append T upper constraints to τ: {(x_i, (x_{i,1}, ..., max_i(x_{i,c_j}), ..., x_{i,m})) | i = t_1..t_T}
7: end for
8: Remove any duplicate constraints from τ (with identical x_k and x̃_k)
9: Return τ
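Algorithm 1 is compact enough to sketch directly. The following Python version (an illustrative re-implementation under our own naming, not the authors' MATLAB code) returns the constraint pairs as two arrays whose rows can be fed straight into the Gram-block construction of (8):

```python
import numpy as np

def unr_constraints(X, mono_features, T=25, rng=None):
    """Algorithm 1 (UNR): for each monotone feature, choose T random base
    points; pair each base point with a copy pushed down to the feature
    minimum (lower constraint) and up to the feature maximum (upper
    constraint). Each returned pair (a, b) encodes the requirement f(a) <= f(b)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    pairs = []
    for c in mono_features:
        f_min, f_max = X[:, c].min(), X[:, c].max()
        idx = rng.choice(len(X), size=min(T, len(X)), replace=False)
        for x in X[idx]:
            x_lo, x_hi = x.copy(), x.copy()
            x_lo[c], x_hi[c] = f_min, f_max
            pairs.append((x_lo, x))   # 'lower' constraint
            pairs.append((x, x_hi))   # 'upper' constraint
    # drop degenerate pairs where the base point is already at the extreme;
    # full duplicate removal (step 8 of Algorithm 1) is omitted for brevity
    pairs = [(a, b) for a, b in pairs if not np.array_equal(a, b)]
    X_lo = np.array([a for a, _ in pairs])
    X_hi = np.array([b for _, b in pairs])
    return X_lo, X_hi
```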
The second technique (Algorithm 2 - Adaptive Constraint Generation) uses the unconstrained SVM model to identify non-monotone regions and designs constraints to address them. Figure 1 illustrates what we term a 'Non-Monotone Region' (NMR) around a point.

Fig. 1. Adaptive Constraint Generation. Around each point that is non-monotone in feature X_i is a 'Non-Monotone Region' (NMR) composed of an 'Active' NMR (containing the point) and a 'Passive' NMR (of the alternative class). An adaptive constraint is created between the non-monotone point and the midpoint of its 'Passive' NMR.

Note from this figure that given f(x) is a binary function ℝ^m → {1, 2}, and that f(x_nmt) = 2 and feature i is monotone increasing, we only need to look for non-monotonicity in values of feature i that are higher than x_{nmt,i}. Evaluating f(x) for higher values of feature i identifies a non-monotone hyperplane (when f(x) = 1), until at higher values we reach the 'extent' hyperplane (when f(x) = 2 again). To correct this non-monotone region, either the 'active NMR' (containing x_nmt) or the 'passive NMR' (not containing x_nmt) needs to fully reverse. We can encourage this correction by placing constraints between the active and passive NMRs. Although a number of discrete constraints could be used to span these regions and ensure non-monotone 'islands' do not appear in the gaps, in practice we found it sufficient to place a single constraint between the non-monotone point and the mid-point of the passive NMR. We do note however that it is important to include x_nmt in the constraint, because the sparse SVM solution may select x_nmt as a support vector and indeed create a non-monotone island.

Algorithm 2. Adaptive Constraint Generation (AD)
Input: Training data {x_i = (x_{i,1}, ..., x_{i,m}) | i = 1..N}, monotone features C = {c_j | j = 1..p}, trained classifier f(x), continuous variable partitions L > 0
Output: Set of M constraint pairs {(x_k, x̃_k) | k = 1..M}, M ≤ N
1: Initialise empty set of constraints τ = {}
2: for each c_j ∈ C do
3: Calculate minimum and maximum values for c_j (min_i(x_{i,c_j}), max_i(x_{i,c_j}))
4: Set partition width P = (max_i(x_{i,c_j}) − min_i(x_{i,c_j}))/L
5: Set feature grid vector G = (min_i(x_{i,c_j}) : P : max_i(x_{i,c_j}))
6: for each x_i (i = 1..N) do
7: Set potential non-monotone direction ψ_d as > (above) if f(x_i) > 0, else < (below)
8: Initialise class = f(x_i), hyperplane = NULL, extent = NULL
9: for each G_i with G_i ψ_d x_{i,c_j} do
10: if f((x_{i,1}, ..., x_{i,c_j} = G_i, ..., x_{i,m})) ≠ class then
11: if hyperplane IS NULL then
12: hyperplane = G_i, class = f((x_{i,1}, ..., x_{i,c_j} = G_i, ..., x_{i,m}))
13: else
14: extent = G_i
15: end if
16: end if
17: if NOT hyperplane IS NULL then
18: if extent IS NULL then extent = G_1 or (G_{L+1}) as appropriate
19: Set midpt = (hyperplane + extent)/2
20: Append constraint to τ: (x_i, (x_{i,1}, ..., x_{c_j} = midpt, ..., x_{i,m}))
21: end if
22: end for
23: end for
24: end for
25: Return τ

3.2 Measurement of Partial Monotonicity

The practicality of this approach depends on achieving satisfactory levels of monotonicity with reasonable numbers of constraints. To evaluate this we need a measure for the partial monotonicity of a classifier. The dominant measure in the classification literature is to conduct all possible pairwise comparisons of the data points and their predictions. The measure of (non)monotonicity is the proportion of (non)violations, and has various names such as Frequency Monotonicity Rate (FMR) [2,13], degree of monotonicity (NmDeg) [6,21], Nonmonotonicity Index (NMHI) [16,17], and fraction of monotone pairs [5,23]. Each of the data point comparisons (N² − N, excluding self-comparisons) is defined as 'comparable' if the points comply with the dominance relation (1):

x comparable with x′ ⟺ (x_j ≥ x′_j, ∀ j = 1..m) OR (x_j ≤ x′_j, ∀ j = 1..m)   (10)

The 'incomparable' pairs (IP) are ignored. The comparable pairs (CP) are then assessed for compliance with monotonicity of the class predicted by the function or given by the data set (Eq. 2). Comparable pairs that do not comply with this equation (NM) are deemed non-monotone, and typically these are represented as a proportion of the total number of comparisons, for example the Non-Monotonicity Index (NMI), which is given by NMI = NM/(N² + N). This measure is useful for monotonicity in all features, but in practice cannot be used for our situation of partial monotonicity, because in addition to requiring that the constrained features are either all greater than (or less than) or equal, we must modify Eq. 10 to require equality in all unconstrained features. This dramatically reduces the number of comparable pairs, such that most of the datasets in this paper had a comparability of 0.0 %. In practice even one continuous unconstrained feature will typically reduce comparability to zero. Even if equality is relaxed, comparability remains low, such as the Pima dataset [14], which only has categorical variables and still has comparability of only 4.6 %. This makes the NMI measure either impossible, or at least unreliable, in partially monotone situations.
We thus propose a new measure of function monotonicity we call Monotonicity Compliance (MCC). Applying Theorem 1, we observe that if a function possesses univariate partial monotonicity in all constrained features c_i ∈ C (i = 1..p), it also possesses conjunctive monotonicity for any subset C_s ⊆ C. Thus we propose to simply measure univariate partial monotonicity for each c_i ∈ C (MCC_{c_i}), and use the feature average MCC_featavg = (1/p) Σ_{i=1}^{p} MCC_{c_i} as a measure of overall compliance. MCC is intuitively defined as the proportion of the input space where the requested monotonicity constraints are not violated, weighted by the joint probability distribution of the input space. This aims to provide a practical measure of how monotone a function is within the expected input space.

Definition 3 (Monotonicity Compliance (MCC)). For function f(x) : ℝ^m → ℝ, x ∈ X, with monotone constraints on features C = {c_1, ..., c_p} ⊆ {1, ..., m}, and with joint probability density function P(x), the Monotonicity Compliance of f with respect to constrained feature c_i is

MCC_{c_i}(f) = ∫ ··· ∫_X P(x_1, ..., x_m) m_{c_i}(f, x_1, ..., x_m) dx_1 ... dx_m   (11)

where, for increasing monotonicity (for decreasing, reverse as appropriate):

m_{c_i}(f, x) = 1, if m⁺_{c_i}(f, x) ≥ 0 and m⁻_{c_i}(f, x) ≥ 0;
m_{c_i}(f, x) = 1/2, if (m⁺_{c_i}(f, x), m⁻_{c_i}(f, x)) ∈ {(−1, +1), (+1, −1)};
m_{c_i}(f, x) = 0, otherwise   (12)

m⁺_{c_i}(f, x) = +1, if ∃ Q > 0 s.t. f(x) < f(x_1, ..., x_{c_i} + Q, ..., x_m) and f(x) = f(x_1, ..., x_{c_i} + q, ..., x_m) ∀ 0 < q < Q;
m⁺_{c_i}(f, x) = 0, if f(x) = f(x_1, ..., x_{c_i} + q, ..., x_m) ∀ q > 0;
m⁺_{c_i}(f, x) = −1, otherwise   (13)

m⁻_{c_i}(f, x) = +1, if ∃ Q > 0 s.t. f(x) > f(x_1, ..., x_{c_i} − Q, ..., x_m) and f(x) = f(x_1, ..., x_{c_i} − q, ..., x_m) ∀ 0 < q < Q;
m⁻_{c_i}(f, x) = 0, if f(x) = f(x_1, ..., x_{c_i} − q, ..., x_m) ∀ q > 0;
m⁻_{c_i}(f, x) = −1, otherwise   (14)

Essentially the m_{c_i}(x) function looks in the positive and negative directions of x_{c_i} for the first change in f (if any). If in both directions the function f either does not change, or changes in the correct direction, m_{c_i}(x) returns 1 (a monotone point). If the change in one direction is correct and the other is incorrect, m_{c_i}(x) returns 1/2 (it is 'half' monotone). Otherwise, the point is non-monotone and m_{c_i}(x) = 0. In practice P(x) is unknown, but for a given data set X̂ of size N, MCC_{c_i}(f) can be simply estimated by the plug-in estimate:

M̂CC_{c_i}(f) = (1/N) Σ_{i=1}^{N} m_{c_i}(f, x_i)   (15)

MCC is analogous to the partial derivative based technique in [4], but for non-continuous and non-differentiable functions. Although the partial derivative of the real valued SVM function (prior to taking the sign) could have been used, we preferred to evaluate class changes (i.e. after taking the sign). Partial derivatives are a poor indicator of classifier non-monotonicity because a negative slope does not necessarily result in a change in class (sign) over the input space.
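The plug-in estimate (15) is simple to compute for a trained classifier by scanning each monotone feature over a grid. The following sketch (illustrative only; the grid resolution and scan range are our own assumptions, as the paper does not fix them in this excerpt) implements m_{c_i} and the resulting MCC estimate for one feature:

```python
import numpy as np

def mcc_feature(predict, X, c, increasing=True, n_grid=50):
    """Plug-in estimate of MCC_c (Eq. 15): the average of m_c(f, x_i) over the
    sample, with the directional indicators of Eqs. (13)-(14) approximated by
    scanning a grid over the observed range of feature c."""
    X = np.asarray(X, dtype=float)
    grid = np.linspace(X[:, c].min(), X[:, c].max(), n_grid)
    sign = 1 if increasing else -1

    def first_change(x, direction):
        """+1: first change of f is in the correct direction, 0: no change,
        -1: first change is in the wrong direction."""
        vals = grid[grid > x[c]] if direction > 0 else grid[grid < x[c]][::-1]
        y0 = predict(x.reshape(1, -1))[0]
        for v in vals:
            z = x.copy()
            z[c] = v
            y = predict(z.reshape(1, -1))[0]
            if y != y0:
                correct = (y > y0) if direction * sign > 0 else (y < y0)
                return 1 if correct else -1
        return 0

    scores = []
    for x in X:
        up, down = first_change(x, +1), first_change(x, -1)
        if up >= 0 and down >= 0:
            scores.append(1.0)        # monotone point, Eq. (12) first case
        elif {up, down} == {+1, -1}:
            scores.append(0.5)        # 'half' monotone
        else:
            scores.append(0.0)        # non-monotone
    return float(np.mean(scores))
```

Here `predict` stands for any classifier's class-prediction function (e.g. the sign output of the constrained SVM); averaging `mcc_feature` over the constrained features gives MCC_featavg.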
4 Experiments and Datasets

Datasets. Nine datasets were used as described in Table 1, from the UCI Machine Learning Repository [14] and the KEEL Dataset Repository [1]. Rows with missing values were removed.

Table 1. Dataset summary table.

Experiment Design. For each dataset we conducted 50 experiments. The RBF kernel and the MATLAB quadprog solver were used on an Intel i7-4790 8 GB RAM PC. For each experiment 2/3 of the available data points were randomly selected as the maximum training partition. This training partition was sequentially randomly sub-sampled down to 200, 100, and 50 data point samples to assess the effect of sample size. All sampling was stratified to retain class distribution. For each sample size, the remainder of the available data were used as the test partition. The same training/test partitions and CV partitions were used for all constraint techniques to ensure a fair comparison. Four constraint techniques were evaluated: the existing approaches CJ1 and CJ2, and the two proposed approaches UNR and AD. CJ2 was almost indistinguishable from unconstrained SVM and so the results were omitted for clarity. For UNR, T = 25 was used, resulting in a maximum of 2pT constraints. For CJ1, 2pT constraints were used, to enable like-for-like comparison with UNR. For AD, the number of constraints varies depending on the non-monotone regions identified. Each experiment proceeded as follows:
1. Estimate optimal standard SVM hyper-parameters: The RBF scale factor σ and SVM box constraint B were estimated using grid search on C ∈ {0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000} and σ ∈ {0.001, 0.01, 0.1, 1, 5, 10, 15, 25, 50, 100, 250, 500, 1000, 5000}. The optimal (C, σ) pair had the minimum MCR as determined by stratified 10-fold cross-validation (a sketch of such a search is given after this list).
2. Fit SVM model: The unconstrained SVM model is fitted to the whole training partition using the optimal hyper-parameters.
3. For each constraint algorithm:
– Create constraints: Constraints were generated based on the training data as per the appropriate algorithm.
– Fit constrained SVM model: The PM-SVM model was fitted to the training partition using the constraints and optimal SVM hyper-parameters.
– Estimate performance: Accuracy and monotonicity metrics were both calculated on the test partition.
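For step 1, a hedged stand-in for the paper's MATLAB grid search is sketched below using scikit-learn; note that sklearn's RBF parameter `gamma` is an inverse length-scale, so the grid values shown are only illustrative and not literally the paper's σ values:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative hyper-parameter search for the unconstrained RBF SVM.
param_grid = {
    "C": [1e-4, 1e-3, 1e-2, 0.1, 1, 5, 10, 50, 100, 500, 1e3, 5e3, 1e4, 5e4, 1e5],
    "gamma": [1e-3, 1e-2, 0.1, 1, 5, 10, 15, 25, 50, 100, 250, 500, 1e3, 5e3],
}
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",   # maximising accuracy = minimising misclassification rate
    n_jobs=-1,
)
# Given training arrays X_train, y_train (hypothetical names):
#   search.fit(X_train, y_train); search.best_params_ -> optimal (C, gamma)
```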
5 Results and Discussion

Classifier Partial Monotonicity. Figure 2 shows the impact of the constraint technique and sample size on classifier monotonicity (MCC_featavg). Firstly we note that the unconstrained Car and WBCDiag models were almost perfectly monotone with no constraints! For the other datasets, it is clear that the two proposed approaches are more effective at inducing a monotone classifier than the existing CJ1 approach for all datasets. In fact CJ1 does not noticeably increase monotonicity except for SA Heart, Ljubjlana, Autompg and Haberman, where the increases were less than half those of the proposed techniques. It is most effective on Haberman, which is almost certainly due to that dataset's low dimensionality (n = 3 features), which makes it more likely to generate effective univariate constraints. In contrast, the proposed random univariate UNR technique lifts monotonicity above 99 % for all data sets and sample sizes except Autompg and Haberman, where it is 97–98 %. We emphasise that the monotonicity MCC measures were based on the test partition, so offer unbiased estimates. We find it somewhat surprising that so few discretised constraints (based on T = 25 random base points per constrained feature) can be so effective at inducing monotonicity throughout the input space. The Adaptive AD technique is generally similar to UNR, except for low sample sizes in Cleveland and SA Heart it is slightly less effective, and for higher sample sizes on Haberman it is more effective.

Fig. 2. MCC (feat avg) vs Sample Size.
Fig. 3. Cohen's κ vs Sample Size. 90 % confidence interval shown.

Impact of Partial Monotonicity on Accuracy. The performance metric Cohen's Kappa is summarised in Fig. 3. For the Car and WBCDiag datasets, the monotone models make negligible difference. This is because the unconstrained SVM is almost perfectly monotone (MCC_featavg approaching 1, Fig. 2), so there is little room for improvement. Table 2 shows the experiment-wise increase in Accuracy and Cohen's Kappa for N = 50, because monotonicity is most likely to improve accuracy at low sample sizes [9]. Apart from the Car and WBCDiag datasets (which are already virtually monotone without constraints), statistically significant accuracy and/or κ increases are seen for all remaining datasets. The existing CJ1 increases in κ and accuracy are typically a third or a half or less of those of the proposed approaches, respectively.

Table 2. Accuracy summary for N = 50. Shading shows significance at p = 0.05.

The maximum increase is achieved for the Cleveland dataset with an increase in accuracy of 2.7 % (UNR) or 2.8 % (AD). The sample size effect can be seen in Fig. 3. Broadly the datasets can be grouped into three types. Car and WBCDiag show little improvement as discussed. Cleveland, SA Heart and German datasets show diminishing returns as the sample size increases, which is typically expected for incorporating domain knowledge. Interestingly, for the remaining datasets (Ljubjlana, Autompg, Haberman, Pima) the benefit from monotonicity is maintained with increasing sample size, even when the maximum training data is used, suggesting perhaps that noise is obscuring the monotone relationships rather than sample size. Comparing the proposed Adaptive (AD) and Randomised (UNR) approaches, although AD induced slightly lower monotonicity at smaller sample sizes, we note that the accuracy improvements remained similar or better.
In addition AD has the benefits of (a) eliminating the question of how many constraints to use, and (b) a much lower number of constraints (typically 10– 20% in these experiments), which is particularly beneficial at higher sample sizes because of the polynomial cost of adding constraints (O(N + M )3 ). 6 Conclusions We have presented two new constraint generation techniques for incorporating domain knowledge regarding partial monotonicity into kernel SVM classifiers, and demonstrated that they are effective using a new measure for partial monotonicity. The techniques involve the creation of discretised constraint sets by either a random (Algorithm 1) or adaptive approach (Algorithm 2). In contrast to existing techniques, monotonicity is substantially increased with a reasonable number of constraints. This results in increases in accuracy in seven of the nine datasets, particularly at lower sample sizes. The lack of Effective Monotone SVM 17 increased accuracy in the remaining two datasets was also shown to be because the unconstrained model was already almost perfectly monotone and there was little room for improvement. Although discretised constraint approaches are not ideal in that they do not guarantee global monotonicity in the same way that model specification would, we believe the inelegance of our approach is worthwhile and necessary to extend such a powerful family nonlinear nonparametric classifiers. We note that the accuracy increases are particularly significant because RBF kernel C-SVM is already one of the most powerful classification algorithms [8]. Future exciting research avenues are extensions to multi-class, comparison with other monotone classifiers, and development of a more sophisticated monotone feature selection. References 1. Alcalá, J., Fernández, A., Luengo, J., Derrac, J., Garcı́a, S., Sánchez, L., Herrera, F.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2010) 2. Chen, C.C., Li, S.T.: Credit rating with a monotonicity-constrained support vector machine model. Expert Syst. Appl. 41(16), 7235–7247 (2014) 3. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995) 4. Daniels, H., Kamp, B.: Application of MLP networks to bond rating and house pricing. Neural Comput. Appl. 8(3), 226–234 (1999) 5. Duivesteijn, W., Feelders, A.: Nearest neighbour classification with monotonicity constraints. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5211, pp. 301–316. Springer, Heidelberg (2008). doi:10.1007/ 978-3-540-87479-9 38 6. Feelders, A., Pardoel, M.: Pruning for monotone classification trees. In: R. Berthold, M., Lenz, H.-J., Bradley, E., Kruse, R., Borgelt, C. (eds.) IDA 2003. LNCS, vol. 2810, pp. 1–12. Springer, Heidelberg (2003). doi:10.1007/ 978-3-540-45231-7 1 7. Feelders, A.J.: Prior knowledge in economic applications of data mining. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 395–400. Springer, Heidelberg (2000). doi:10.1007/3-540-45372-5 42 8. Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15(1), 3133–3181 (2014) 9. Kamp, R., Feelders, A., Barile, N.: Isotonic classification trees. In: Adams, N.M., Robardet, C., Siebes, A., Boulicaut, J.-F. (eds.) IDA 2009. LNCS, vol. 5772, pp. 405–416. Springer, Heidelberg (2009). 
doi:10.1007/978-3-642-03915-7 35 10. Kotlowski, W.: Statistical approach to ordinal classification with monotonicity constraints. Ph.D. thesis, Pozna Univ of Techn Inst of Computing Science (2008) 11. Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. pp. 481–492. University of California Press, Berkeley, Calif. (1951) 12. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: a review. Neurocomputing 71(7), 1578–1594 (2008) 18 C. Bartley et al. 13. Li, S.T., Chen, C.C.: A regularized monotonic fuzzy support vector machine for data mining with prior knowledge. IEEE Trans. Fuzzy Syst. PP(99) (2014) 14. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ ml 15. Mangasarian, O.L., Wild, E.W.: Nonlinear knowledge-based classification. IEEE Trans. Neural Netw. 19(10), 1826–1832 (2008) 16. Marsala, C., Petturiti, D.: Rank discrimination measures for enforcing monotonicity in decision tree induction. Inf. Sci. 291, 143–171 (2015) 17. Milstein, I., David, A.B., Potharst, R.: Generating noisy monotone ordinal datasets. Artif. Intell. Res. 3(1), p30 (2013) 18. Pelckmans, K., Espinoza, M., De Brabanter, J., Suykens, J.A., De Moor, B.: Primal-dual monotone kernel regression. Neural Proc. Letters 22(2), 171–182 (2005) 19. Potharst, R., Ben-David, A., van Wezel, M.: Two algorithms for generating structured and unstructured monotone ordinal data sets. Eng. App. Art. Intell. 22(4), 491–496 (2009) 20. Potharst, R., Bioch, J.C.: Decision trees for ordinal classification. Intell. Data Anal. 4(2), 97–111 (2000) 21. Potharst, R., Feelders, A.J.: Classification trees for problems with monotonicity constraints. ACM SIGKDD Explor. Newsletter 4(1), 1–10 (2002) 22. Tikhonov, A., Arsenin, V.: Solutions of ill-posed problems. Scripta series in mathematics, Winston (1977) 23. Velikova, M., Daniels, H.: Decision trees for monotone price models. CMS 1(3–4), 231–244 (2004) 24. Wang, Y., Ni, H.: Multivariate convex support vector regression with semidefinite programming. Knowl.-Based Syst. 30, 87–94 (2012) 25. Wellman, M.P.: Fundamental concepts of qualitative probabilistic networks. Artif. Intell. 44(3), 257–303 (1990) Textual Cues for Online Depression in Community and Personal Settings Thin Nguyen(B) , Svetha Venkatesh, and Dinh Phung Deakin University, Geelong, Australia {thin.nguyen,svetha.venkatesh,dinh.phung}@deakin.edu.au Abstract. Depression is often associated with poor social skills. The Internet allows individuals who are depressed to connect with others via online communities, helping them to address the social skill deficit. While the difficulty of collecting data in traditional studies raises a bar for investigating the cues of depression, the user-generated media left by depression sufferers on social media enable us to learn more about depression signs. Previous studies examined the traces left in the posts of online depression communities in comparison with other online communities. This work further investigates if the content that members of the depression community contribute to the community blogs different from what they make in their own personal blogs? The answer to this question would help to improve the performance of online depression screening for different blogging settings. The content made in the two settings were compared in three textual features: affective information, topics, and language styles. 
Machine learning and statistical methods were used to discriminate the blog content. All three feature types were found to differ significantly between depression Community and Personal blogs. Notably, topic and language-style features, used either separately or jointly, show strong indicative power in predicting depression blogs in personal or community settings, illustrating the potential of content-based multi-cues for early screening of online depression communities and individuals.
Keywords: Computer mediated communication · Weblog · Social media analysis · Mental health · Textual cues · Affective norms · Language styles · Topics
1 Introduction
In their lifetime, 12.1 % and 4.1 % of people have suicide ideation and attempts, respectively [15]. Psychiatric disorders were found in 90 % of suicide victims, and the relative risk of suicide among those with such conditions was increased more than 9-fold [3]. Of these disorders, depression is among the major risk factors. Knowing the signs of depression would help detect sufferers early and ultimately help alleviate the suicide crisis.
The signals could be non-verbal activities, such as "lack of eye contact", "head drooped, looking at ground", and "mouth turned down", which were found to be salient cues for depression [22]. Other markers of depression are vocal characteristics [11]. Social skills have also been found to be an indicator of depression [7]. Another aspect, language style, such as the use of sadness or swearing words, has been identified as a depression cue in personal diaries and online blogs [19]. Most of these studies were conducted at small scale; in the study on linguistic cues [19], for example, the text was collected from fifty-seven participants. On the other hand, the Internet, and the social media and networking systems built on it, offer rich sources for investigating causes and cues of depression at large scale. The new media provide an excellent venue for individuals, including depression sufferers, to express their ideas and for like-minded people to share stories. This unintentionally leaves behind a huge amount of data that is often difficult to collect in traditional studies. In this study, we collect data from depression.livejournal.com, the largest Live Journal community interested in "depression".
Investigations into online communities have often been conducted at the community level, such as studies of social capital and mood patterns [13,18]. However, it remains questionable whether blogging by members of depression communities in their personal blogs differs from blogging in the community blogs. For example, fillers or swearing are expected to be used more in personal than in community pages, probably suggesting different weights for the cues when predicting depression in the two settings. This paper examines whether the content that members of the depression community contribute to the community blogs (Community) differs from what they post in their own personal blogs (Personal). In this study, the content is seen in three aspects: the topics discussed, the language styles expressed, and the affective information conveyed. These features have been used as the basis for detecting online communities [12]; in particular, they have been found to be strong predictors of autism and to differentiate mental health communities from other online communities [14].
The features can be considered as potential text-based cues for depression. We present an analysis of a large-scale cohort of data authored by more than 4,000 members of the depression community. The content made by members of the depression community in their own personal blogs is also examined in comparison with what they post in the community blog. Collecting data unobtrusively from online sources, such as the online depression communities and individuals in this study, provides a valuable alternative approach to research in mental health, avoiding the privacy issues of traditional data collection for clinical studies.
The main contributions of this work are: (1) to introduce a relatively comprehensive view of the content made in depression personal and community blogs, regarding three aspects: sentiment information, language style, and topics of interest; (2) to propose an efficient approach that selects features and performs regression simultaneously, providing a set of powerful predictors of depression blogs in personal and community settings; and (3) to provide statistically reliable empirical evidence through a data-driven approach, in contrast to small-scale questionnaire-based methods in psychology. The results show the potential of the new media for screening and monitoring online depression blogging in both individual and community contexts. In addition, the same framework could be employed for at-risk individuals and communities. In a broad sense, this work demonstrates the application of machine learning to medical practice and research.
The remainder of this paper is organized as follows. Section 2 describes the methodology. Section 3 presents the differences between blogging in personal and community settings. Section 4 shows the performance of the classification of Community vs. Personal blogs. Section 5 discusses the limitations of the work and Sect. 6 concludes the paper.
2 Method
2.1 Datasets
Live Journal data was chosen for this paper since the platform allows people to create or join communities of interest alongside their own personal pages. In particular, to examine blogging within online depression communities, data from the largest Live Journal community interested in "depression" – depression.livejournal.com – was crawled. The community is described in its profile as "a safe and open community for those experiencing depression". The community was founded in December 2000 and, as of September 2015, had more than 7,000 members and 40,000 posts. This is the Community cohort in this study. We then construct a control dataset. The posts made in the personal blogs of members of the depression community were also crawled. This is the Personal cohort. Only members who had made posts in both cohorts were taken into the experiments, resulting in a corpus of 25,012 community posts and 104,033 personal posts made by 4,439 users. To create a balanced dataset, the same number of Personal posts as the smaller Community count (25,012) was randomly selected into the study, resulting in a corpus of 50,024 posts.
2.2 Feature Sets
To characterise the posts made in community or personal blogs, three types of features are extracted: (1) Topics: what is discussed in the posts? (2) Language styles: how are the posts expressed? and (3) Expressed emotion: what affective information is conveyed in the content?
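In other words, each post is ultimately represented by concatenating these three feature groups into one vector. The sketch below is purely illustrative; the three extractor callables are hypothetical stand-ins for the ANEW, LIWC, and LDA tools detailed next, not code from the study.

```python
import numpy as np

def build_design_matrix(posts, extract_anew, extract_liwc, extract_topics):
    """Illustrative sketch: one row per post, concatenating the three feature
    groups (3 ANEW scores, 68 LIWC proportions, 50 LDA topic proportions)."""
    rows = []
    for text in posts:
        rows.append(np.concatenate([
            extract_anew(text),    # [valence, arousal, dominance]
            extract_liwc(text),    # proportions of LIWC categories
            extract_topics(text),  # LDA topic distribution of the post
        ]))
    return np.vstack(rows)
```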
– Affective information: For the affective aspect, we used the ANEW lexicon [2] to extract the sentiment conveyed in the content. Words in this lexicon are rated in terms of valence, arousal, and dominance. The valence of a word is rated on a scale from one (very unpleasant) to nine (very pleasant). Arousal is measured on the same scale, from one (least active) to nine (most active). Dominance is also on the same scale, ranging from submissive to dominant.
– LIWC features: We examined the proportions of words in the psycho-linguistic categories defined in the LIWC package [17]: linguistic, social, affective, cognitive, perceptual, biological, relativity, personal concerns and spoken (see http://www.liwc.net/descriptiontable1.php, retrieved Sept. 2015; cached: http://bit.ly/1PPbeSv).
– Topics: To extract topics, latent Dirichlet allocation (LDA) [1] was used as a Bayesian probabilistic modelling framework. LDA extracts the probabilities p(word | topic) – that is, the word distribution of each topic – and then assigns a topic to each word in a document. For inference, we implemented the Gibbs sampling procedure detailed in [10]. We set the number of topics to 50, ran the Gibbs sampler for 5,000 iterations, and used the last sample to interpret the results.
2.3 Statistical Testing
Statistical tests were conducted to examine the difference between Community and Personal blogs in the use of each feature. For each of the 50 topics, 68 linguistic categories, and three ANEW sentiment scores, the following two null hypotheses were tested:
– H1: mean_comm = mean_pers — for the tested feature, the data produced by the two examined populations, Community vs. Personal blogs, are samples from normal distributions with equal means. This hypothesis is tested using the two-sample t-test, a parametric test.
– H2: median_comm = median_pers — for the tested feature, the data produced by the two examined populations, Community vs. Personal blogs, are samples from continuous distributions with equal medians. This hypothesis is tested using the Wilcoxon rank-sum test, a non-parametric test.
The relative difference in the use of a feature is determined as
diff = (mean_comm − mean_pers) / ((mean_comm + mean_pers) / 2) × 100    (1)
If diff > 0, the feature is used more in Community than in Personal, and vice versa.
2.4 Classification
Lasso as the Classifier and Feature Selector. Denote by B a corpus of N posts made in community or personal blogs. A document d ∈ B is denoted by x^(d) = [. . . , x_i^(d), . . .], a vector of features. The feature sets experimented with in this work are topics, extracted through topic modelling (LDA), and language styles (LIWC). When topics are the features, x_i^(d) is the probability of topic i in document d. When LIWC processes are the features, x_i^(d) represents the quantity of process i in document d. Our experimental design examines the effect of these two feature sets in classifying a blog post into one of two target classes. Given a document d ∈ B, we predict whether the document belongs to a Community or a Personal blog based on the textual features x^(d). We are interested not only in which sets of features perform well in the classification but also in which features in the sets are strongly predictive of depression. For this purpose, the least absolute shrinkage and selection operator (Lasso) [9], a regularized regression, is chosen.
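As an illustration of this choice, the sketch below fits an L1-penalised logistic regression with a one-standard-error style selection of the regularisation strength (the rule is described in the next paragraph), using scikit-learn as a stand-in for the glmnet-style implementation of [9]. The grid of C values and the helper name are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fit_lasso_1se(X, y, Cs=np.logspace(-3, 2, 20), cv=5):
    """Sketch: among candidate C values (inverse regularisation strength),
    pick the smallest C (strongest penalty) whose mean CV accuracy is within
    one standard error of the best, then refit on all data."""
    means, ses = [], []
    for C in Cs:
        scores = cross_val_score(
            LogisticRegression(penalty="l1", solver="liblinear", C=C),
            X, y, cv=cv, scoring="accuracy")
        means.append(scores.mean())
        ses.append(scores.std(ddof=1) / np.sqrt(cv))
    means, ses = np.array(means), np.array(ses)
    best = means.argmax()
    threshold = means[best] - ses[best]
    # the most regularised model still within one SE of the optimum
    chosen_C = min(C for C, m in zip(Cs, means) if m >= threshold)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=chosen_C).fit(X, y)
    return model  # model.coef_ holds signed feature weights; zeros are dropped features
```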
Lasso performs logistic regression and feature selection simultaneously, enabling an evaluation of both the prediction performance and the importance of each feature in the classification. In particular, when predicting community posts, Lasso assigns positive and negative weights to features associated with community and personal posts, respectively. Features irrelevant to the prediction are assigned zero weight. Thus, by examining its weights, we can learn the importance of each feature in the prediction. The regularization parameter (λ) in the regression model is chosen as the largest value for which the accuracy is still within one standard error of the optimum (the 1se rule). This prevents over-fitting, since not too many features are included in the model while the classification accuracy is still assured. For each run, we use five-fold cross-validation; that is, one held-out fold is used for testing and the other four folds for training. Accuracy was used to evaluate the performance of the classifications.
Other Classifiers. For comparison with the classification performed by Lasso, classifiers from other paradigms were also included:
– Naive Bayes (NB): a probabilistic method that constructs the conditional probability distributions of the underlying features given a class label. Unseen cases are then classified by comparing the class likelihoods.
– Support vector machine (SVM): a non-probabilistic binary classifier that finds the separating hyperplane between two classes with maximal margin.
– Logistic regression (LR): a non-regularized logistic regression model, as opposed to the regularized Lasso.
These classifiers perform the binary classification of Community versus Personal posts, using LIWC, topics, and a combination of them as the feature sets. Their accuracy is compared with that of Lasso on the same classifications.
Table 1. The mean (±std) of sentiment scores for community and personal posts.
Sentiment score   Community      Personal       h1  h2
Valence           5.29 ± 0.64    5.66 ± 0.62    1   1
Arousal           4.29 ± 0.34    4.23 ± 0.39    1   1
Dominance         5.33 ± 0.44    5.54 ± 0.42    1   1
Fig. 1. The mean of sentiment scores for posts made in community and personal blogs.
3 Depression in Community and Personal Settings
In this section, the differences between the Community and Personal blogs made by members of the depression community, with respect to the three feature sets, are discussed.
3.1 Affective Information
Table 1 shows the mean sentiment scores for Community and Personal posts, together with the results of the t-test and the rank-sum test. If a result is one, the test rejects the null hypothesis of equal means or medians at the 5 % significance level; otherwise, the test fails to reject the null hypothesis. As shown in the table, both tests rejected the null hypotheses. Figure 1 shows that the mean valence of the community posts was lower than that of the personal posts. This means that the posts made in the community blogs contain more unpleasant words and/or fewer pleasant words than those in the personal blogs. Similarly, the average dominance score of community posts was lower than that of personal posts. It is likely that, as members of a depression community, people use more submissive and/or fewer dominant words than they do in their own pages. On the other hand, the average arousal score of Community posts was slightly higher than that of Personal posts.
This probably implies that members tend to use more high active words and/or less low active words in their posts. 3.2 Psycho-Linguistic Features Table 2 shows the mean of LIWC linguistic features for the posts made in community and personal blogs, as well as the result of the statistical tests. For a Textual Cues for Online Depression in Community and Personal Settings 25 Table 2. The mean (±std) of LIWC features for posts made in community and personal blogs. Except for wc and wps are for the number of words for each post and number of words for each sentence in a post, respectively, the values for other features are the percentage of the LIWC categories in a post. LIWC Community Personal h1 h2 LIWC Community Personal h1 h 2 wc 225 ± 260 238 ± 456 1 1 anger 1.38 ± 1.9 1.09 ± 2.2 1 1 wps 15.2 ± 19.5 14.8 ± 24.3 0 1 sad 1.44 ± 1.8 0.60 ± 1.4 1 1 dic 90.88 ± 6.5 82.38 ± 14.3 1 1 cogmech 19.56 ± 5.3 15.57 ± 6.5 1 1 sixltr 12.98 ± 5.2 14.36 ± 7.2 1 1 insight 3.28 ± 2.2 2.12 ± 2 1 1 funct 60.86 ± 7.2 52.86 ± 13.1 1 1 cause 1.80 ± 1.6 1.39 ± 1.7 1 1 pronoun 20.86 ± 5.0 16.56 ± 6.8 1 1 discrep 2.28 ± 1.9 1.69 ± 2.0 1 1 ppron 14.29 ± 4.5 11.34 ± 5.6 1 1 tentat 3.43 ± 2.6 2.64 ± 2.5 1 1 i 11.13 ± 4.6 7.60 ± 5.0 1 1 certain 1.73 ± 1.6 1.36 ± 1.7 1 1 we 0.28 ± 0.8 0.49 ± 1.2 1 1 inhib 0.53 ± 1.0 0.57 ± 1.5 1 1 you 1.10 ± 2.2 1.65 ± 3.1 1 1 incl 4.32 ± 2.4 4.08 ± 2.9 1 1 shehe 1.20 ± 2.1 1.12 ± 2.1 1 1 excl 3.88 ± 2.4 2.84 ± 2.5 1 1 they 0.58 ± 1.1 0.49 ± 1.0 1 1 percept 2.34 ± 2.1 2.45 ± 2.8 1 1 ipron 6.57 ± 3.2 5.22 ± 3.4 1 1 see 0.54 ± 1.0 0.92 ± 1.7 1 1 article 4.10 ± 2.3 4.85 ± 3.2 1 1 hear 0.54 ± 1.0 0.62 ± 1.4 1 1 verb 18.21 ± 4.6 14.93 ± 6.0 1 1 feel 1.17 ± 1.4 0.74 ± 1.4 1 1 auxverb 10.94 ± 3.6 9.01 ± 4.5 1 1 bio 2.95 ± 2.7 2.86 ± 3.5 1 1 past 3.39 ± 2.7 3.25 ± 3.1 1 1 body 0.82 ± 1.3 0.97 ± 1.9 1 0 present 12.55 ± 4.6 9.56 ± 5.2 1 1 health 1.41 ± 1.8 0.76 ± 1.6 1 1 future 0.94 ± 1.1 0.99 ± 1.4 1 1 sexual 0.49 ± 1.1 0.70 ± 1.7 1 1 adverb 6.62 ± 3.0 5.43 ± 3.6 1 1 ingest 0.34 ± 1.0 0.57 ± 1.7 1 1 preps 11.63 ± 3.5 11.02 ± 4.4 1 1 relativ 12.83 ± 4.8 13.19 ± 6.2 1 1 conj 7.04 ± 2.8 5.94 ± 3.4 1 1 motion 1.64 ± 1.5 1.91 ± 2.2 1 1 negate 2.82 ± 2.1 1.91 ± 2.0 1 1 space 4.97 ± 2.7 5.16 ± 3.4 1 1 quant 2.92 ± 2.0 2.56 ± 2.2 1 1 time 5.84 ± 3.3 5.77 ± 4.2 0 1 number 0.65 ± 1.0 0.73 ± 1.4 1 1 work 1.17 ± 1.7 1.68 ± 2.7 1 1 swear 0.52 ± 1.2 0.57 ± 1.6 1 1 achieve 1.35 ± 1.4 1.32 ± 1.8 1 1 social 8.14 ± 5.2 8.04 ± 5.8 0 1 leisure 0.64 ± 1.2 1.45 ± 2.4 1 1 family 0.40 ± 0.9 0.33 ± 1.0 1 1 home 0.34 ± 0.7 0.49 ± 1.3 1 1 friend 0.36 ± 0.8 0.31 ± 1.0 1 1 money 0.29 ± 0.8 0.56 ± 1.5 1 1 humans 0.71 ± 1.1 0.72 ± 1.4 0 1 relig 0.22 ± 0.8 0.37 ± 1.3 1 1 affect 7.54 ± 3.8 6.57 ± 4.6 1 1 death 0.37 ± 1.0 0.23 ± 1.0 1 1 posemo 3.04 ± 2.5 3.82 ± 3.5 1 1 assent 0.25 ± 0.7 0.63 ± 1.5 1 1 negemo 4.39 ± 3.2 2.68 ± 3.2 1 1 nonfl 0.18 ± 0.5 0.23 ± 0.7 1 1 anx 0.67 ± 1.1 0.36 ± 0.9 1 1 filler 0.02 ± 0.2 0.02 ± 0.4 0 1 26 T. Nguyen et al. Table 3. The mean (±std) of topic distribution for posts made in community and personal blogs. 
Topic Community Personal h1 h2 Topic Community Personal h 1 h2 T1 3.24 ± 3.3 1.56 ± 1.1 1 1 T26 2.14 ± 1.7 1.92 ± 1.7 1 1 T2 1.42 ± 0.8 1.93 ± 2.7 1 1 T27 2.41 ± 2.0 2.07 ± 1.7 1 1 T3 1.55 ± 0.8 1.81 ± 2.1 1 1 T28 1.75 ± 1.3 2.19 ± 2.5 1 1 T4 2.31 ± 2.3 1.76 ± 1.5 1 1 T29 1.55 ± 1.7 1.69 ± 2.2 1 1 T5 1.30 ± 0.8 1.67 ± 3.2 1 1 T30 1.73 ± 1.2 2.08 ± 2.7 1 1 T6 2.25 ± 1.7 1.86 ± 1.5 1 1 T31 1.81 ± 1.9 2.40 ± 4.0 1 1 T7 2.11 ± 2.1 2.24 ± 2.6 1 1 T32 1.85 ± 1.6 1.83 ± 1.6 0 1 T8 2.44 ± 2.1 1.75 ± 1.3 1 1 T33 2.09 ± 1.7 2.06 ± 1.9 0 1 T9 2.08 ± 1.9 2.03 ± 1.9 1 0 T34 1.60 ± 1.2 2.18 ± 2.7 1 1 T10 2.08 ± 1.8 1.90 ± 1.6 1 1 T35 2.28 ± 3.3 2.05 ± 2.0 1 1 T11 1.83 ± 1.6 2.08 ± 2.4 1 1 T36 2.26 ± 1.8 1.90 ± 1.5 1 1 T12 2.54 ± 2.2 1.79 ± 1.4 1 1 T37 1.75 ± 1.3 1.87 ± 1.9 1 1 T13 2.49 ± 2.2 1.85 ± 1.5 1 1 T38 2.13 ± 1.6 2.00 ± 1.9 1 1 T14 1.55 ± 0.9 2.06 ± 2.2 1 1 T39 1.90 ± 1.6 2.18 ± 2.3 1 1 T15 1.81 ± 1.4 2.30 ± 4.3 1 1 T40 2.45 ± 2.1 1.76 ± 1.3 1 1 T16 1.68 ± 1.2 1.89 ± 1.9 1 1 T41 1.86 ± 1.7 2.16 ± 2.3 1 1 T17 1.68 ± 1.4 2.31 ± 2.9 1 1 T42 1.86 ± 1.5 2.16 ± 2.2 1 1 T18 2.21 ± 1.6 2.01 ± 1.5 1 1 T43 1.79 ± 1.5 2.15 ± 2.4 1 1 T19 1.63 ± 1.2 2.29 ± 2.8 1 1 T44 1.72 ± 1.2 1.87 ± 2.1 1 1 T20 2.49 ± 2.7 2.05 ± 1.9 1 1 T45 1.63 ± 1.0 2.25 ± 3.6 1 1 T21 2.49 ± 2.0 1.74 ± 1.3 1 1 T46 1.52 ± 0.9 2.27 ± 3.2 1 1 T22 1.58 ± 1.6 2.08 ± 3.2 1 1 T47 2.09 ± 1.5 1.89 ± 1.4 1 1 T23 2.30 ± 2.1 1.90 ± 1.5 1 1 T48 1.61 ± 1.1 1.81 ± 1.9 1 1 T24 2.19 ± 2.3 2.10 ± 2.4 1 1 T49 2.84 ± 3.5 1.73 ± 1.7 1 1 T25 1.74 ± 1.2 2.37 ± 2.4 1 1 T50 2.41 ± 3.0 2.20 ± 2.6 1 1 majority of LIWC features, t-test and/or rank sum test rejected the null hypothesis of no difference between Community and Personal cohorts. There were only six of 68 LIWC features (wps, social, humans, body, time, and filler) whose the null hypothesis were failed to be rejected by either t-test or rank-test or both, at the 0.05 level. Figure 2 shows the relative difference in the use of LIWC features by Community and Personal cohorts. On the affective processes, it is observed that posts in the depression community blogs contained more negative (including anxiety, sadness, and anger ) emotion and less positive words than those in the depression personal blogs. Community blogs also have more affective and negate words, compared with Personal blogs. In addition, they have more pronoun in the posts Textual Cues for Online Depression in Community and Personal Settings 27 Table 4. The prominent topics in favor of Community (number in red) and Personal (number in blue) blogs. 
Topic T1 Word cloud community diagnosed support issues doctor meds doctors weeks psychiatrist T21 self thoughts anger negative control T40 Word cloud movie movies watch watching film season episode series scene T19 medication star watched story character big characters anime shows total princess evil major hello recently pills anti appointment hair black wear blue white red clothes wearing shirt dress pants looks green bought pink shoes dark dressed fit pair psych anxiety pill effexor medical T45 cry stop crying cried sit hurts reason anxiety therapist bipolar social advice health suicidal illness medication hospital effects prozac medicine started zoloft upset wrong head tears T12 T46 depression mental therapy disorder T49 Topic anymore away wanting handle horrible worse scream break feels sense esteem happiness physically emotionally fear sun water rain snow hair sex phone gone date animal black opposite favourite eyes fire sky feet tree ground weather wind winter light trees warm sea place horse moon dark mind emotions emotional normal lack feels deal mood lately worse worry anymore weeks control completely stress reason current color school movie crush food yep worst kissed T17 feelings favorite emotion sadness normal past T25 problem constantly fine gotten mind handle supposed birthday friday thursday house wednesday tuesday party saturday sunday wait coming sat hopefully dinner monday matt early decided spent plans Fig. 2. The relative difference between Community and Personal blogs in the mean of LIWC features in the posts. The relative difference is in percent, as defined in Eq. 1. Features in red are in the preference of Community category, while those in blue are in the favor of Personal category. (Color figure online) than do their Personal counterparts, except for first personal plural and second personal. This is partly in line with previous findings that depression people tend to use more pronouns [20]. 28 T. Nguyen et al. Fig. 3. The relative difference in the mean of topic proportion for the posts made in Community and Personal blogs. The relative difference is in percent, as defined in Eq. 1. Features in red are in the preference of Community category, while those in blue are in the favor of Personal category. (Color figure online) On the other hand, on the personal concerns, except for death, people are likely to share all experience in all problems (work, achievement, leisure, home, money, and religion) more in personal than in community blogs. On the biological processes, whilst health words were found more in community blogs, all the others (body, sexual, and ingestion) were used more in personal blogs. It shows a focus on the topic discussed within the community – health concerns. On the spoken categories, except fillers which was not found to be significantly different between the two cohorts, assent and non-fluencies were used more in personal than in community blogs. It appears that people could freely post in their own pages with informal and unprepared text. Similarly, swear words were used more in personal than in community pages. It may be because posting to the communities is often gone through a moderation process. So, a post with inappropriate words could be rejected to be posted to the community pages. 
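The per-feature comparisons reported in this and the previous subsection (the h1/h2 indicators and the relative differences plotted in Figs. 2 and 3) can be reproduced in a few lines. The SciPy sketch below is only illustrative; it assumes, for instance, the textbook equal-variance form of the two-sample t-test, which the paper does not specify.

```python
import numpy as np
from scipy import stats

def compare_feature(comm_values, pers_values, alpha=0.05):
    """Sketch of the per-feature comparison behind Tables 1-3 and Figs. 2-3:
    two-sample t-test (h1), Wilcoxon rank-sum test (h2), and the relative
    difference of Eq. (1), in percent."""
    comm, pers = np.asarray(comm_values), np.asarray(pers_values)
    h1 = int(stats.ttest_ind(comm, pers).pvalue < alpha)   # equal-means test
    h2 = int(stats.ranksums(comm, pers).pvalue < alpha)    # equal-medians test
    diff = (comm.mean() - pers.mean()) / ((comm.mean() + pers.mean()) / 2) * 100
    return h1, h2, diff
```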
3.3 Topical Representation Table 3 shows the mean of the proportion of topics for community and personal blogs, as well as the result of the statistical tests.2 For 47 of 50 topics, t-test 2 All 50 topics learned from the corpus by LDA are placed at http://bit.ly/1KEgjpM. Textual Cues for Online Depression in Community and Personal Settings 29 Table 5. Performance, in terms of the predictive accuracy (percentage of correct predictions), of different classifiers on different feature sets in the binary classifications of Community versus Personal posts. Feature SVM NB LR LIWC 59.6 77.5 78.9 Topic 77.7 69.7 77.9 77.8 Joint features 59.2 74.6 80.3 80.5 68 Lasso and/or rank sum test rejected the null hypothesis of no difference between community and personal cohorts. Figure 3 shows the difference in the distribution of topics for posts made by community and personal cohorts. The topics with the highest difference were presented in Table 4. It is obvious to see in the table that topic 1 with depression themes (“depression”, “anxiety”, “therapist”) was used much more in community than in personal blogs. Likewise, topic 49 with “doctor”, “medicine”, “hospital” was found in the preference of community posts, with the second highest relative difference, in comparison with personal posts. This partly confirms the finding in the language section that community posts were more focused on health concerns than were personal posts. However, it is not straightforward to interpret the meaning of the other three topics in the favor of the community cohort. Nevertheless, “emotional and submissive” sense could be observed across the three topics. They also consisted of many affective words, confirming that community posts contained more affective words than did personal posts. On the other hand, personal cohort favored more on concrete concepts, from “movie” or fashion (“hair”) to weather (“sun”, “rain”) or “birthday”, than did Community cohort. Most of these topics contain words in the LIWC “leisure” feature, which was found to be a favor language category for personal posts. 4 4.1 Classification Performance The Lasso model [9] is used for the classification. Using the coefficients derived from the Lasso method, we implemented a pair-wise classifier of Community versus Personal posts, using three feature sets: LIWC, topics, and a combination of them. The accuracy of this classifier in different feature sets are shown in Table 5, accompanied with that of SVM, Naive Bayes, and Logistic Regression. Lasso outperformed other classifiers when LIWC and a combination of LIWC and topics were used as the features, and was second to LR (77.8 % and 77.9 %), when topics were the features. Classifying based on the derived topics outperformed the LIWC linguistic feature analysis in three of the four classifiers. Noticeably, a fusion of the features 30 T. Nguyen et al. gained the best performance in all the classifications, except for the case when SVM was the classifier. This shows the potential of using multi-cues for mental health prediction. 4.2 Linguistic Features as the Predictors Figure 4a shows the model using language style cues as features to predict community versus personal posts. Several features found significantly different between community vs. personal cohorts (see Sect. 3.2) were chosen into the prediction model. For example, for the affective category, negative, sadness, and anxiety were assigned large positive coefficients, whilst positive emotion had negative coefficient. 
The positive predictors also included negation, which was found in the preference of community blogs, confirming the finding in Sect. 3.2. Of personal concerns chosen into the model, only death was assigned positive coefficient, whilst the rest (work, leisure, home, money, and religion) had negative coefficients. This is consistent with the findings presented in Sect. 3.2. Similarly, on the biological processes, health was a positive predictor while body, sexual, and ingestion were negative ones. As expected, assent and swear words were also negative predictors in the model. Noticeably, assent had the largest negative coefficient, becoming the most predictive feature of personal posts. 4.3 Topics as the Predictors Figure 4b shows the classification model to predict community versus personal posts using topics as features. Table 6 shows the word cloud of topics whose the weights is largest in the model. It confirms that the top five topics in the preference of community blogs, which were “emotional and submissive” and focused on health concerns, as shown in Table 4, were among the top six positive predictors in the model. On the other hand, fashion and leisure topics, which were in the preference of personal blogs, were assigned negative coefficients. The topic on sex was also a negative predictor, which is in line with the finding on LIWC as the features that sexual, a biological process, was a negative predictor. Similarly, the topic of “eat”, whose words belong to ingestion, another biological process, was also assigned a negative coefficient in the prediction model. 5 Limitation and Further Research In the current work, the mental health status of the individuals in the online communities was not validated. Instead, labeling was “by affiliation” [5]. As such, it cannot be stated that all the individuals whose communications were analysed were experiencing depression. Future studies would benefit from attempting to validate this either by direct contact with the individuals or by analyzing the conversations for admission of diagnosis [4–6,8]. If an admission is identified, other Textual Cues for Online Depression in Community and Personal Settings 31 (a) Lasso model coefficients for prediction of community (versus personal) posts using LIWC features as the predictors. (b) Lasso model coefficients for prediction of community (versus personal) posts using topics as the predictors. Fig. 4. Prediction models of community (versus personal) posts using topics and language styles as features. Features in red are positive predictors of community posts whilst the blues are the negatives. Individual coefficients that were not significant have been omitted, as have topics and LIWC features with no significant coefficients. (Color figure online) 32 T. Nguyen et al. Table 6. Topics with high weights in the prediction of community posts (versus personal ones). Positive weights are in red and negatives are in blue. 
Topic T8 Word cloud end live kill suicide Topic T14 anymore death dead living point alive killing suicidal dying tried longer attempt hell failure worth thoughts Word cloud pictures watch picture funny camera mike photos big pics youtube ryan T1 mental therapy disorder T21 community diagnosed support issues doctor meds doctors weeks psychiatrist T40 cried sit hurts reason movies watch watching T2 anymore horrible mark david finished michael robert bob jane anger negative control T4 medication pills anti appointment birthday friday normal past completely stress reason constantly fine gotten mind T5 problem handle supposed T28 feelings mind emotions emotional normal lack feels emotion sadness T30 cutting block care stop blood body hurt deep skin arm head scars arms left cuts knife hurts urge away stomach fault away lie real T23 T19 hurt anymore truth cares wrong feelings hurts somebody blame trust matter deserve girls past close broke read date break dating meet hang boy friendship sleep woke asleep stay late fall slept hour oacute automatically iacute asma aacute twitter sin written wrote paper sex woman women hair black ella personality art stories word letter reading words project poetry ideas interesting page men gay children straight marriage wear blue white red clothes tweets daily beta puede protect copied write writing book story books T34 months sexual partner quizilla easily young body gender wearing shirt dress pants looks green bought pink shoes dark dressed fit pair music song T31 bed energy half waking head early band listening songs wanted single couple relationships listen playing rock T10 monday matt early decided spent plans wanted inside cared boyfriend girl relationship girlfriend wait coming sat hopefully dinner writer type female male test T13 world ray alan mrs party saturday sunday thursday house wednesday tuesday sense esteem happiness physically emotionally fear cut pain james peter scott george king frank various richard read psych anxiety pill effexor medical deal mood lately worse self thoughts character big characters anime shows total princess evil worse scream break feels ntilde eacute shipped loudtwitter T12 star watched story yourwintr random jon icon major hello recently T25 hospital movie film season episode series scene away wanting handle effects prozac medicine started zoloft worry anymore weeks control anxiety therapist bipolar social advice health suicidal illness medication cry stop crying upset wrong head tears T49 T46 depression amazing video watching sweet swandive eat weight heard concert sound hear lyrics sing singing eating food fat dinner ate pounds album radio guitar artist pop lose sugar diet lbs exercise body breakfast cup water calories chicken gym wake sleeping lay fell rest eat statements made by that individual can be crawled for analysis. Alternatively, establishing a study and asking to recruit participants via these online communities would allow standardized diagnostic measurements of mental health to be administered. By having a more complete understanding of the mental health of the individuals in these communities, we are able to improve the profiling of their discussion and linguistic features so that a clearer match between online behavior and diagnosis can be made. An example of this approach was conducted in [16], where depression scores of Twitter participants were collected and LIWC features were used as the predictors of the scores. 
Similarly, the depression ground-truth for Twitter users was Textual Cues for Online Depression in Community and Personal Settings 33 collected individually in [21] (as opposed to “affiliation labeling” in this work), where tweeting behaviors were employed as the predictors. 6 Conclusion We have investigated online depression blogs in community and personal settings. Affect information, language styles, and topics were extracted from the posts made in depression community blogs, as well as those made by members of the depression community in their own personal blogs. Statistical and machine learning techniques were used to discriminate the content in the two settings. A majority of features in all aspects were found to be significantly different between community and personal blogging by members of the community. Also, language styles and topics were found to have strong indicative powers in prediction of depression in personal and community settings. High performance in the classifications illustrates the effectiveness of the content-based cues as the discriminators. The result shows the potential of the new media in early screening and monitoring of online depression individuals and communities. References 1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003) 2. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW): instruction manual and affective ratings. Technical report, University of Florida (1999) 3. Bridge, J.A., Goldstein, T.R., Brent, D.A.: Adolescent suicide and suicidal behavior. J. Child Psychol. Psychiatry 47(3–4), 372–394 (2006) 4. Coppersmith, G., Dredze, M., Harman, C.: Quantifying mental health signals in Twitter. In: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pp. 51–60 (2014) 5. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K.: From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses. In: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pp. 1–10 (2015) 6. Coppersmith, G., Harman, C., Dredze, M.: Measuring post traumatic stress disorder in Twitter. In: International AAAI Conference on Weblogs and Social Media, pp. 579–582 (2014) 7. Cruwys, T., Haslam, S.A., Dingle, G.A., Haslam, C., Jetten, J.: Depression and social identity: an integrative review. Pers. Soc. Psychol. Rev. 18(3), 215–238 (2014) 8. De Choudhury, M., Counts, S., Horvitz, E.: Major life changes and behavioral markers in social media: Case of childbirth. In: Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pp. 1431–1442 (2013) 9. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010) 10. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. 101(90001), 5228–5235 (2004) 34 T. Nguyen et al. 11. Mundt, J.C., Vogel, A.P., Feltner, D.E., Lenderking, W.R.: Vocal acoustic biomarkers of depression severity and treatment response. Biol. Psychiatry 72(7), 580–587 (2012) 12. Nguyen, T., Phung, D., Adams, B., Venkatesh, S.: A sentiment-aware approach to community formation in social media. In: Proceedings of the International AAAI Conference on Weblogs and Social Media, pp. 527–530 (2012) 13. Nguyen, T.: Mood patterns and affective lexicon access in weblogs. 
In: Proceedings of the ACL Student Research Workshop, pp. 43–48 (2010) 14. Nguyen, T., Phung, D., Venkatesh, S.: Analysis of psycholinguistic processes and topics in online autism communities. In: Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1–6 (2013) 15. Nock, M.K., Green, J.G., Hwang, I., McLaughlin, K.A., Sampson, N.A., Zaslavsky, A.M., Kessler, R.C.: Prevalence, correlates, and treatment of lifetime suicidal behavior among adolescents: Results from the national comorbidity survey replication adolescent supplement. JAMA Psychiatry 70(3), 300–310 (2013) 16. Park, M., Cha, C., Cha, M.: Depressive moods of users portrayed in Twitter. In: Proceedings of the ACM SIGKDD Workshop on Healthcare Informatics, pp. 1–8 (2012) 17. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count (LIWC) [Computer software]. LIWC Inc. (2007) 18. Phung, D., Gupta, S., Nguyen, T., Venkatesh, S.: Connectivity, online social capital, and mood: a Bayesian nonparametric analysis. IEEE Trans. Multimedia 15(6), 1316–1325 (2013) 19. Rodriguez, A.J., Holleran, S.E., Mehl, M.R.: Reading between the lines: the lay assessment of subclinical depression from written self-descriptions. J. Pers. 78(2), 575–598 (2010) 20. Tausczik, Y.R., Pennebaker, J.W.: The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29(1), 24–54 (2010) 21. Tsugawa, S., Kikuchi, Y., Kishino, F., Nakajima, K., Itoh, Y., Ohsaki, H.: Recognizing depression from Twitter activity. In: Proceedings of the ACM Conference on Human Factors in Computing Systems, pp. 3187–3196 (2015) 22. Waxer, P.H.: Nonverbal cues for depth of depression: set versus no set. J. Consult. Clin. Psychol. 44(3), 493 (1976) Confidence-Weighted Bipartite Ranking Majdi Khalid(B) , Indrakshi Ray, and Hamidreza Chitsaz Colorado State University, Fort Collins, USA majdi.khaled@gmail.com, indrakshi.ray@colostate.edu, chitsaz@chitsazlab.org http://www.colostate.edu Abstract. Bipartite ranking is a fundamental machine learning and data mining problem. It commonly concerns the maximization of the AUC metric. Recently, a number of studies have proposed online bipartite ranking algorithms to learn from massive streams of class-imbalanced data. These methods suggest both linear and kernel-based bipartite ranking algorithms based on first and second-order online learning. Unlike kernelized ranker, linear ranker is more scalable learning algorithm. The existing linear online bipartite ranking algorithms lack either handling non-separable data or constructing adaptive large margin. These limitations yield unreliable bipartite ranking performance. In this work, we propose a linear online confidence-weighted bipartite ranking algorithm (CBR) that adopts soft confidence-weighted learning. The proposed algorithm leverages the same properties of soft confidence-weighted learning in a framework for bipartite ranking. We also develop a diagonal variation of the proposed confidence-weighted bipartite ranking algorithm to deal with high-dimensional data by maintaining only the diagonal elements of the covariance matrix. We empirically evaluate the effectiveness of the proposed algorithms on several benchmark and high-dimensional datasets. The experimental results validate the reliability of the proposed algorithms. The results also show that our algorithms outperform or are at least comparable to the competing online AUC maximization methods. 
Keywords: Online ranking · Imbalanced learning · AUC maximization 1 Introduction Bipartite ranking is a fundamental machine learning and data mining problem because of its wide range of applications such as recommender systems, information retrieval, and bioinformatics [1,25,29]. Bipartite ranking has also been shown to be an appropriate learning algorithm for imbalanced data [6]. The aim of the bipartite ranking algorithm is to maximize the Area Under the Curve (AUC) by learning a function that scores positive instances higher than negative instances. Therefore, the optimization problem of such a ranking model is formulated as the minimization of a pairwise loss function. This ranking problem can be solved by applying a binary classifier to pairs of positive and negative c Springer International Publishing AG 2016  J. Li et al. (Eds.): ADMA 2016, LNAI 10086, pp. 35–49, 2016. DOI: 10.1007/978-3-319-49586-6 3 36 M. Khalid et al. instances, where the classification function learns to classify a pair as positive or negative based on the first instance in the pair. The key problem of this approach is the high complexity of a learning algorithm that grows quadratically or subquadratically [19] with respect to the number of instances. Recently, significant efforts have been devoted to developing scalable bipartite ranking algorithms to optimize AUC in both batch and online settings [12,15,17,24,37]. Online learning approach is an appealing statistical method because of its scalability and effectivity. The online bipartite ranking algorithms can be classified on the basis of the learned function into linear and nonlinear ranking models. While they are advantageous over linear online ranker in modeling nonlinearity of the data, kernelized online ranking algorithms require a kernel computation for each new training instance. Further, the decision function of the nonlinear kernel ranker depends on support vectors to construct the kernel and to make a decision. Online bipartite ranking algorithms can also be grouped into two different schemes. The first scheme maintains random instances from each class label in a finite buffer [20,37]. Once a new instance is received, the buffer is updated based on stream oblivious policies such as Reservoir Sampling (RS) and First-In-FirstOut (FIFO). Then the ranker function is updated based on a pair of instances, the new instance and each opposite instance stored in the corresponding buffer. These online algorithms are able to deal with non-separable data, but they are based on simple first-order learning. The second approach maintains the first and second statistics [15], and is able to adapt the ranker to the importance of the features [12]. However, these algorithms assume the data are linearly separable. Moreover, these algorithms make no attempt to exploit the confidence information, which has shown to be very effective in ameliorating the classification performance [8,13,35]. Confidence-weighted (CW) learning takes the advantage of the underlying structure between features by modeling the classifier (i.e., the weight vector) as a Gaussian distribution parameterized by a mean vector and covariance matrix [13]. This model captures the notion of confidence for each weight coordinate via the covariance matrix. A large diagonal value corresponding to the i-th feature in the covariance matrix results in less confidence in its weight (i.e., its mean). Therefore, an aggressive update is performed on the less confident weight coordinates. 
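To make the idea concrete, the toy sketch below shows the flavour of such an update with a diagonal covariance: coordinates with large variance (low confidence) move further, and coordinates that keep receiving evidence become more confident. It is only an illustration of the principle, not the exact CW, AROW, or SCW update rules discussed later in the paper.

```python
import numpy as np

def toy_confidence_weighted_step(mu, var, x, y, lr=1.0, r=1.0):
    """Toy illustration of a confidence-weighted update with diagonal covariance.
    mu: mean weight vector, var: per-coordinate variance (inverse confidence),
    x: feature vector, y: label in {-1, +1}."""
    mu = mu + lr * y * var * x           # low-confidence coordinates take bigger steps
    var = var / (1.0 + var * x**2 / r)   # coordinates that see evidence grow more confident
    return mu, var
```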
This is analogous to the adaptive subgradient method [14] that involves the geometric structure of the data seen so far in regularizing the weights of sparse features (i.e., less occurring features) as they are deemed more informative than dense features. The confidence-weighted algorithm [13] has also been improved by introducing the adaptive regularization (AROW) that deals with inseparable data [10]. The soft confidenceweighted (SCW) algorithm improves upon AROW by maintaining an adaptive margin [35]. In this paper, we propose a novel framework that solves a linear online bipartite ranking using the soft confidence-weighted algorithm [35]. The proposed confidence-weighted bipartite ranking algorithm (CBR) entertains the fast training and testing phases of linear bipartite ranking. It also enjoys the Confidence-Weighted Bipartite Ranking 37 capability of the soft confidence-weighted algorithm in learning confidenceweighted model, handling linearly inseparable data, and constructing adaptive large margin. The proposed framework follows an online bipartite ranking scheme that maintains a finite buffer for each class label while updating the buffer by one of the stream oblivious policies such as Reservoir Sampling (RS) and FirstIn-First-Out (FIFO) [32]. We also provide a diagonal variation (CBR-diag) of the proposed algorithm to handle high-dimensional datasets. The remainder of the paper is organized as follows. In Sect. 2, we briefly review closely related work. We present the confidence-weighted bipartite ranking (CBR) and its diagonal variation (CBR-diag) in Sect. 3. The experimental results are presented in Sect. 4. Section 5 concludes the paper and presents some future work. 2 Related Work The proposed bipartite ranking algorithm is closely related to the online learning and bipartite ranking algorithms. What follows is a brief review of recent studies related to these topics. Online Learning. The proliferation of big data and massive streams of data emphasize the importance of online learning algorithms. Online learning algorithms have shown a comparable classification performance compared to batch learning, while being more scalable. Some online learning algorithms, such as the Perceptron algorithm [30], the Passive-Aggressive (PA) [7], and the online gradient descent [38], update the model based on a first-order learning approach. These methods do not take into account the underlying structure of the data during learning. This limitation is addressed by exploring second-order information to exploit the underlying structure between features in ameliorating learning performance [4,10,13,14,27,35]. Moreover, kernelized online learning methods have been proposed to deal with nonlinearly distributed data [3,11,28]. Bipartite Ranking. Bipartite ranking learns a real-valued function that induces an order on the data in which the positive instances precede negative instances. The common measure used to evaluate the success of the bipartite ranking algorithm is the AUC [16]. Hence the minimization of the bipartite ranking loss function is equivalent to the maximization of the AUC metric. The AUC presents the probability that a model will rank a randomly drawn positive instance higher than a random negative instance. In batch setting, a considerable amount of studies have investigated the optimization of linear and nonlinear kernel ranking function [5,18,22,23]. More scalable methods are based on stochastic or online learning [31,33,34]. 
However, these methods are not specifically designed to optimize the AUC metric. Recently, a few studies have focused on the optimization of the AUC metric in online setting. The first approach adopted a framework that maintained a buffer with limited capacity to store random instances to deal with the pairwise loss function [20,37]. The other methods maintained only the first and second statistics for each received instance and 38 M. Khalid et al. optimized the AUC in one pass over the training data [12,15]. The work [12] exploited the second-order technique [14] to make the ranker aware of the importance of less frequently occurring features, hence updating their weights with a higher learning rate. Our proposed method follows the online framework that maintains fixed-sized buffers to store instances from each class label. Further, it exploits the online second-order method [35] to learn a robust bipartite ranking function. This distinguishes the proposed method from [20,37] employing first-order online learning. Also, the proposed method is different from [12,15] by learning a confidenceweighted ranker capable of dealing with non-separable data and learning an adaptive large margin. The most similar approaches to our method are [33,34]. However, these are not designed to directly maximize the AUC. They also use classical first-order and second-order online learning whereas we use the soft variation of confidence-weighted learning that has shown a robust performance in the classification task [35]. 3 3.1 Online Confidence-Weighted Bipartite Ranking Problem Setting We consider a linear online bipartite ranking function that learns on imbal− ∈ Rd |i = anced data to optimize the AUC metric [16]. Let S = {x+ i ∪ xj {1, . . . , n}, j = {1, . . . , m}} denotes the input space of dimension d generated − from unknown distribution D, where x+ i is the i-th positive instance and xj is the j-th negative instance. The n and m denote the number of positive and negative instances, respectively. The linear bipartite ranking function f : S → R is a real valued function that maximizes the AUC metric by minimizing the following loss function: n m 1  − I(f (x+ L(f ; S) = i ) ≤ f (xj )), nm i=1 j=1 where f (x) = wT x and I(·) is the indicator function that outputs 1 if the condition is held, and 0 otherwise. It is common to replace the indicator function with a convex surrogate function, n m 1  − (f (x+ L(f ; S) = i ) − f (xj )), nm i=1 j=1 where (z) = max(0, 1 − z) is the hinge loss function. Therefore, the objective function we seek to minimize with respect to the linear function is defined as follows: n m  1 − ||w||2 + C max(0, 1 − w(x+ i − xj )), 2 i=1 j=1 (1) Confidence-Weighted Bipartite Ranking 39 where ||w|| is the euclidean norm to regularize the ranker, and C is the hyperparameter to penalize the sum of errors. It is easy to see that the optimization problem (1) will grow quadratically with respect to the number of training instances. Following the approach suggested by [37] to deal with the complexity of the pairwise loss function, we reformulate the objective function (1) as a sum of two losses for a pair of instances, T  1 ||w||2 + C I(yt =+1) gt+ (w) + I(yt =−1) gt− (w), 2 t=1 (2) where T = n + m, and gt (w) is defined as follows gt+ (w) = t−1  t =1 gt− (w) = t−1  t =1 I(yt =−1) max(0, 1 − w(xt − xt )), (3) I(yt =+1) max(0, 1 − w(xt − xt )). 
(4) Instead of maintaining all the received instances to compute the gradients ∇gt (w), we store random instances from each class in the corresponding buffer. Therefore, two buffers B+ and B− with predefined capacity are maintained for positive and negative classes, respectively. The buffers are updated using a stream oblivious policy. The current stored instances in a buffer are used to update the classifier as in Eq. (2) whenever a new instance from the opposite class label is received. The framework of the online confidence-weighted bipartite ranking is shown in Algorithm 1. The two main components of this framework are UpdateBuffer and UpdateRanker, which are explained in the following subsections. 3.2 Update Buffer One effective approach to deal with pairwise learning algorithms is to maintain a buffer with a fixed capacity. This raises the problem of updating the buffer to store the most informative instances. In our online Bipartite ranking framework, we investigate the following two stream oblivious policies to update the buffer: Reservoir Sampling (RS): Reservoir Sampling is a common oblivious policy to deal with streaming data [32]. In this approach, the new instance (xt , yt ) is added to the corresponding buffer if its capacity is not reached, |Byt t | < Myt . If the buffer is at capacity, it will be updated with probability replacing one instance in Byt t Myt |Byt t | by randomly with xt . First-In-First-Out (FIFO): This simple strategy replaces the oldest instance with the new instance if the corresponding buffer reaches its capacity. Otherwise, the new instance is simply added to the buffer. 40 M. Khalid et al. Algorithm 1. A Framework for Confidence-Weighted Bipartite Ranking (CBR) Input: • • • • the penalty parameter C the capacity of the buffers M+ and M− η parameter ai = 1 for i ∈ 1, . . . , d Initialize: μ1 = {0, . . . , 0}d , B+ = B− = ∅ Σ1 = diag(a) or G1 = a for t = 1, . . . , T do Receive a training instance (xt , yt ) if yt = +1 then t+1 t = B− B− t Ct = C max(1, |B− |/M− ) t+1 t = UpdateBuffer(xt , B+ , M+ ) B+ t+1 , η) [μt+1 , Σt+1 ] = UpdateRanker(μt , Σt , xt , yt , Ct , B− t+1 , η) [μt+1 , Gt+1 ] = UpdateRanker(μt , Gt , xt , yt , Ct , B− else t+1 t = B+ B+ t Ct = C max(1, |B+ |/M+ ) t+1 t , M− ) B− = UpdateBuffer(xt , B− t+1 , η) [μt+1 , Σt+1 ] = UpdateRanker(μt , Σt , xt , yt , Ct , B+ t+1 , η) [μt+1 , Gt+1 ] = UpdateRanker(μt , Gt , xt , yt , Ct , B+ end if end for 3.3 or (CBR-diag) or (CBR-diag) Update Ranker Inspired by the robust performance of second-order learning algorithms, we apply the soft confidence-weighted learning approach [35] to updated the bipartite ranking function. Therefore, our confidence-weighted bipartite ranking model (CBR) is formulated as a ranker with a Gaussian distribution parameterized by mean vector μ ∈ Rd and covariance matrix Σ ∈ Rd×d . The mean vector μ represents the model of the bipartite ranking function, while the covariance matrix captures the confidence in the model. The ranker is more confident about the model value μp as its diagonal value Σp,p is smaller. The model distribution is updated once the new instance is received while being close to the old model distribution. This optimization problem is performed by minimizing the Kullback-Leibler divergence between the new and the old distributions of the model. 
The online confidence-weighted bipartite ranking (CBR) is formulated as follows:

(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} D_{KL}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_t, \Sigma_t)\big) + C\, \ell^{\phi}\big(\mathcal{N}(\mu, \Sigma); (z, y_t)\big),   (5)

where z = (x_t - x), C is the penalty hyperparameter, \phi = \Phi^{-1}(\eta), and \Phi is the normal cumulative distribution function. The loss function \ell^{\phi}(\cdot) is defined as

\ell^{\phi}\big(\mathcal{N}(\mu, \Sigma); (z, y_t)\big) = \max\big(0,\ \phi\sqrt{z^\top \Sigma z} - y_t\, \mu \cdot z\big).

The solution of (5) is given by the following proposition.

Proposition 1. The optimization problem (5) has a closed-form solution:

\mu_{t+1} = \mu_t + \alpha_t y_t \Sigma_t z, \qquad \Sigma_{t+1} = \Sigma_t - \beta_t \Sigma_t z z^\top \Sigma_t.

The coefficients \alpha_t and \beta_t are defined as

\alpha_t = \min\Big\{C, \max\Big\{0,\ \frac{1}{\upsilon_t \zeta}\Big(-m_t \psi + \sqrt{\tfrac{m_t^2 \phi^4}{4} + \upsilon_t \phi^2 \zeta}\Big)\Big\}\Big\}, \qquad \beta_t = \frac{\alpha_t \phi}{\sqrt{u_t} + \upsilon_t \alpha_t \phi},

where u_t = \tfrac{1}{4}\big(-\alpha_t \upsilon_t \phi + \sqrt{\alpha_t^2 \upsilon_t^2 \phi^2 + 4\upsilon_t}\big)^2, \upsilon_t = z^\top \Sigma_t z, m_t = y_t(\mu_t \cdot z), \phi = \Phi^{-1}(\eta), \psi = 1 + \tfrac{\phi^2}{2}, \zeta = 1 + \phi^2, and z = x_t - x.

Proposition 1 is analogous to the one derived in [35].

Though modeling the full covariance matrix lends the CW algorithms a powerful learning capability [9,26,35], it raises potential concerns with high-dimensional data. The covariance matrix grows quadratically with the data dimension, which makes the CBR algorithm impractical for high-dimensional data due to high computational and memory requirements. We remedy this deficiency by a diagonalization technique [9,14]. Therefore, we present a diagonal confidence-weighted bipartite ranking (CBR-diag) that models the ranker with a mean vector \mu \in \mathbb{R}^d and a diagonal matrix \hat{\Sigma} \in \mathbb{R}^{d\times d}. Let G denote diag(\hat{\Sigma}); the optimization problem of CBR-diag is formulated as follows:

(\mu_{t+1}, G_{t+1}) = \arg\min_{\mu, G} D_{KL}\big(\mathcal{N}(\mu, G)\,\|\,\mathcal{N}(\mu_t, G_t)\big) + C\, \ell^{\phi}\big(\mathcal{N}(\mu, G); (z, y_t)\big).   (6)

Proposition 2. The optimization problem (6) has a closed-form solution:

\mu_{t+1} = \mu_t + \alpha_t y_t \frac{z}{G_t}, \qquad G_{t+1} = G_t + \beta_t z^2.

The coefficients \alpha_t and \beta_t are defined as

\alpha_t = \min\Big\{C, \max\Big\{0,\ \frac{1}{\upsilon_t \zeta}\Big(-m_t \psi + \sqrt{\tfrac{m_t^2 \phi^4}{4} + \upsilon_t \phi^2 \zeta}\Big)\Big\}\Big\}, \qquad \beta_t = \frac{\alpha_t \phi}{\sqrt{u_t} + \upsilon_t \alpha_t \phi},

where u_t = \tfrac{1}{4}\big(-\alpha_t \upsilon_t \phi + \sqrt{\alpha_t^2 \upsilon_t^2 \phi^2 + 4\upsilon_t}\big)^2, \upsilon_t = \sum_{i=1}^{d} \frac{z_i^2}{G_i + C}, m_t = y_t(\mu_t \cdot z), \psi = 1 + \tfrac{\phi^2}{2}, \zeta = 1 + \phi^2, \phi = \Phi^{-1}(\eta), and z = x_t - x.

Propositions 1 and 2 can be proved similarly to the proof in [35].

The steps for updating the online confidence-weighted bipartite ranking with the full covariance matrix or with its diagonal elements are summarized in Algorithm 2.

Algorithm 2. Update Ranker
Input:
• μ_t : current mean vector
• Σ_t or G_t : current covariance matrix or diagonal elements
• (x_t, y_t) : a training instance
• B : the buffer storing instances from the opposite class label
• C_t : class-specific weighting parameter
• η : the predefined probability
Output: updated ranker μ_{t+1} and Σ_{t+1} or G_{t+1}
Initialize: μ^1 = μ_t, (Σ^1 = Σ_t or G^1 = G_t), i = 1
for x ∈ B do
  Update the ranker (μ^i, Σ^i) with z = x_t − x and y_t by
    (μ^{i+1}, Σ^{i+1}) = argmin_{μ,Σ} D_{KL}(N(μ, Σ) || N(μ^i, Σ^i)) + C ℓ^φ(N(μ, Σ); (z, y_t))
  or
  Update the ranker (μ^i, G^i) with z = x_t − x and y_t by
    (μ^{i+1}, G^{i+1}) = argmin_{μ,G} D_{KL}(N(μ, G) || N(μ^i, G^i)) + C ℓ^φ(N(μ, G); (z, y_t))
  i = i + 1
end for
Return μ_{t+1} = μ^{|B|+1} and Σ_{t+1} = Σ^{|B|+1} or G_{t+1} = G^{|B|+1}

4 Experimental Results

In this section, we conduct extensive experiments on several real-world datasets to demonstrate the effectiveness of the proposed algorithms. We also compare the performance of our methods with existing online learning algorithms in terms of AUC and classification accuracy at the optimal operating point of the ROC curve (OPTROC).
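(For completeness, the UpdateBuffer component of Algorithm 1, described in Sect. 3.2, can be sketched in the same style. The list-based buffers and the explicit counter n_seen of how many instances of the class have arrived, on which the standard reservoir-sampling acceptance probability is based, are our own assumptions; this is not the authors' implementation.)

```python
import random

def update_buffer_rs(buffer, capacity, x_t, n_seen):
    """Reservoir Sampling policy: fill the buffer until capacity, then accept
    the new instance with probability capacity / n_seen, replacing a uniformly
    chosen stored instance. n_seen counts instances of this class seen so far,
    including x_t (this counter is our own bookkeeping assumption)."""
    if len(buffer) < capacity:
        buffer.append(x_t)
    elif random.random() < capacity / n_seen:
        buffer[random.randrange(capacity)] = x_t
    return buffer

def update_buffer_fifo(buffer, capacity, x_t):
    """First-In-First-Out policy: evict the oldest instance when the buffer is full."""
    if len(buffer) >= capacity:
        buffer.pop(0)   # drop the oldest stored instance
    buffer.append(x_t)
    return buffer
```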
The running time comparison is also presented. 4.1 Real World Datasets We conduct extensive experiments on various benchmark and high-dimensional datasets. All datasets can be downloaded from LibSVM1 and the machine learning repository UCI2 except the Reuters3 dataset that is used in [2]. If the data are provided as training and test sets, we combine them together in one set. For cod-rna data, only the training and validation sets are grouped together. For rcv1 and news20, we only use their training sets in our experiments. The multi-class datasets are transformed randomly into class-imbalanced binary datasets. Tables 1 and 2 show the characteristics of the benchmark and the high-dimensional datasets, respectively. Table 1. Benchmark datasets Data #inst #feat Data #inst glass 214 10 cod-rna 331,152 ionosphere 351 34 spambase 4,601 german 1,000 24 #feat Data 8 57 diabetes 768 acoustic 78,823 50 8 svmguide4 612 10 magic04 19,020 11 vehicle 846 18 svmguide3 1284 21 heart 270 13 segment 2,310 19 farm-ads rcv1 sector 3 14 581,012 54 Data 2 #feat covtype Table 2. High-dimensional datasets 1 #inst australian 690 #inst #feat 4,143 54,877 15,564 47,236 9,619 55,197 real-sim 72,309 20,958 news20 15,937 62,061 Reuters 8,293 18,933 https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/. https://archive.ics.uci.edu/ml/. http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html. 44 4.2 M. Khalid et al. Compared Methods and Model Selection Online Uni-Exp [21]: An online pointwise ranking algorithm that optimizes the weighted univariate exponential loss. The learning rate is tuned by 2-fold cross validation on the training set by searching in 2[−10:10] . OPAUC [15]: An online learning algorithm that optimizes the AUC in one-pass through square loss function. The learning rate is tuned by 2-fold cross validation by searching in 2[−10:10] , and the regularization hyperparameter is set to a small value 0.0001. OPAUCr [15]: A variation of OPAUC that approximates the covariance matrices using low-rank matrices. The model selection step is carried out similarly to OPAUC, while the value of rank τ is set to 50 as suggested in [15]. OAMseq [37]: The online AUC maximization (OAM) is the state-of-the-art firstorder learning method. We implement the algorithm with the Reservoir Sampling as a buffer updating scheme. The size of the positive and negative buffers is fixed at 50. The penalty hyperparameter C is tuned by 2-fold cross validation on the training set by searching in 2[−10:10] . AdaOAM [12]: This is a second-order AUC maximization method that adapts the classifier to the importance of features. The smooth hyperparameter δ is set to 0.5, and the regularization hyperparameter is set to 0.0001. The learning rate is tuned by 2-fold cross validation on the training set by searching in 2[−10:10] . CBRRS and CBRFIFO : The proposed confidence-weighted bipartite ranking algorithms with the Reservoir Sampling and First-In-First-Out buffer updating policies, respectively. The size of the positive and negative buffers is fixed at 50. The hyperparameter η is set to 0.7, and the penalty hyperparameter C is tuned by 2-fold cross validation by searching in 2[−10:10] . CBR-diagFIFO : The proposed diagonal variation of confidence-weighted bipartite ranking that uses the First-In-First-Out policy to update the buffer. The buffers are set to 50, and the hyperparameters are tuned similarly to CBRFIFO . For a fair comparison, the datasets are scaled similarly in all experiments. 
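All of the comparisons that follow report AUC together with classification accuracy at the optimal operating point of the ROC curve (OPTROC). A minimal sketch of computing both measures from ranker scores is given below using scikit-learn; reading the OPTROC point as the ROC threshold that maximizes accuracy is our own assumption, since the exact criterion is not restated in this section.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_optroc_accuracy(y_true, scores):
    """Return (AUC, accuracy at the ROC point that maximizes accuracy, threshold).

    y_true: labels in {+1, -1}; scores: real-valued ranker outputs w^T x.
    """
    auc = roc_auc_score(y_true, scores)
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == -1)
    # accuracy achievable at every candidate threshold along the ROC curve
    acc = (tpr * n_pos + (1.0 - fpr) * n_neg) / (n_pos + n_neg)
    best = int(np.argmax(acc))
    return auc, float(acc[best]), float(thresholds[best])
```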
We randomly divide each dataset into 5 folds, where 4 folds are used for training and one fold is used as a test set. For benchmark datasets, we randomly choose 8000 instances if the data exceeds this size. For high-dimensional datasets, we limit the sample size of the data to 2000 due to the high dimensionality of the data. The results on the benchmark and the high-dimensional datasets are averaged over 10 and 5 runs, respectively. A random permutation is performed on the datasets with each run. All experiments are conducted with Matlab 15 on a workstation computer with 8 × 2.6 G CPU and 32 GB memory. Confidence-Weighted Bipartite Ranking 4.3 45 Results on Benchmark Datasets The comparison in terms of AUC is shown in Table 3, while the comparison in terms of classification accuracy at OPTROC is shown in Table 4. The running time (in milliseconds) comparison is illustrated in Fig. 1. The results show the robust performance of the proposed methods CBRRS and CBRF IF O in terms of AUC and classification accuracy compared to other first and second-order online learning algorithms. We can observe that the improvement of the second-order methods such as OPAUC and AdaOAM over first-order method OAMseq is not reliable, while our CBR algorithms often outperform the OAMseq . Also, the proposed methods are faster than OAMseq , while they incur more running time compared to AdaOAM except with spambase, covtype, and acoustic datasets. The pointwise method online Uni-Exp maintains fastest running time, but at the expense of the AUC and classification accuracy. We also notice that the performance of CBRFIFO is often slightly better than CBRRS in terms of AUC, classification accuracy, and running time. Table 3. Comparison of AUC performance on benchmark datasets Data CBRRS 0.825 ± 0.043 glass ionosphere 0.950 ± 0.027 0.782 ± 0.024 german CBRFIFO Online Uni-Exp OPAUC OAMseq AdaOA 0.823 ± 0.046 0.714 ± 0.075 0.798 ± 0.061 0.805 ± 0.047 0.794 ± 0.061 0.913 ± 0.018 0.943 ± 0.026 0.946 ± 0.025 0.943 ± 0.029 0.780 ± 0.019 0.702 ± 0.032 0.736 ± 0.034 0.731 ± 0.028 0.770 ± 0.024 0.761 ± 0.053 0.951 ± 0.028 svmguide4 0.969 ± 0.013 0.974 ± 0.013 0.609 ± 0.096 0.733 ± 0.056 0.771 ± 0.063 svmguide3 0.755 ± 0.022 0.764 ± 0.036 0.701 ± 0.025 0.737 ± 0.029 0.705 ± 0.033 0.738 ± 0.033 cod-rna 0.983 ± 0.000 0.984 ± 0.000 0.928 ± 0.000 0.927 ± 0.001 0.951 ± 0.025 0.927 ± 0.000 spambase 0.941 ± 0.006 0.942 ± 0.006 0.866 ± 0.016 0.849 ± 0.020 0.897 ± 0.043 0.862 ± 0.011 covtype 0.816 ± 0.003 0.835 ± 0.001 0.705 ± 0.033 0.711 ± 0.041 0.737 ± 0.023 0.770 ± 0.010 magic04 0.799 ± 0.006 0.801 ± 0.006 0.759 ± 0.006 0.748 ± 0.033 0.757 ± 0.015 0.773 ± 0.006 heart 0.908 ± 0.019 0.909 ± 0.021 0.733 ± 0.039 0.788 ± 0.054 0.806 ± 0.059 0.799 ± 0.079 australian 0.883 ± 0.028 0.889 ± 0.019 0.710 ± 0.130 0.735 ± 0.138 0.765 ± 0.107 0.801 ± 0.037 diabetes 0.700 ± 0.021 0.707 ± 0.033 0.633 ± 0.036 0.667 ± 0.041 0.648 ± 0.040 0.675 ± 0.034 acoustic 0. 879 ± 0.006 0.892 ± 0.003 0.876 ± 0.003 0.878 ± 0.003 0.863 ± 0.011 0.882 ± 0.003 vehicle 0.846 ± 0.031 0.846 ± 0.034 0.711 ± 0.053 0.764 ± 0.073 0.761 ± 0.078 0.792 ± 0.049 segment 0.900 ± 0.013 0.689 ± 0.061 0.828 ± 0.024 0.812 ± 0.035 0.855 ± 0.008 4.4 0.903 ± 0.008 Results on High-Dimensional Datasets We study the performance of the proposed CBR-diagF IF O and compare it with online Uni-Exp, OPAUCr, and OAMseq that avoid constructing the full covariance matrix. 
Table 5 compares our method and the other online algorithms in terms of AUC, while Table 6 shows the classification accuracy at OPTROC. Figure 2 displays the running time (in milliseconds) comparison. The results show that the proposed method CBR-diagF IF O yields a better performance on both measures. We observe that the CBR-diagF IF O presents a competitive running time compared to its counterpart OAMseq as shown in Fig. 2. We can also see that the CBR-diagF IF O takes more running time compared to the OPAUCr. However, the CBR-diagF IF O achieves better AUC and 46 M. Khalid et al. Table 4. Comparison of classification accuracy at OPTROC on benchmark datasets Data CBRRS CBRFIFO Online Uni-Exp OPAUC OAMseq AdaOAM glass 0.813 ± 0.044 0.811 ± 0.049 0.732 ± 0.060 0.795 ± 0.046 0.788 ± 0.040 0.783 ± 0.047 ionosphere 0.946 ± 0.028 0.946 ± 0.022 0.902 ± 0.028 0.936 ± 0.018 0.943 ± 0.017 0.938 ± 0.018 german 0.780 ± 0.022 0.787 ± 0.019 0.741 ± 0.027 0.754 ± 0.022 0.751 ± 0.028 0.770 ± 0.030 svmguide4 0.951 ± 0.014 0.956 ± 0.012 0.829 ± 0.021 0.843 ± 0.024 0.839 ± 0.022 0.848 ± 0.020 svmguide3 0.784 ± 0.015 0.793 ± 0.016 0.784 ± 0.019 0.777 ± 0.024 0.780 ± 0.020 0.777 ± 0.024 cod-rna 0.948 ± 0.002 0.949 ± 0.000 0.887 ± 0.001 0.887 ± 0.001 0.910 ± 0.019 0.887 ± 0.001 0.898 ± 0.009 0.818 ± 0.019 0.795 ± 0.022 0.849 ± 0.053 0.809 ± 0.014 0.766 ± 0.003 0.672 ± 0.018 0.674 ± 0.021 0.685 ± 0.016 0.709 ± 0.008 0.899 ± 0.009 spambase 0.746 ± 0.005 covtype magic04 0.769 ± 0.011 0.771 ± 0.006 0.734 ± 0.007 0.731 ± 0.015 0.736 ± 0.013 0.752 ± 0.008 heart 0.883 ± 0.032 0.875 ± 0.026 0.716 ± 0.021 0.753 ± 0.038 0.777 ± 0.043 0.772 ± 0.053 0.711 ± 0.056 0.725 ± 0.070 0.742 ± 0.064 0.768 ± 0.036 0.683 ± 0.037 0.692 ± 0.040 0.694 ± 0.044 0.689 ± 0.040 australian 0.841 ± 0.023 diabetes 0.714 ± 0.029 acoustic 0. 844 ± 0.005 0.850 ± 0.003 0.840 ± 0.005 0.839 ± 0.002 0.832 ± 0.005 0.841 ± 0.003 vehicle 0.816 ± 0.018 0.814 ± 0.018 0.764 ± 0.027 0.797 ± 0.014 0.790 ± 0.029 0.805 ± 0.021 segment 0.838 ± 0.015 0.836 ± 0.008 0.691 ± 0.031 0.768 ± 0.027 0.755 ± 0.024 0.796 ± 0.014 10 4 10 3 10 2 10 1 10 0 0.842 ± 0.022 0.705 ± 0.032 running time (milliseconds) Online Uni-Exp OPAUC OAMseq 10 AdaOAM CBRRS CBRFIFO -1 glass ionosphere german svmguide4 svmguide3 cod-rna spambase covtype magic04 heart australian diabetes acoustic vehicle segment datasets Fig. 1. Running time (in milliseconds) of CBR and the other online learning algorithms on the benchmark datasets. The y-axis is displayed in log- scale. classification accuracy compared to the OPAUCr. The online Uni-Exp algorithm requires the least running time, but it presents lower AUC and classification accuracy compared to our method. Table 5. Comparison of AUC on high-dimensional datasets Data CBR-diagFIFO Online Uni-Exp OPAUCr OAMseq farm-ads 0.961 ± 0.004 0.942 ± 0.006 0.951 ± 0.004 0.952 ± 0.005 0.950 ± 0.007 0.927 ± 0.015 0.914 ± 0.016 0.945 ± 0.008 rcv1 sector 0.927 ± 0.009 0.846 ± 0.019 0.908 ± 0.013 0.857 ± 0.008 real-sim 0.982 ± 0.001 0.969 ± 0.003 0.975 ± 0.002 0.977 ± 0.001 news20 0.956 ± 0.003 0.939 ± 0.005 0.942 ± 0.006 0.944 ± 0.005 Reuters 0.993 ± 0.001 0.985 ± 0.003 0.988 ± 0.002 0.989 ± 0.003 Confidence-Weighted Bipartite Ranking 47 Table 6. 
Comparison of classification accuracy at OPTROC on high-dimensional datasets Data CBR-diagFIFO Online Uni-Exp OPAUCr OAMseq farm-ads 0.897 ± 0.007 0.872 ± 0.012 0.885 ± 0.008 0.882 ± 0.007 0.971 ± 0.001 0.967 ± 0.002 0.966 ± 0.003 0.970 ± 0.001 rcv1 sector 0.850 ± 0.012 0.772 ± 0.011 0.831 ± 0.015 0.776 ± 0.008 real-sim 0.939 ± 0.003 0.913 ± 0.005 0.926 ± 0.002 0.929 ± 0.001 news20 0.918 ± 0.005 0.895 ± 0.005 0.902 ± 0.009 0.907 ± 0.006 Reuters 0.971 ± 0.004 0.953 ± 0.006 0.961 ± 0.006 0.961 ± 0.006 10 6 Online Uni-Exp OPAUCr OAMseq running time (milliseconds) CBR-diag 10 5 10 4 10 3 10 2 farm-ads FIFO rcv1 sector real-sim news20 Reuters datasets Fig. 2. Running time (in milliseconds) of CBR-diagF IF O algorithm and the other online learning algorithms on the high-dimensional datasets. The y-axis is dis- played in log-scale. 5 Conclusions and Future Work In this paper, we proposed a linear online soft confidence-weighted bipartite ranking algorithm that maximizes the AUC metric via optimizing a pairwise loss function. The complexity of the pairwise loss function is mitigated in our algorithm by employing a finite buffer that is updated using one of the stream oblivious policies. We also develop a diagonal variation of the proposed confidence-weighted bipartite ranking algorithm to deal with high-dimensional data by maintaining only the diagonal elements of the covariance matrix instead of the full covariance matrix. The experimental results on several benchmark and high-dimensional datasets show that our algorithms yield a robust performance. The results also show that the proposed algorithms outperform the first and second-order AUC maximization methods on most of the datasets. As future work, we plan to conduct a theoretical analysis of the proposed method. We also aim to investigate the use of online feature selection [36] within our proposed framework to effectively handle high-dimensional data. 48 M. Khalid et al. References 1. Agarwal, S.: A study of the bipartite ranking problem in machine learning (2005) 2. Cai, D., He, X., Han, J.: Locally consistent concept factorization for document clustering. IEEE Trans. Knowl. Data Eng. 23(6), 902–913 (2011) 3. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Tracking the best hyperplane with a simple budget perceptron. Mach. Learn. 69(2–3), 143–167 (2007) 4. Cesa-Bianchi, N., Conconi, A., Gentile, C.: A second-order perceptron algorithm. SIAM J. Comput. 34(3), 640–668 (2005) 5. Chapelle, O., Keerthi, S.S.: Efficient algorithms for ranking with svms. Inf. Retrieval 13(3), 201–215 (2010) 6. Cortes, C., Mohri, M.: Auc optimization vs. error rate minimization. In: Advances in Neural Information Processing Systems 16(16), pp. 313–320 (2004) 7. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passiveaggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006) 8. Crammer, K., Dredze, M., Kulesza, A.: Multi-class confidence weighted algorithms. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 2, pp. 496–504. Association for Computational Linguistics (2009) 9. Crammer, K., Dredze, M., Pereira, F.: Confidence-weighted linear classification for text categorization. J. Mach. Learn. Res. 13(1), 1891–1926 (2012) 10. Crammer, K., Kulesza, A., Dredze, M.: Adaptive regularization of weight vectors. In: Advances in Neural Information Processing Systems, pp. 414–422 (2009) 11. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: a kernel-based perceptron on a budget. SIAM J. Comput. 
37(5), 1342–1372 (2008) 12. Ding, Y., Zhao, P., Hoi, S.C., Ong, Y.S.: An adaptive gradient method for online auc maximization. In: AAAI, pp. 2568–2574 (2015) 13. Dredze, M., Crammer, K., Pereira, F.: Confidence-weighted linear classification. In: Proceedings of the 25th International Conference on Machine Learning, pp. 264–271. ACM (2008) 14. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011) 15. Gao, W., Jin, R., Zhu, S., Zhou, Z.H.: One-pass auc optimization. In: ICML (3), pp. 906–914 (2013) 16. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology 143(1), 29–36 (1982) 17. Hu, J., Yang, H., King, I., Lyu, M.R., So, A.M.C.: Kernelized online imbalanced learning with fixed budgets. In: AAAI, pp. 2666–2672 (2015) 18. Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 377–384. ACM (2005) 19. Joachims, T.: Training linear svms in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226. ACM (2006) 20. Kar, P., Sriperumbudur, B.K., Jain, P., Karnick, H.: On the generalization ability of online learning algorithms for pairwise loss functions. In: ICML (3), pp. 441–449 (2013) 21. Kotlowski, W., Dembczynski, K.J., Huellermeier, E.: Bipartite ranking through minimization of univariate loss. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 1113–1120 (2011) Confidence-Weighted Bipartite Ranking 49 22. Kuo, T.M., Lee, C.P., Lin, C.J.: Large-scale kernel ranksvm. In: SDM, pp. 812–820. SIAM (2014) 23. Lee, C.P., Lin, C.B.: Large-scale linear ranksvm. Neural Comput. 26(4), 781–817 (2014) 24. Li, N., Jin, R., Zhou, Z.H.: Top rank optimization in linear time. In: Advances in Neural Information Processing Systems, pp. 1502–1510 (2014) 25. Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retrieval 3(3), 225–331 (2009) 26. Ma, J., Kulesza, A., Dredze, M., Crammer, K., Saul, L.K., Pereira, F.: Exploiting feature covariance in high-dimensional online learning. In: International Conference on Artificial Intelligence and Statistics, pp. 493–500 (2010) 27. Orabona, F., Crammer, K.: New adaptive algorithms for online classification. In: Advances in Neural Information Processing Systems, pp. 1840–1848 (2010) 28. Orabona, F., Keshet, J., Caputo, B.: Bounded kernel-based online learning. J. Mach. Learn. Res. 10, 2643–2666 (2009) 29. Rendle, S., Balby Marinho, L., Nanopoulos, A., Schmidt-Thieme, L.: Learning optimal ranking with tensor factorization for tag recommendation. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736. ACM (2009) 30. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958) 31. Sculley, D.: Large scale learning to rank. In: NIPS Workshop on Advances in Ranking, pp. 1–6 (2009) 32. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. (TOMS) 11(1), 37–57 (1985) 33. Wan, J., Wu, P., Hoi, S.C., Zhao, P., Gao, X., Wang, D., Zhang, Y., Li, J.: Online learning to rank for content-based image retrieval. In: Proceedings of the TwentyFourth International Joint Conference on Artificial Intelligence, IJCAI, pp. 2284– 2290 (2015) 34. 
Wang, J., Wan, J., Zhang, Y., Hoi, S.C.: Solar: Scalable online learning algorithms for ranking 35. Wang, J., Zhao, P., Hoi, S.C.: Exact soft confidence-weighted learning. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 121–128 (2012) 36. Wang, J., Zhao, P., Hoi, S.C., Jin, R.: Online feature selection and its applications. IEEE Trans. Knowl. Data Eng. 26(3), 698–710 (2014) 37. Zhao, P., Jin, R., Yang, T., Hoi, S.C.: Online auc maximization. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 233–240 (2011) 38. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent (2003) Mining Distinguishing Customer Focus Sets for Online Shopping Decision Support Lu Liu1 , Lei Duan1,2(B) , Hao Yang1 , Jyrki Nummenmaa3,4 , Guozhu Dong5 , and Pan Qin1 1 2 3 School of Computer Science, Sichuan University, Chengdu, China liulu1221@stu.scu.edu.cn, leiduan@scu.edu.cn, {hyang.cn,panqin.cn}@outlook.com West China School of Public Health, Sichuan University, Chengdu, China School of Information Sciences, University of Tampere, Tampere, Finland jyrki.nummenmaa@uta.fi 4 Sino-Finnish Centre, Tongji University, Shanghai, China 5 Department of Computer Science and Engineering, Wright State University, Dayton, USA guozhu.dong@wright.edu Abstract. With the development of e-commerce, online shopping becomes increasingly popular. Very often, online shopping customers read reviews written by other customers to compare similar items. However, the number of customer reviews is typically too large to look through in a reasonable amount of time. To extract information that can be used for online shopping decision support, this paper investigates a novel data mining problem of mining distinguishing customer focus sets from customer reviews. We demonstrate that this problem has many applications, and at the same time, is challenging. We present dFocusMiner, a mining method with various techniques that makes the mined results interpretable and user-friendly. Our experimental results on real world data sets verify the effectiveness and efficiency of our method. Keywords: Distinguishing support · Data mining 1 customer focus · Shopping decision Introduction As e-commerce provides a convenient and flexible way to purchase consumer products at prices that are often cheaper than purchases from traditional stores, more and more people now prefer online shopping to in-store shopping. Different from traditional in-store shopping where customers seldom exchange their opinions about the items they bought, most e-commerce web sites provide a platform This work was supported in part by NSFC 61572332, the Fundamental Research Funds for the Central Universities 2016SCU04A22, and the China Postdoctoral Science Foundation 2014M552371, 2016T90850. c Springer International Publishing AG 2016  J. Li et al. (Eds.): ADMA 2016, LNAI 10086, pp. 50–64, 2016. DOI: 10.1007/978-3-319-49586-6 4 Mining Distinguishing Customer Focus Sets for Online Shopping 51 for customers to post their reviews on those items. As Ghose and Ipeirotis [1] pointed out, the opinions expressed in online reviews are important factors that impact the sale of the items being reviewed. Typically, customer reviews can – provide more user-oriented descriptions of various items. For example, instead of presenting the exact weight of a laptop in kilograms, a customer may state that “it is convenient to travel with it”. – affect future customers’ shopping decisions. 
For example, products with good word-of-mouth tend to have bigger sales. As a result, most online shopping websites encourage customers to share their assessments on the shopping items they bought. Table 1 lists the number of customer reviews on Apple iPhone6 and Samsung Note4, the two most popular smart phones, at Amazon and Best Buy (two famous online shopping websites in North America) as well as at JD, Tmall and Suning (three famous online shopping websites in China). Table 1. Number of customer reviews on Apple iPhone6 and Samsung Note4 Product Number of customer reviews Amazon Best Buy JD Tmall Suning iPhone6 1620 Note4 1118 1466 992 370050 32551 51837 48903 1150 3653 From Table 1, we can see that (1) customer reviews are available at many popular online shopping websites, and (2) the number of customer reviews for a hot product is huge. Clearly, it is time consuming for customers to look through hundreds or thousands of reviews. In addition, for two similar shopping items, such as two smart phones or two hotels at the same scenic spot, the customer reviews can be quite similar. For example, the most frequent customer reviews on both Apple iPhone6 and Samsung Note4 talk about “ringtones large”, “nice look ”, “quick response”, etc. Intuitively, customers are more interested in the reviews describing distinguishing aspects, rather than the common aspects, on the items under comparison. For example, “big screen” is a comment that frequently occurs in the reviews of Note4 but infrequently occurs in those of iPhone6. This comment can be influential in helping customers who like big screen to select Note4. Based on the above, online shopping websites should provide more distinguishing information when comparing reviews on different items to help customers make shopping decisions. We note that existing recommendation methods are lacking on addressing the above important needs. This leads us to a novel data mining problem. We say that a customer description of a certain aspect of a given shopping item is a customer focus on the item; examples include “big screen”, “nice look ”. Given two shopping items, a 52 L. Liu et al. distinguishing customer focus set is a set of customer focuses that frequently occur in the reviews of one of the shopping item but infrequently occur in the reviews of the other shopping item. For example, “big screen”, “android system” are customer focuses that frequently occur in the reviews of Note4 but infrequently occur in the reviews of iPhone6 ; we call such a phrase set a distinguishing customer focus set. Mining distinguishing customer focus sets is an interesting problem with many useful applications. On one hand, it can provide more information to support customers’ shopping decision making. On the other hand, it is possible for product manufacturers to use the distinguishing strong points and weak points of their products, mined from customers reviews of their products compared against reviews of their rivals’ products. To tackle the problem of mining distinguishing customer focus sets, we need to address several technical challenges. First, we need to have a comprehensive yet complete way to represent customer focuses. Naturally, a customer focus should not only reflect the opinions of customers but should also be easy for customers to understand. In addition, considering a variety of expressions can arise for different products, different time, and different customers, and the total set of customer focuses should not be predefined. 
Second, we need to have an effective way to evaluate the similarities among candidate customer focuses, so that redundant customer focuses can be removed, the mining results can be more concise, and the algorithm can be more efficient. Third, we also need to find effective techniques to efficiently discover top distinguishing customer focus sets. This paper makes the following main contributions to mining distinguishing customer focus sets for shopping decision support: (1) introducing a novel data mining problem of top-k distinguishing customer focus set mining; (2) designing an effective yet flexible method for distinguishing customer focus set mining; (3) conducting extensive experiments using the online reviews from various types of e-commerce websites, to evaluate our distinguishing customer focus set mining method, and to demonstrate that the proposed method can find interesting customer focus sets and it is effective and efficient. The rest of the paper is organized as follows. We formulate the problem of distinguishing customer focus set mining in Sect. 2, and review related work in Sect. 3. In Sect. 4, we discuss the critical techniques of our method (called dFocus-Miner). We report a systematic empirical study in Sect. 5, and conclude the paper in Sect. 6. 2 Problem Definition We start with some preliminaries. Given a shopping item q, let R(q) be the set of customer reviews of q in a website. A customer focus is a description of a certain aspect of a given shopping item from the users’ perspective. For example, in the customer review “At the business center of the hotel we were treated extremely coldly and arrogantly”, the customer focus is about the “service” of the hotel’s business center; we consider “treated extremely coldly and arrogantly” as its customer focus. Mining Distinguishing Customer Focus Sets for Online Shopping 53 Given a customer review r ∈ R(q) and a customer focus f , if f is “semanti˙ For examcally” contained by r, then we say that r supports f , denoted by f r. ple, the review “We enjoyed the nightly turn down service as well as an exceptionally clean room and bathroom” supports the customer focus “neat room” since “neat” is semantically similar with “clean”. A customer focus set f s is a set of customer focuses. The support of f s on a shopping item q, denoted by Sup(f s, q), is defined by Eq. 1. Sup(f s, q) = ˙ |{r ∈ R(q) | ∀f ∈ f s, f r}| |R(q)| (1) Definition 1. Given a customer focus set f s and two shopping items q1 and q2 , the contrast score of f s from q2 to q1 , denoted by cScore(f s, q1 , q2 ), is cScore(f s, q1 , q2 ) = Sup(f s, q1 ) − Sup(f s, q2 ) (2) To select top-k distinguishing customer focus sets, we now define a total order on all customer focus sets. Definition 2. Given two shopping items, q1 and q2 , and two customer focus sets, f s and f s , we say f s has a higher precedence than f s (or f s precedes f s ) if: 1. cScore(f s, q1 , q2 ) > cScore(f s , q1 , q2 ), or 2. cScore(f s, q1 , q2 ) = cScore(f s , q1 , q2 ), but Sup(f s, q1 ) > Sup(f s , q1 ), 3. cScore(f s, q1 , q2 ) = cScore(f s , q1 , q2 ) and Sup(f s, q1 ) = Sup(f s , q1 ), but f s is lexically smaller than f s . Definition 3. Given an integer k, a query item q1 , and a reference query item q2 , the problem of mining top-k distinguishing customer focus sets is to find the customer focus sets with top-k precedence from R(q1 ) and R(q2 ). 3 Related Work Our study is related to previous studies on recommender systems, opinion mining, and contrast mining. 
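Before reviewing related work in detail, the definitions of Sect. 2 can be made concrete with a short sketch that computes Sup(fs, q) and cScore(fs, q1, q2) and orders candidate focus sets by the precedence of Definition 2. It assumes each review has already been mapped to the set of customer focuses it supports; all names and data structures are illustrative only.

```python
def support(focus_set, labeled_reviews):
    """Sup(fs, q) of Eq. (1): fraction of reviews that support every focus in fs.
    labeled_reviews is a list of sets, one set of supported focuses per review."""
    hits = sum(1 for labels in labeled_reviews if focus_set <= labels)
    return hits / len(labeled_reviews)

def cscore(focus_set, reviews_q1, reviews_q2):
    """cScore(fs, q1, q2) of Eq. (2): support on the query item minus support
    on the reference item."""
    return support(focus_set, reviews_q1) - support(focus_set, reviews_q2)

def top_k_by_precedence(candidates, reviews_q1, reviews_q2, k):
    """Rank candidate focus sets by the total order of Definition 2:
    higher cScore first, then higher Sup on q1, then the lexically smaller set."""
    keyed = [(-cscore(fs, reviews_q1, reviews_q2),
              -support(fs, reviews_q1),
              tuple(sorted(fs))) for fs in candidates]
    keyed.sort()
    return [set(entry[2]) for entry in keyed[:k]]
```

For example, with reviews_q1 = [{"big screen", "nice look"}, {"big screen"}] and reviews_q2 = [{"nice look"}], the singleton set {"big screen"} has Sup = 1.0 on q1, Sup = 0.0 on q2, and hence cScore = 1.0.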
3.1 Recommendation

Most studies on recommendation employ model-based collaborative filtering techniques for ranking [2,3]. Recently, location-based social networks have been widely used in POI recommendation [4], which is related to user behavior studies [5], event and activity recommendation [6], and online retail store placement [7]. Several studies explore customer reviews for prediction [8,9]; both rating information and user preferences can be integrated into review-based recommender systems [8]. Other studies include [10,11]. Different from studies on recommendation, the target of distinguishing customer focus set mining is to provide more information for shopping decision support rather than to recommend shopping items. Thus, methods for recommendation cannot be applied to our problem directly.

3.2 Opinion Mining

Opinion mining (also called sentiment analysis) tries to identify customers' opinions by applying NLP techniques to extract opinion-related nouns and adjectives. Technically, several approaches, including association-rule based [12] and model based [13], have been proposed to extract opinion words and opinion targets. The mining of user opinions is also useful for improving the performance of recommender systems [9,14]. A customer focus reflects the customer's assessment of certain aspects of a shopping item. Conceptually, the definition of customer focus is similar to that of customer opinion. However, the target of this study is discovering distinguishing customer focus sets that frequently occur in the set of reviews for one item but infrequently occur in the set of reviews of another item. Thus, our problem is significantly different from the problem of opinion mining. To the best of our knowledge, the work by Wang et al. [15], which summarizes product reviews by selecting the most representative review sentences, is the most closely related previous study to our paper. There are two essential differences between [15] and this study. First, [15] extracts representative sentences from reviews, while our method extracts customer focuses (quality phrases) from reviews. Second, the target of [15] is to summarize customer reviews, while the target of our method is to compare two sets of customer reviews.

3.3 Contrast Mining

Contrast mining discovers patterns and models that manifest significant differences between data sets. Dong and Bailey [16] presented a comprehensive review of contrast mining, together with a series of real-life applications. Several types of contrast patterns have been proposed, such as emerging patterns [17], contrast sets [18], and distinguishing sequential patterns [19]. To the best of our knowledge, there is no previous work on mining contrast patterns from customer reviews. It is possible to treat each review as a sequence in which each word is an element, so that distinguishing sequential patterns can be discovered. However, such patterns are hard for people to understand, since the previous methods [19,20] do not consider the comprehensibility of the discovered patterns. Thus, it is necessary to design a novel algorithm for the problem studied in this paper.

4 Design of dFocus-Miner

In this section, we present our method, dFocus-Miner, for mining top-k distinguishing customer focus sets from R(q1) and R(q2). The framework of dFocus-Miner includes three main steps: (i) generating candidate customer focuses (Sect.
4.1), (ii) extracting customer focuses from the candidates (Sect. 4.2), and Mining Distinguishing Customer Focus Sets for Online Shopping 55 (iii) evaluating the contrast scores of each customer focus set (Sect. 4.3). Below we discuss the most important techniques used in each step of dFocus-Miner, emphasizing the new ideas that make the results interpretable and user-friendly. 4.1 Candidate Customer Focus Generation As stated in Sect. 2, in our work a customer focus is a user comment on a certain aspect of an item. Intuitively, a customer focus should be: (i) representative, i.e., reflects the opinions of large number of customers; (ii) non-predefined, i.e., can be dynamically formed using unsupervised learning from customer reviews; (iii) comprehensible, i.e., it is easy for customers to understand. Observation 1. Customer reviews on a given shopping item contain phrases that reflect users’ concerns with respect to some aspects of the shopping item. Example 1. Given a user review “My wife loves this phone. Very easy to use. She loves the big screen size”. We can see that the user concerns include: “easy to use” and “big screen size”. By Observation 1, we extract “qualified” phrases from the customer reviews as the candidate customer focuses. We employ Quality Phrases Miner, the algorithm proposed by Liu et al. [21], to generate candidate customer focuses from the set of customer reviews. Previous studies, such as [21,22], verified that: – Quality Phrase Miner is efficient, effective and scalable for mining quality phrases. – Quality Phrase Miner extracts phrases by considering the requirements on frequency, concordance and completeness. Clearly, the performance of Quality Phrase Miner satisfies our requirements on generating candidate customer focuses. Given two query items q1 and q2 , the size of R(q1 ) may be unbalanced to that of R(q2 ). Thus, we apply Quality Phrase Miner to separately extract quality phrases from R(q1 ) and R(q2 ). 4.2 Customer Focus Selection Table 2 contains example quality phrases of reviews of a hotel in New York city. Clearly, the semantics of some phrases are similar. For example, “short walk ”, “minutes walk ” and “walking distance” are related to the hotel location. Such similar semantic phrases may cause a redundancy problem. To solve this problem, we use K-means clustering algorithm to remove the semantically redundant focuses. The similarity measure used in clustering is based on the path similarity of individual words, which is a well-known semantic distance defined for WordNet [23]. Given two words w and w , we denote the path similarity between w and w by P athSim(w, w ). As defined in [23], the range of P athSim(w, w ) is [0.0, 1.0]. Larger values of P athSim(w, w ) indicate higher degree of similarity between w 56 L. Liu et al. Table 2. Partial list of quality phrases of two products Cutomer reviews Quality phrases r1 short walk, comfy bed, studio suite r2 minutes walk, fully equipped, staff is nice r3 bedroom suite, grand central station r4 good location, walking distance and w . In practice, the phrases are rather short, and based on our experimental analysis, we propose the following similarity measure for candidate customer focuses (phrases). Given two candidate customer focuses f and f  , the similarity between f and f  , denoted by Sim(f, f  ), is defined by Eq. 3.   w∈f,w ∈f  P athSim(w, w )  (3) Sim(f, f ) = |f | × |f  | where w ∈ f means f contains w, and |f | is the number of words in f . 
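A sketch of the phrase similarity in Eq. (3), using the WordNet path similarity available in NLTK, might look as follows. The handling of words with several synsets (taking the best-matching pair) and of out-of-vocabulary words (similarity 0) are our own assumptions, since the paper only references the path similarity of [23].

```python
from nltk.corpus import wordnet as wn

def word_path_sim(w1, w2):
    """PathSim(w, w'): best WordNet path similarity over the synsets of the
    two words; 0.0 if either word is unknown to WordNet (our assumption)."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def phrase_sim(f1, f2):
    """Sim(f, f') of Eq. (3): sum of pairwise word similarities, normalized by
    the product of the two phrase lengths."""
    words1, words2 = f1.split(), f2.split()
    total = sum(word_path_sim(w1, w2) for w1 in words1 for w2 in words2)
    return total / (len(words1) * len(words2))
```

Under these assumptions, phrases built on closely related words (for example "short walk" and "walking distance") tend to score higher than unrelated phrases, which is exactly the kind of redundancy the clustering step is meant to remove.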
After clustering, we use the centers of clusters as customer focuses. For the sake of clarity, given a phrase f , we denote the center of the cluster that f belongs to by C(f ). Once the customer focuses are available, we use them to label the reviews. Specifically, for a given customer review r, we use a set of customer focuses, which are the closest to the review semantically, to represent it. The set of labels of r, denoted by L(r), is L(r) = {C(f ) | f is a phrase in r}. 4.3 (4) Mining Top-k Distinguishing Customer Focus Sets Given the representative customer focuses of two different products, we want to find k focus sets with biggest cScore values. Intuitively, we can apply an emerging pattern (EP) mining method for this task. Recall that an item-set is an EP if its support in one class is greater than α while the support in the other class is less than β [16]. Both α and β are parameters whose values are set by users. To the best of our knowledge, DPMiner [24] is the most efficient method to mine emerging patterns. Clearly, it is difficult for a user to set up proper α and β parameters. To solve this problem, we designed a top-k distinguishing frequent pattern mining method based on the above emerging pattern mining method. dFocus-Miner discovers the focus sets with top-k cScore. Specifically, let F S be the set of all candidate focus-sets, Sup1 (f s) is the number of reviews for the query item that contain focus set f s, and Sup2 (f s) the number of reviews for the query reference item that contain focus set f s. Then the mining result of dFocus-Miner is {f s ∈ F S | f s with top k precedence.} Mining Distinguishing Customer Focus Sets for Online Shopping 57 Algorithm 1. dFocus-Miner(q1 , q2 , k) Input: q1 , q2 : two shopping items to be compared; k: an integer Output: F : the set of top-k distinguishing customer focus sets 1: P1 ← quality phrases in R(q1 ); P2 ← quality phrases in R(q2 ); 2: P ← P1 ∪ P2 ; 3: F ← representative customer focuses in (P); 4: L1 ← ∅; L2 ← ∅; 5: for each review r ∈ R(q1 ) do 6: L1 ← L1 ∪ L(r); // L(r) is the set of customer focuses labels of r (Equation 4) 7: end for 8: for each review r ∈ R(q2 ) do 9: L2 ← L2 ∪ L(r); 10: end for 11: F ← topkEPMiner(L1 , L2 , k); 12: return F ; Due to the limited space, we only describe some important techniques of mining top-k distinguishing customer focus sets. Two pruning rules are used to improve the efficiency. ˙ Pruning Rule 1. For focus f , if there is no review r ∈ R(q1 ) satisfying f r, then any focus set f s containing f can be pruned. If focus f only occurs in reference query item R(q2 ), then any focus set f s containing f cannot be in top-k distinguishing customer focus sets. Pruning Rule 2. Let cScorek be the k-th largest cScore found. If focus f satisfies Sup(f, R(q1 )) < cScorek , then any focus set f s containing f can be pruned. If the support of a focus set is less than cScorek , none of its super-sets can satisfy the definition of top-k distinguishing customer focus sets. The whole framework of our Distinguishing Customer Focus Set Miner (dFocus-Miner) is presented in Algorithm 1. 5 Empirical Evaluation This section presents the results of our experimental studies; we have studied the effectiveness and efficiency of our proposed method (dFocus-Miner). All experiments were conducted on a PC with an Intel Core i7-4770 3.40 GHz CPU, and 16GB main memory. dFocus-Miner was implemented using Java and Python. 
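Because the case studies below rely on the top-k miner of Sect. 4.3, a compact sketch of that search is given here for reference. It is a level-wise enumeration that applies the two pruning rules; the size cap max_size and the list-of-sets review representation are our own simplifications, and this is not the authors' Java/Python implementation.

```python
def mine_top_k(focuses, reviews_q1, reviews_q2, k, max_size=3):
    """Level-wise search for top-k distinguishing customer focus sets, applying
    the two pruning rules of Sect. 4.3. reviews_q1 / reviews_q2 are lists of
    focus sets, one per review of the query and reference items."""
    def sup(fs, reviews):                       # Sup(fs, q) of Eq. (1)
        return sum(1 for labels in reviews if fs <= labels) / len(reviews)

    # Pruning Rule 1: a focus absent from R(q1) can never appear in a result.
    alive = sorted(f for f in focuses if any(f in labels for labels in reviews_q1))

    results = []                                # entries: (cScore, Sup on q1, focus set)
    frontier = [frozenset([f]) for f in alive]
    while frontier:
        next_frontier = set()
        for fs in frontier:
            sup1 = sup(fs, reviews_q1)
            kth = sorted(r[0] for r in results)[-k] if len(results) >= k else float("-inf")
            # Pruning Rule 2: cScore of fs and of any of its supersets is at most sup1,
            # so fs and its extensions can be skipped once sup1 falls below the k-th best.
            if sup1 < kth:
                continue
            results.append((sup1 - sup(fs, reviews_q2), sup1, fs))
            if len(fs) < max_size:              # grow fs by one focus for the next level
                next_frontier.update(fs | {f} for f in alive if f not in fs)
        frontier = list(next_frontier)

    # order by the precedence of Definition 2 and keep the top k
    results.sort(key=lambda r: (-r[0], -r[1], tuple(sorted(r[2]))))
    return results[:k]
```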
5.1 Case Study for Effectiveness Evaluation We use three real-world data sets crawled from different web sites containing reviews of hotels, movies and products, as summarized in Table 3. In Table 3, the items labeled ‘+’ are query items, while the ones labeled ‘−’ are query 58 L. Liu et al. Table 3. Statistics of data sets Category Item #customer reviews Hotel Lotte New York Palace (+) Dumont NYC (−) Movie Life of Pi (+) The Revenant (−) Laptop Dell Inspiron i7559-763BLK (+) ASUS K501UX (−) 822 359 747 1091 690 496 reference items. For every two sets of customer reviews from the same website, we apply dFocus-Miner to the reviews to find top-10 distinguishing customer focus sets (Table 4). In addition, for the sake of comparison, we list top-10 frequent + customer focuses of the two query items in the same website, denoted by F10 − and F10 , respectively, in Table 5. Table 4. Top-10 distinguishing customer focus sets Hotel Movie Laptop blocks from the empire state ang lee screen bleed barking dog restaurant human spirit coms battery chrysler building human spirit, ang lee gb ram studio suite richard parker single screw residential area suraj sharma screen bleed, brand new murray hill crouching tiger stick of ram minutes walk suraj sharma, ang lee sleep mode east river suraj sharma, human spirit sleep mode, ultra settings queen beds richard parker, ang lee ultra settings, brand new times square richard parker, human spirit sata bay, dell website TripAdvisor. First, we consider mining distinguishing customer focus sets from reviews on hotels provided by TripAdvisor (http://www.tripadvisor.com), which provides over 200 million supposedly unbiased traveler reviews of hotels, restaurants and vacation rentals. We compare Lotte New York Palace Hotel against Dumont NYC, as they were ranked 97 and 98 of 470 hotels, respectively, in New York city. Their recent prices are the same – $449 per night. Both of their average ratings from customers are the same – 4.5 out of 5. By Table 4, we observe that the interesting focuses mined by our method are meaningful from the user view point. For example, “studio suite” and “queen Mining Distinguishing Customer Focus Sets for Online Shopping 59 Table 5. Top-10 frequent customer focuses + F10 Hotel Movie Laptop empire state new york barking dog restaurant times square murray hill residential area corner suite chrysler building better than grand central life of pi ang lee crouching tiger richard parker suraj sharma pacific ocean visual poetry irfan khan yann martel named richard parker brand new ultra settings sleep mode stick of ram battery life discrete graphics number pad gb ram screen bleed display driver leonardo dicaprio’s robert redford hugh glass bear grylls g rard depardieu long takes using only natural light cinematic experience alejandro gonz zoo owner brand new light weight battery life stick of ram sleep mode number pad discrete graphics k lx drive bay keyboard backlight − F10 new york control panel front desk patrick’s cathedral grand central like royalty screen tv times square better than even though beds” are the meaningful distinguishing focuses about Lotte New York Palace. From the website, we can find that this hotel provides many types of suites that are popular. Thus, these two customer focuses are interpretable descriptions which are different from hotel introduction. 
Another discovered focus “blocks from the empire state” is also an interpretable quality phrase, since the hotel’s location is given by a landmark building instead of a detailed address. From Table 5, it is interesting to see that the phrases in top-10 distinguishing customer focus sets of Lotte New York Palace Hotel against Dumont NYC are different from the top-10 frequent customer focuses of Lotte New York Palace Hotel. For example, “new york” is frequent in the reviews of both Lotte New York Palace Hotel against Dumont NYC, thus it is not a result of dFocus-Miner. While we can see that “empire state” is a distinguishing characteristic of Lotte New York Palace Hotel. Clearly, dFocus-Miner can provide more information for users. IMDb. Second, we consider mining distinguishing customer focus sets from movie reviews provided by IMDb (http://www.imdb.com), which is the world’s most popular and authoritative source for movie information. We compare Life 60 L. Liu et al. of Pi against The Revenant, both of which are Oscar winning movies, and have similar ratings of 8 and 8.1 in IMDb. From Table 4, which lists the top-10 distinguishing customer focus sets identified by dFocus-Miner, we find that the top-10 distinguishing customer focus sets are mostly about the director (Ang Lee), the leading actor (Suraj Sharma) and the leading role (Richard Parker) of Life of Pi. Intuitively, the famous director and actors are main factors to attract people to watch a movie. Our proposed method can find these main factors for customers. Please note that the distin+ ) are very dissimilar to the distinguishing customer focus sets in Table 5 (F10 guishing customer focus sets in Table 4. For example, “life of pi”, a top frequent customer focus, is not in a top distinguishing customer focus set. The reason is that dFocus-Miner clusters the candidate customer focuses, and uses the centers as customer focuses to label each review. For the cluster, whose center is “life of pi”, there are several candidate customer focuses belonging to the reviews for The Revenant. Intuitively, the distinguishing customer focus sets listed in Table 4 offer more information highly related to the query-items. Amazon. Last, we apply dFocus-Miner to the reviews on electronic products provided by Amazon (http://www.amazon.com), which is a famous online shopping website. We compare a DELL laptop against to an ASUS laptop, since they have similar configurations in CPU, RAM, hard disc, and screen size, and the same price $799.99. The product information for these products can make the choice difficult for buyers who are not professional in computer. dFocus-Miner can help them to find distinguishing comments in comprehensible focuses. As listed in Table 4, dFocus-Miner finds some interesting distinguishing focus sets for the DELL laptop. For example, “screen bleed ” means that LCD products can get light leaking from the back. By this customer focus, it is better for users to check the screen if they bought a DELL laptop. Also many users mentioned “coms battery”, which indicates a quality issue. Moreover, from Table 5, we can see that these two laptops share several characteristics from users’ views, such as “brand new”, “stick of ram”, and “battery life”. However, dFocus-Miner can find distinguishing users’ concerns of the query item. Effect of k. 
We investigate how the mining results of dFocus-Miner (i.e., the distinguishing customer focus sets) change when the value of k (the number of top distinguishing customer focus sets) increases from 10 to 50 in steps of 10. Figure 1 illustrates the average cScore with respect to k. It is clear that the value of cScore decreases as k increases. As stated in Sects. 4.1 and 4.2, dFocus-Miner extracts "qualified" phrases from the customer reviews and clusters them to obtain customer focuses. Figure 2 illustrates the average length of a customer focus, i.e., the average number of words in a customer focus. It is interesting to see that as k increases, the average length changes only slightly. The reason is that the length of a quality phrase extracted by Quality Phrase Miner is typically between 2 and 3. Recall that the mining results of dFocus-Miner are the top distinguishing customer focus sets. From Fig. 3, we can see that the average size of the customer focus sets increases when k changes from 10 to 50, because when the value of k increases, more customer focus sets can be found by dFocus-Miner. This is consistent with intuition.

Fig. 1. Average of cScore of top-k distinguishing customer focus sets, for (a) TripAdvisor, (b) IMDb, and (c) Amazon (x-axis: k; y-axis: Avg(cScore)).

Fig. 2. Average length of customer focuses in top-k distinguishing customer focus sets, for (a) TripAdvisor, (b) IMDb, and (c) Amazon (x-axis: k; y-axis: AvgLen(F)).

Fig. 3. Average size of top-k distinguishing customer focus sets, for (a) TripAdvisor, (b) IMDb, and (c) Amazon (x-axis: k; y-axis: AvgSize(F)).

5.2 Efficiency Evaluation

We characterize the dFocus-Miner executions with three parameters: γ is the ratio of K-means clustering numbers – if we have m customer focuses, then K = γ · m; μ describes the proportion of the review data set being used, thus controlling the input set size; the third parameter is k for the top-k distinguishing customer focus sets. For our proposed method, the overall runtime includes crawling reviews from websites, generating customer focuses, selecting customer focuses as representative quality phrases, and mining the top-k distinguishing customer focus sets. The crawling time depends on factors such as data transfer speed and the amount of available data, and the second part is discussed in [21]; thus we only consider here the customer focus selection and the mining of distinguishing focus sets. We first use the three data sets described in Table 3 to evaluate how the runtime changes when k changes from 10 to 50 in steps of 10. Here, we use the full review sets, i.e., μ = 1.0. The results are shown in Fig. 4.

Fig. 4. Runtime of dFocus-Miner w.r.t. the ratio of K-means clustering numbers, for (a) TripAdvisor, (b) IMDb, and (c) Amazon (lines: γ = 1.0, 0.6, 0.2; x-axis: k; y-axis: runtime in seconds).
The three lines, representing γ values 0.2, 0.6, and 1.0, show that our method is not sensitive to the change of k. However, the clustering ratio clearly has an effect on the runtime, showing that our proposed method of selecting customer focuses not only removes redundant focuses but also reduces the runtime.

Fig. 5. Runtime of dFocus-Miner w.r.t. the number of reviews, for (a) TripAdvisor, (b) IMDb, and (c) Amazon (lines: μ = 1.0, 0.6, 0.2; x-axis: k; y-axis: runtime in seconds).

From Fig. 4, we see that the number of clusters influences the runtime of our proposed method, i.e., K is decided by the number of customer focuses. However, the number of customer focuses m is decided by the number of reviews. We randomly chose 20%, 60%, and 100% of the reviews for different test runs. Figure 5 shows the results. We can see that the runtime increases linearly with the input size. Both figures show that the pruning rules of our method for mining top-k distinguishing customer focus sets are efficient.

6 Conclusions

In this paper, we proposed and studied a new problem of mining distinguishing customer focus sets, which has wide applications and is useful for online shopping decision support. We designed an algorithm called dFocus-Miner to find such customer focuses. Our experiments on real-world data sets verify the effectiveness and efficiency of dFocus-Miner. As future work, we plan to introduce a domain-independent phrase collection method to reduce the number of candidate customer focuses, and to try other methods for generating customer focuses besides the clustering method used in this work. It is also interesting to introduce natural language processing methods to refine the reviews. Moreover, we will explore distributed computing techniques to improve the performance of dFocus-Miner. In addition, we will explore domain-driven methods for quality phrase mining to improve the effectiveness of dFocus-Miner in specific domains.

References

1. Ghose, A., Ipeirotis, P.G.: Designing novel review ranking systems: predicting the usefulness and impact of reviews. In: Proceedings of the 9th International Conference on Electronic Commerce: The Wireless World of Electronic Commerce, pp. 303–310 (2007)
2. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: Proceedings of the 8th IEEE International Conference on Data Mining, ICDM, pp. 263–272 (2008)
3. Koren, Y., Bell, R.M., Volinsky, C.: Matrix factorization techniques for recommender systems. IEEE Comput. 42(8), 30–37 (2009)
4. Li, X., Xu, G., Chen, E., Li, L.: MARS: a multi-aspect recommender system for point-of-interest. In: Proceedings of the 31st IEEE International Conference on Data Engineering, ICDE, pp. 1436–1439 (2015)
5. Zheng, V.W., Cao, B., Zheng, Y., Xie, X., Yang, Q.: Collaborative filtering meets mobile recommendation: a user-centered approach. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI (2010)
6. Zhang, W., Wang, J., Feng, W.: Combining latent factor model with location features for event-based group recommendation. In: Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining, KDD, pp. 910–918 (2013)
7. Karamshuk, D., Noulas, A., Scellato, S., Nicosia, V., Mascolo, C.: Geo-spotting: mining online location-based services for optimal retail store placement. In: Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining, KDD, pp.
793–801 (2013) 64 L. Liu et al. 8. Mukherjee, S., Basu, G., Joshi, S.: Incorporating author preference in sentiment rating prediction of reviews. In: Proceedings of the 22nd International World Wide Web Conference, WWW, pp. 47–48 (2013) 9. Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating regression approach. In: Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining, KDD, pp. 783–792 (2010) 10. Zhang, F., Zheng, K., Yuan, N.J., Xie, X., Chen, E., Zhou, X.: A novelty-seeking based dining recommender system. In: Proceedings of the 24th International Conference on World Wide Web, WWW, pp. 1362–1372 (2015) 11. Li, X., Xu, G., Chen, E., Li, L.: Learning user preferences across multiple aspects for merchant recommendation. In: Proceedings of the 15th IEEE International Conference on Data Mining, ICDM, pp. 865–870 (2015) 12. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: Proceedings of the 19th AAAI Conference on Artificial Intelligence, 16th Conference on Innovative Applications of Artificial Intelligence, AAAI, pp. 755–760 (2004) 13. Zhao, Q., Wang, H., Lv, P., Zhang, C.: A bootstrapping based refinement framework for mining opinion words and targets. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM, pp. 1995–1998 (2014) 14. McAuley, J.J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM Conference on Recommender Systems, pp. 165–172 (2013) 15. Wang, D., Zhu, S., Li, T.: Sumview: a web-based engine for summarizing product reviews and customer opinions. Expert Syst. Appl. 40(1), 27–33 (2013) 16. Dong, G., Bailey, J. (eds.): Contrast Data Mining: Concepts, Algorithms, and Applications. CRC Press (2012) 17. Dong, G., Li, J.: Efficient mining of emerging patterns: discovering trends and differences. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining, KDD, pp. 43–52 (1999) 18. Bay, S.D., Pazzani, M.J.: Detecting group differences: mining contrast sets. Data Min. Knowl. Disc. 5(3), 213–246 (2001) 19. Ji, X., Bailey, J., Dong, G.: Mining minimal distinguishing subsequence patterns with gap constraints. Knowl. Inf. Syst. 11(3), 259–286 (2007) 20. Yang, H., Duan, L., Dong, G., Nummenmaa, J., Tang, C., Li, X.: Mining itemsetbased distinguishing sequential patterns with gap constraint. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications, DASFAA, pp. 39–54 (2015) 21. Liu, J., Shang, J., Wang, C., Ren, X., Han, J.: Mining quality phrases from massive text corpora. In: Proceedings of the 36th ACM International Conference on Management of Data, SIGMOD, pp. 1729–1744 (2015) 22. El-Kishky, A., Song, Y., Wang, C., Voss, C.R., Han, J.: Scalable topical phrase mining from text corpora. PVLDB 8(3), 305–316 (2014) 23. Miller, G.A.: Wordnet: a lexical database for english. Commun. ACM 38(11), 39–41 (1995) 24. Li, J., Liu, G., Wong, L.: Mining statistically important equivalence classes and delta-discriminative emerging patterns. In: Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining, KDD, pp. 
Community Detection in Networks with Less Significant Community Structure

Ba-Dung Le, Hung Nguyen, and Hong Shen
School of Computer Science, The University of Adelaide, Adelaide 5005, Australia
{badung.le,hung.nguyen,hong.shen}@adelaide.edu.au

Abstract. Label propagation is a low-complexity approach to community detection in complex networks. Research has extended the basic label propagation algorithm (LPA) in multiple directions, including maximizing the modularity of the detected communities, a well-known quality function for evaluating the goodness of a community division. The current state-of-the-art modularity-specialized label propagation algorithm (LPAm+) maximizes modularity using a two-stage iterative procedure: the first stage assigns labels to nodes using label propagation; the second stage merges smaller communities to further improve modularity. LPAm+ has been shown to achieve excellent performance on networks with significant community structure, where the network modularity is above a certain threshold. However, we show in this paper that for networks with less significant community structure, LPAm+ tends to get trapped in local optima that are far from the global optimum. The main reason is that the first stage of LPAm+ often misplaces node labels and severely hinders the merging operation in the second stage. We overcome this drawback of LPAm+ by correcting the node labels after the first stage. We apply a label propagation procedure inspired by the meta-heuristic Record-to-Record Travel algorithm that reassigns node labels to improve modularity before merging communities. Experimental results show that the proposed algorithm, named meta-LPAm+, outperforms LPAm+ in terms of modularity on networks with less significant community structure while retaining almost the same performance on networks with significant community structure.

Keywords: Community detection · Label propagation · LPAm · meta-LPAm · LPAm+ · meta-LPAm+

1 Introduction

Complex networks often represent the network structure of complex systems in the real world [24]. Communities in complex networks are informally defined as groups of nodes where interactions between nodes in the same group are more frequent than interactions between nodes in different groups [25]. Revealing community structure in complex networks is important for understanding the organization of the networks [11].

Community detection has recently become an active research topic in various disciplines such as computer science, physics and sociology [7,26]. Among the existing community detection techniques, the label propagation algorithm (LPA) [27] is known as a time-efficient method for detecting communities. Extensive research has improved LPA in different aspects, including network modularity [1], real-time detection [17], overlapping community detection [10], and robustness [30]. The modularity-specialized label propagation algorithm (LPAm) [1] is an enhanced version of LPA that improves the quality of the detected community structure by propagating node labels so as to maximize network modularity, a well-known quality function to evaluate the goodness of a community division [22].
Advanced modularity-specialized label propagation (LPAm+) [18] is a further improvement of LPAm that iteratively combines LPAm with merging pairs of communities to increase network modularity. LPAm+ achieves excellent performance on networks with significant community structure, where the network modularity is above a certain threshold [18], which is practically set at 0.3 [20]. Networks with lower modularity, such as the dependency networks of software packages in [32], often have a less significant community structure, emerging from random connections between network entities. The performance of LPAm+ on networks with less significant community structure, which is the focus of this paper, is generally not well understood. Detecting less significant community structure is important, as it would help to predict the evolution of communities early.

We first show that LPAm+ tends to get trapped in poor local maxima on networks with less significant community structure. The poor local optimal solutions often contain misplaced nodes, which are placed in the same community as nodes belonging to different communities in the globally optimal solution. We address this issue in LPAm+ by employing a meta-heuristic based label propagation procedure before merging communities to adjust the misplaced nodes. Our main contribution is a novel way of using the meta-heuristic Record-to-Record Travel [6] to improve LPAm+ while keeping the overall complexity low. Experimental results show that the proposed algorithm, named meta-LPAm+, performs better, in terms of modularity, than LPAm+ on networks with less significant community structure while retaining almost the same performance on networks with significant community structure.

2 Background: LPA, LPAm and LPAm+

2.1 LPA

LPA [27] works by initially assigning a unique label to every node in the network. The algorithm then iteratively updates the labels of nodes in a random sequential order. At each iteration, each node updates its label to the label held by the maximum number of its neighbours. If a node has multiple candidate labels, one of them is chosen at random. The label updating rule for a node u in an undirected and unweighted network of n nodes can be mathematically specified as

$$l_u^{\mathrm{new}} = \operatorname*{argmax}_{l'_u} \sum_{v=1}^{n} A_{uv}\,\delta(l'_u, l_v),$$

where l_u^new is the new label to be assigned to node u, l'_u is a label of the neighbours of u, v is a node in the network, l_v is the label of node v, A_uv is the element of the adjacency matrix of the network representing the connection between node u and node v, and δ is the Kronecker delta function. The label updating process stops when node labels remain unchanged after an iteration. Communities are identified as groups of nodes holding the same labels. To ensure the convergence of the algorithm, node labels are updated asynchronously, which means that a node updates its label based on the labels in the previous iteration of some of its neighbours and the labels in the current iteration of the other neighbours. LPA has a near linear time complexity of O(m), where m is the number of edges in the network [27].

The objective function of LPA can be understood as finding a division of the network that maximizes the number of edges falling within communities [1]. The trivial solution of LPA is to assign every node in the network the same label. This illustrates a potential drawback of LPA: the detected community structure does not necessarily have any meaningful interpretation.
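To make the updating rule concrete, the following is a minimal Python sketch of asynchronous LPA (our own illustration, not the authors' code); `neighbors` is an assumed adjacency structure mapping each node to its neighbours, and ties are broken uniformly at random.

```python
import random
from collections import Counter

def lpa(neighbors, max_iters=100, seed=0):
    """Asynchronous label propagation (LPA) on an unweighted, undirected graph.

    neighbors: dict mapping node -> iterable of adjacent nodes.
    Returns a dict mapping node -> final label; nodes sharing a label form a community.
    """
    rng = random.Random(seed)
    labels = {u: u for u in neighbors}            # every node starts with a unique label
    nodes = list(neighbors)
    for _ in range(max_iters):
        rng.shuffle(nodes)                        # random sequential (asynchronous) order
        changed = False
        for u in nodes:
            counts = Counter(labels[v] for v in neighbors[u])
            if not counts:
                continue                          # isolated node keeps its own label
            top = max(counts.values())
            new_label = rng.choice([l for l, c in counts.items() if c == top])
            if new_label != labels[u]:
                labels[u] = new_label
                changed = True
        if not changed:                           # no label changed in a full sweep: stop
            break
    return labels
```

Communities are then read off as the groups of nodes holding the same label, exactly as described above.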
2.2 LPAm

LPAm [1] modifies the label updating rule of LPA to drive the solution toward the largest improvement of network modularity, a well-known quality measure of community structure [23]. Modularity measures the fraction of edges within communities minus the expected fraction of edges falling inside the same community division on a random network with the same node degrees [23]. The modularity of a network is defined as [23]

$$Q = \frac{1}{2m}\sum_{g}\sum_{i,j\in g}\left(A_{ij} - \frac{k_i k_j}{2m}\right),$$

where g is a community in the network, k_i and k_j are the degrees of node i and node j respectively, and m is the number of edges in the network.

Starting from an initial community division, LPAm performs a label propagation step at each iteration. The label propagation step updates the label of every node to the neighbour label that leads to the maximal increase in network modularity. Let l_u be the current label of a node u, and l'_u be the new label for node u after updating. Let g_u and g_u' denote the communities of nodes holding label l_u and l'_u respectively. The gain in modularity when updating the label of node u to l'_u is

$$\Delta Q = \frac{1}{2m}\left(\sum_{i,j\in g_{u'}+u}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr) + \sum_{i,j\in g_u-u}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr)\right) - \frac{1}{2m}\left(\sum_{i,j\in g_{u'}}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr) + \sum_{i,j\in g_u}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr)\right), \qquad (1)$$

where g_u − u denotes the community formed by removing node u from its community, and g_u' + u denotes the community formed by adding node u into community g_u'. We can rewrite Eq. 1 as

$$\Delta Q = \frac{1}{2m}\left(\sum_{i,j\in g_{u'}+u}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr) - \sum_{i,j\in g_{u'}}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr)\right) - \frac{1}{2m}\left(\sum_{i,j\in g_u}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr) - \sum_{i,j\in g_u-u}\Bigl(A_{ij}-\frac{k_i k_j}{2m}\Bigr)\right),$$

or equivalently,

$$\Delta Q = \frac{1}{m}\sum_{j\in g_{u'}}\Bigl(A_{uj}-\frac{k_u k_j}{2m}\Bigr) - \frac{1}{m}\sum_{j\in g_u-u}\Bigl(A_{uj}-\frac{k_u k_j}{2m}\Bigr). \qquad (2)$$

Since the second term of Eq. 2 remains unchanged for every choice of label l'_u, choosing the label l'_u that maximizes ΔQ is equivalent to choosing the label l'_u that maximizes the sum

$$\sum_{j\in g_{u'}}\Bigl(A_{uj}-\frac{k_u k_j}{2m}\Bigr). \qquad (3)$$

Therefore, the label updating rule of LPAm can be expressed as

$$l_u^{\mathrm{new}} = \operatorname*{argmax}_{l'_u}\sum_{v=1}^{n}\Bigl(A_{uv}-\frac{k_u k_v}{2m}\Bigr)\delta(l'_u, l_v). \qquad (4)$$

As LPAm is a greedy heuristic for finding an optimal community division with respect to network modularity, the algorithm can easily get trapped in local maxima of the modularity space [18].

2.3 LPAm+

LPAm+ [18] drives LPAm out of local maxima by iteratively combining LPAm with merging the pairs of communities that improve network modularity the most. LPAm+ utilizes the multistep greedy algorithm (MSG) in [29] to merge multiple pairs of communities simultaneously [18]. After each round of LPAm, two communities having labels l1 and l2 are merged if

$$(\Delta Q_{l_1\to l_2} > 0) \;\land\; \bigl[\,\nexists\, l : (\Delta Q_{l_1\to l} > \Delta Q_{l_1\to l_2}) \lor (\Delta Q_{l_2\to l} > \Delta Q_{l_1\to l_2})\,\bigr],$$

with ΔQ_{l1→l2} denoting the gain in modularity when the community having label l1 is merged with the community having label l2. The algorithm stops when no pair of communities can be merged to increase network modularity.

LPAm+ achieves excellent performance on networks with significant community structure, where the network modularity is above 0.3 [18]. The performance of LPAm+ on networks with less significant community structure has not been widely explored in the literature. We study the performance of LPAm+ on networks with modularity below 0.3 in the rest of this section.
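The sum in Eq. (3) can be maintained cheaply per community, which is what keeps LPAm near linear in practice. Below is a minimal Python sketch of the update rule in Eq. (4); it is our own illustration (not the authors' implementation) and assumes precomputed node degrees, a running total `deg_sum[l]` of the degrees of the nodes currently holding label l, and `two_m` equal to twice the number of edges, for a simple graph without self-loops.

```python
from collections import Counter

def lpam_update(u, labels, neighbors, degrees, deg_sum, two_m):
    """One LPAm update for node u (Eq. 4): pick the label l maximizing
    sum over nodes v holding l (v != u) of (A_uv - k_u * k_v / (2m))."""
    ku = degrees[u]
    links = Counter(labels[v] for v in neighbors[u])            # edges from u into each community
    best_label, best_score = labels[u], float("-inf")
    for l in set(links) | {labels[u]}:
        comm_deg = deg_sum[l] - (ku if labels[u] == l else 0)   # total degree of community l, excluding u
        score = links.get(l, 0) - ku * comm_deg / two_m         # the sum in Eq. (3)
        if score > best_score:
            best_label, best_score = l, score
    if best_label != labels[u]:                                 # move u and keep the cached degree totals consistent
        deg_sum[labels[u]] -= ku
        deg_sum[best_label] += ku
        labels[u] = best_label
        return True
    return False
```

A full LPAm sweep simply applies this update to every node in a random order until no label changes, mirroring the LPA loop above.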
Figure 1a illustrates a toy network with two intuitively divided communities (one community with nodes in white and the other community with nodes in dark colour), whose modularity is Q = 0.2117. Applying LPAm+ to this toy network, the first round of LPAm in LPAm+ results in a local maximum with modularity Q = 0.1097 (Fig. 1b), given the initial community division with every node in its own community and the node order {6, 0, 9, 5, 7, 11, 10, 1, 8, 2, 3, 4} for updating the labels. The first round of merging communities in LPAm+ further improves the modularity by merging the community labeled 'a' with the community labeled 'd', and the community labeled 'c' with the community labeled 'e', giving the new modularity Q = 0.1607 (Fig. 1c). Note that this modularity value is still significantly below the value of 0.2117 achieved in Fig. 1a.

LPAm+ gets trapped in the local maximum Q = 0.1607 mainly due to the initially misplaced nodes 0, 3, 6 and 9 in Fig. 1b. The first round of LPAm assigns these nodes to the same community as nodes that actually belong to different communities in the globally optimal solution. LPAm+ does not have any mechanism to correct this misplacement of node labels and cannot adjust the labels in any subsequent round of LPAm or community merging.

Fig. 1. A toy network consisting of two intuitively divided communities with modularity Q = 0.2117 (a). The first round of LPAm in LPAm+ results in a local maximum with modularity Q = 0.1097 (b). The first round of merging communities in LPAm+ merges the community labeled 'a' with the community labeled 'd', and the community labeled 'c' with the community labeled 'e', giving the new modularity Q = 0.1607 (c). LPAm+ gets trapped in this local maximum because the misplaced nodes 0, 3, 6 and 9 cannot be adjusted by any further round of LPAm or merging communities.

3 Meta-LPAm+

We propose a method to overcome the drawback of LPAm+ identified in the previous section by employing a meta-heuristic based label propagation procedure before merging communities. Our meta-heuristic based label propagation procedure searches for a better local maximum than that given by LPAm, in order to correct the misplaced nodes in the intermediate solutions produced by LPAm. Meta-heuristics [3] can find better solutions than simple heuristics by temporarily accepting a worse solution to explore the solution space more thoroughly.

The proposed meta-heuristic based label propagation procedure is inspired by the Record-to-Record Travel (RRT) algorithm [6], chosen for its balance between performance and computational complexity. The RRT algorithm is a variant of the Simulated Annealing (SA) algorithm [14] with a different mechanism for accepting a worse solution. RRT accepts a worse solution if the difference between the accepted solution and the best solution found so far is less than a specific threshold. RRT is reported to outperform SA with much lower running time for some optimization problems [6]. To the best of our knowledge, RRT has never been applied to the task of community detection.

To escape local maxima, we first perform a label propagation step that accepts a decrease in network modularity. In this label updating step, the label of every node is updated sequentially to a label of its neighbours if the decrease in network modularity is less than a predetermined threshold (DEVIATION). The decrease in network modularity is calculated as the difference between the current modularity and the highest network modularity found so far. After accepting a worse solution, the meta-heuristic based label propagation procedure performs a round of LPAm to quickly improve network modularity.
The process is repeated until the number of iterations without improvement in network modularity exceeds a predefined number of iterations (max_no_improve). The improved LPAm+ algorithm, named meta-LPAm+, is basically an iterative combination of the modularity-specialized label propagation algorithm LPAm in [1], the meta-heuristic based label propagation algorithm, named meta-LPAm, and the community pair merging algorithm MSG in [29]. The pseudocode of meta-LPAm and meta-LPAm+ is presented in Algorithms 1 and 2.

Figure 2 illustrates how meta-LPAm+ converges to the global maximum for the toy network in Fig. 1a. Meta-LPAm+ starts by applying the first round of LPAm to produce the local maximum in Fig. 1b. Meta-LPAm+ then applies the first round of meta-LPAm with the input parameters DEVIATION = 0.01 and max_no_improve = 50 to improve this local maximum by the following steps. First, the first round of meta-LPAm performs a label propagation step that assigns label 'e' to node 1 and label 'i' to node 2, decreasing the modularity by 0.0083 to 0.1014. The next round of LPAm embedded in meta-LPAm then reaches a better local maximum with modularity Q = 0.1811. The first round of merging communities in meta-LPAm+ merges the community labeled 'a' with the community labeled 'i', increasing the modularity to 0.2117. The algorithm stops at the globally maximal solution because the network modularity cannot be improved by any further round of LPAm, meta-LPAm or merging communities.

Algorithm 1. meta-LPAm
Input: An initial community division S, the threshold DEVIATION, and the maximum number of iterations without improvement in modularity max_no_improve
Output: The best found community division RECORD
1: Set RECORD = S and n_no_improve = 0
2: While n_no_improve < max_no_improve
3:   Update the label of each node sequentially to a label of its neighbours if
4:     the new network modularity Q_S' ≥ Q_RECORD − DEVIATION
5:   Maximize network modularity by LPAm
6:   Set n_no_improve = n_no_improve + 1
7:   If Q_S > Q_RECORD Then
8:     Set RECORD = S and n_no_improve = 0
9:   End If
10: End While
/* S' is the new community division to assign to S if the node label is updated */

Algorithm 2. meta-LPAm+
Input: The threshold DEVIATION and the maximum number of iterations without improvement in modularity max_no_improve for the embedded meta-LPAm algorithm
Output: The best found community division
1: Assign every node to its own community
2: Maximize network modularity by LPAm
3: Find a better local maximum by meta-LPAm
4: While ∃ a pair of communities (l1, l2) with ΔQ_{l1,l2} > 0
5:   Merge pairs of communities by MSG
6:   Maximize network modularity by LPAm
7:   Find a better local maximum by meta-LPAm
8: End While
/* ΔQ_{l1,l2} is the change in network modularity when merging the community labeled l1 and the community labeled l2 */

As with many other meta-heuristics, the values of the input parameters of meta-LPAm+, DEVIATION and max_no_improve, determine the trade-off between performance and complexity of the algorithm. Increasing DEVIATION and max_no_improve allows the algorithm to explore a larger solution space but requires more time to converge.
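To make the acceptance rule of Algorithm 1 concrete, the following is a minimal Python sketch of the RRT-style loop; it is our own illustration, and `modularity`, `perturb_labels` and `run_lpam` are assumed callbacks (a modularity evaluator, the deviation-bounded label propagation step of lines 3–4, and a greedy LPAm pass) rather than part of the paper's published code.

```python
import copy

def meta_lpam(state, modularity, perturb_labels, run_lpam,
              deviation=0.01, max_no_improve=50):
    """Record-to-Record-Travel-style label propagation (sketch of Algorithm 1).

    state: a mutable community division (e.g., a dict of node labels).
    Returns the best division found and its modularity.
    """
    record = copy.deepcopy(state)
    record_q = modularity(state)
    no_improve = 0
    while no_improve < max_no_improve:
        perturb_labels(state, record_q - deviation)   # accept moves while Q stays above RECORD - DEVIATION
        run_lpam(state)                               # climb back to a (possibly new) local maximum
        no_improve += 1
        q = modularity(state)
        if q > record_q:                              # new record: remember it and reset the counter
            record, record_q = copy.deepcopy(state), q
            no_improve = 0
    return record, record_q
```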
For example, while LPAm+ finds the globally optimal solution on the toy network in Fig. 1a in about 50 out of 100 runs with randomized initial solutions, the number of times that meta-LPAm+ finds the global maximum depends on the input parameter settings. Meta-LPAm+ finds the global maximum in about 70 out of 100 runs for DEVIATION = 0.01 and max_no_improve = 10. However, for DEVIATION = 0.1 and max_no_improve = 10, meta-LPAm+ finds the global maximum in almost all of the 100 runs. The input parameter max_no_improve has less effect on the results of meta-LPAm+, provided that it is set large enough for the algorithm to explore the neighbouring modularity space. After conducting a large number of experiments for many values of DEVIATION and max_no_improve, the best combination was found to be 0.01 and 50 respectively for large-scale networks with more than a thousand nodes.

Fig. 2. A pathway to the global maximum taken by meta-LPAm+ for the toy network in Fig. 1a. Given the community solution in Fig. 1b as the local maximum returned by the first round of LPAm in meta-LPAm+, the first round of meta-LPAm in meta-LPAm+ improves this local maximum by first performing a label propagation step that decreases the modularity by 0.0083 to 0.1014 (a), and then performing a round of LPAm to reach another local maximum with modularity Q = 0.1811 (b). The first round of merging communities in meta-LPAm+ results in the global maximum with modularity Q = 0.2117 (c).

4 Experimental Results

To demonstrate the effectiveness of our modification of LPAm+, we compared the performance of LPAm, LPAm+ and meta-LPAm+ on both synthetic and real-world networks. We used the standard Lancichinetti–Fortunato–Radicchi (LFR) benchmark [16] to generate the artificial networks for two network sizes, N = 1000 and N = 5000. The mixing parameter μ ranges from 0.1 to 0.8 to vary the modularity of the generated networks. The other network parameters, including the average node degree, the maximum node degree, the minimum community size and the maximum community size, are set to 20, 50, 20 and 100 respectively. The power-law exponents for the node degree sequence and the community size sequence are set to −2. As for real-world networks, we use commonly studied networks including the Zachary karate club network [31], the Lusseau dolphins network [19], the Political Books network [15], the American college football network [8], the Jazz musicians network [9], the C. elegans metabolic network [13], the University email network [12] and the Condmat2003 network [21]. For the network sizes, one can refer to [18].

Tables 1 and 2 show the average network modularity, denoted by Qavg, the maximum network modularity, denoted by Qmax, and the running time (in seconds), denoted by t, of LPAm, LPAm+ and meta-LPAm+ on the LFR benchmark networks. Table 3 shows the performance of LPAm, LPAm+ and meta-LPAm+ on the real-world networks. The experiments were performed on a desktop PC with an Intel Core i7 @ 3.40 GHz CPU and 8 GB RAM. Each data point is the average over 100 graph realizations.
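For readers who want to reproduce a comparable setup, one way to generate LFR graphs with roughly these parameters is through networkx, as sketched below (our own illustration, not the authors' generator); note that the paper's exponents of −2 correspond to tau1 = tau2 = 2 in the networkx convention, and that LFR generation may fail to converge for some parameter and seed combinations.

```python
import networkx as nx

def lfr_graph(n=1000, mu=0.7, seed=0):
    """Generate one LFR benchmark graph and report the modularity of its planted partition."""
    G = nx.LFR_benchmark_graph(
        n, tau1=2, tau2=2, mu=mu,
        average_degree=20, max_degree=50,
        min_community=20, max_community=100, seed=seed)
    # networkx stores each node's planted community as a frozenset in the 'community' attribute
    communities = {frozenset(G.nodes[v]["community"]) for v in G}
    q_planted = nx.algorithms.community.modularity(G, communities)
    return G, q_planted
```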
Table 1. Comparison of the performance of LPAm, LPAm+ and meta-LPAm+ on the LFR benchmark networks with N = 1000 (for each method: Qavg, Qmax, t(s)).

Networks   LPAm                    LPAm+                   meta-LPAm+
μ = 0.1    0.844, 0.860, 0.005     0.844, 0.860, 0.005     0.844, 0.860, 0.171
μ = 0.2    0.746, 0.762, 0.005     0.747, 0.762, 0.007     0.747, 0.762, 0.204
μ = 0.3    0.645, 0.663, 0.010     0.647, 0.663, 0.009     0.647, 0.663, 0.255
μ = 0.4    0.542, 0.559, 0.012     0.547, 0.559, 0.014     0.547, 0.561, 0.310
μ = 0.5    0.441, 0.462, 0.013     0.449, 0.462, 0.018     0.449, 0.462, 0.478
μ = 0.6    0.336, 0.360, 0.023     0.346, 0.364, 0.033     0.348, 0.364, 0.677
μ = 0.7    0.224, 0.241, 0.034     0.237, 0.258, 0.062     0.250, 0.265, 3.325
μ = 0.8    0.190, 0.204, 0.034     0.205, 0.212, 0.071     0.219, 0.226, 2.698

Table 2. Comparison of the performance of LPAm, LPAm+ and meta-LPAm+ on the LFR benchmark networks with N = 5000 (for each method: Qavg, Qmax, t(s)).

Networks   LPAm                    LPAm+                   meta-LPAm+
μ = 0.1    0.888, 0.890, 0.037     0.889, 0.890, 0.054     0.889, 0.890, 2.337
μ = 0.2    0.788, 0.791, 0.044     0.790, 0.791, 0.071     0.790, 0.791, 2.985
μ = 0.3    0.687, 0.691, 0.054     0.690, 0.692, 0.088     0.690, 0.692, 3.731
μ = 0.4    0.584, 0.591, 0.071     0.591, 0.593, 0.111     0.591, 0.593, 4.449
μ = 0.5    0.480, 0.489, 0.087     0.493, 0.495, 0.151     0.493, 0.495, 5.303
μ = 0.6    0.375, 0.385, 0.111     0.394, 0.396, 0.204     0.394, 0.396, 6.610
μ = 0.7    0.273, 0.281, 0.201     0.293, 0.298, 0.453     0.294, 0.298, 14.250
μ = 0.8    0.201, 0.207, 0.229     0.211, 0.215, 0.633     0.228, 0.232, 33.388

As can be seen from Tables 1 and 2, meta-LPAm+ can find solutions with higher modularity than LPAm+ on the LFR benchmark networks with μ from 0.7 and above, corresponding to networks with modularity below 0.3. The improvement in average network modularity of meta-LPAm+ over LPAm+ is as high as that of LPAm+ over LPAm, up to 6.5%, on the low-modularity networks, with an acceptable runtime of a few seconds. Meta-LPAm+ and LPAm+ result in almost the same network modularity on the LFR benchmark networks with μ < 0.7, where the network modularity is above 0.3, and on the real-world networks (Table 3). The results confirm that, by employing the meta-heuristic based label propagation procedure before merging communities, meta-LPAm+ can adjust misplaced nodes in community solutions, leading to increased network modularity on networks with less significant community structure.

Table 3. Comparison of the performance of LPAm, LPAm+ and meta-LPAm+ on the real-world networks (for each method: Qavg, Qmax, t(s)).

Network           LPAm                    LPAm+                   meta-LPAm+
Karate            0.351, 0.399, na        0.418, 0.420, na        0.418, 0.420, 0.005
Dolphins          0.495, 0.502, na        0.522, 0.528, na        0.526, 0.529, 0.012
Political books   0.494, 0.516, na        0.527, 0.527, na        0.527, 0.527, 0.033
Football          0.582, 0.603, na        0.604, 0.605, na        0.604, 0.605, 0.030
Jazz              0.435, 0.445, 0.003     0.444, 0.445, 0.005     0.445, 0.445, 0.075
C. elegans        0.378, 0.404, 0.005     0.443, 0.451, 0.016     0.444, 0.451, 0.630
Email             0.493, 0.537, 0.011     0.576, 0.581, 0.035     0.578, 0.582, 1.897
PGP               0.703, 0.718, 0.071     0.883, 0.885, 0.980     0.883, 0.885, 67.701

Meta-LPAm+ and LPAm+ achieve almost the same modularity on networks with significant community structure, since LPAm is less likely to be trapped in local maxima with many misplaced nodes on these networks. This property of LPAm has been observed in previous works [1,18]. The meta-heuristic based label propagation procedure is therefore more or less unnecessary on networks with significant community structure. However, it is worth applying meta-LPAm+ even on these networks, when no prior knowledge of the significance of the community structure is available, because the meta-heuristic based label propagation procedure generally requires only a few extra seconds of running time.
The total running time of meta-LPAm+ on the tested networks of up to about ten thousand nodes is often less than a minute, which is adequate for practical applications.

The computational complexity of meta-LPAm+ stems mainly from the complexity of the three components embedded in the algorithm: (1) the modularity-specialized label propagation algorithm LPAm, (2) the meta-heuristic based label propagation algorithm meta-LPAm, and (3) the community merging algorithm MSG. One round of LPAm in meta-LPAm+ has a computational complexity of rO(m), with r being the average number of label propagation steps required for the round of LPAm to converge [1]. One round of meta-LPAm in meta-LPAm+ takes a computational cost of l(O(m) + rO(m)), with l being the average number of times that the round of meta-LPAm accepts a worse solution. The combination of LPAm and meta-LPAm in meta-LPAm+ thus has a computational complexity of sO(m), with s = r + l(1 + r) being the average number of label propagation steps needed for the round of LPAm and meta-LPAm to stop iterating. The computational cost of one round of MSG in meta-LPAm+ is O(m log N) [29]. Therefore, the total computational cost of meta-LPAm+ is s1 O(m) + h(O(m log N) + s2 O(m)), where s1 is the average number of label propagation steps of the round of LPAm and meta-LPAm before the while loop in meta-LPAm+, s2 is the average number of label propagation steps of the rounds of LPAm and meta-LPAm inside the while loop, and h is the average number of iterations of the while loop.

Table 4 shows the values of s1, s2 and h when applying meta-LPAm+ on the LFR benchmark networks. According to the table, the values of s1 and s2 are almost stable for μ < 0.7, where the network modularity is well above 0.3, whereas the values of s1 and s2 are much larger for μ from 0.7 and above, where the network modularity is below 0.3. This indicates that LPAm produces community solutions with more misplaced nodes on the lower-modularity networks, leading to an increase in the number of iterations needed for the round of LPAm and meta-LPAm to converge.

Table 4. The values of s1, s2 and h when applying meta-LPAm+ on the LFR benchmark networks (for each size: s1, s2, h).

Networks   N = 1000                  N = 5000
μ = 0.1    154.60, 0, 0              157.60, 152.0, 1.60
μ = 0.2    156.10, 0, 0              158.60, 152.0, 2.10
μ = 0.3    156.70, 45.60, 0.30       159.10, 152.0, 2.20
μ = 0.4    159.20, 76.0, 0.50        166.30, 152.0, 2.50
μ = 0.5    162.90, 106.40, 0.70      170.900, 152.0, 2.90
μ = 0.6    212.60, 107.95, 0.80      262.20, 152.0, 3.40
μ = 0.7    1134.60, 492.0, 1.20      498.60, 267.44, 3.80
μ = 0.8    1605.90, 185.0, 0.40      2859.50, 444.50, 0.70

Table 5 reports the values of s1, s2 and h when applying meta-LPAm+ on the LFR benchmark networks of different sizes for μ = 0.7 and μ = 0.8, where
s1 and s2 are expected to be at their maximum among networks of the same size over the different values of μ. As can be seen from Table 5, s1 and s2 are essentially upper-bounded by N and by a small fraction (about 1/10) of N respectively. Therefore, we can safely set s1 = O(N) and s2 = O(N). The values of h are almost bounded by a small constant, and h can therefore be estimated as log N, the depth of the dendrogram describing the hierarchical decomposition of a network with a balanced hierarchical community structure into communities [18]. Therefore, the total computational complexity of meta-LPAm+ is O(mN) + log N (O(m log N) + O(mN)), or equivalently, O(mN log N).

Table 5. The values of s1, s2 and h when applying meta-LPAm+ on the LFR benchmark networks of different sizes for μ = 0.7 and μ = 0.8 (for each μ: s1, s2, h).

Networks    μ = 0.7                    μ = 0.8
N = 1000    1208.32, 395.67, 1.05      1535.29, 114.35, 0.18
N = 2000    998.69, 412.10, 2.58       2069.66, 98.35, 0.11
N = 3000    775.20, 324.07, 3.04       2313.99, 280.67, 0.34
N = 4000    654.94, 280.86, 3.42       2694.37, 364.51, 0.52
N = 5000    543.48, 266.60, 3.74       3317.09, 464.60, 0.77
N = 6000    503.91, 241.29, 4.09       3599.37, 650.20, 1.23
N = 7000    503.76, 230.22, 4.24       3674.05, 828.16, 1.54
N = 8000    512.91, 224.34, 4.39       4106.22, 874.32, 1.99
N = 9000    503.21, 219.22, 4.68       4479.50, 773.13, 2.20
N = 10000   499.88, 218.94, 4.84       4810.25, 793.28, 2.69

We also compare the performance of meta-LPAm+ with some existing modularity optimization algorithms. These algorithms use node label propagation, community merging, or a combination of the two to maximize network modularity. The compared algorithms include FastGreedy [4], which repeatedly merges communities in pairs, MSG-VM [29], which iteratively combines merging multiple pairs of communities at a time with node label propagation, and the Louvain method [2], which is basically a multilevel node label propagation in which each node at a level represents a community found at the previous level. Our implementation of MSG-VM follows [29] closely. The implementations of FastGreedy and the Louvain method can be found in the public network software library igraph [5].
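As a pointer for reproducing the baselines, the two reference methods can be invoked from python-igraph roughly as follows; this is a sketch of our own (the graph and the printed quantities are illustrative), using API calls we believe python-igraph provides.

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")                            # Zachary's karate club network [31]

fastgreedy = g.community_fastgreedy().as_clustering()     # FastGreedy [4]
louvain = g.community_multilevel()                        # Louvain method [2]

print("FastGreedy modularity:", fastgreedy.modularity)
print("Louvain modularity:   ", louvain.modularity)
```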
The source code of meta-LPAm+ can be downloaded at https://github.com/badungle/meta-LPAm plus.

Tables 6 and 7 show the average network modularity and the maximum network modularity found by meta-LPAm+, FastGreedy, MSG-VM and the Louvain method on the LFR benchmark networks with N = 1000 and N = 5000 respectively. Table 8 shows the performance of meta-LPAm+ and the compared algorithms on the real-world networks.

Table 6. Comparison of the performance of meta-LPAm+, FastGreedy, MSG-VM and the Louvain method on the LFR benchmark networks with N = 1000 (for each method: Qavg, Qmax).

Network    meta-LPAm+      FastGreedy      MSG-VM          Louvain
μ = 0.1    0.844, 0.860    0.833, 0.845    0.831, 0.846    0.844, 0.860
μ = 0.2    0.747, 0.762    0.707, 0.726    0.732, 0.748    0.747, 0.762
μ = 0.3    0.647, 0.663    0.584, 0.602    0.631, 0.646    0.647, 0.663
μ = 0.4    0.547, 0.561    0.467, 0.490    0.530, 0.543    0.547, 0.561
μ = 0.5    0.449, 0.462    0.356, 0.381    0.431, 0.445    0.449, 0.462
μ = 0.6    0.348, 0.364    0.260, 0.277    0.331, 0.346    0.347, 0.364
μ = 0.7    0.250, 0.265    0.202, 0.218    0.222, 0.238    0.231, 0.250
μ = 0.8    0.219, 0.226    0.189, 0.196    0.195, 0.203    0.198, 0.208

Table 7. Comparison of the performance of meta-LPAm+, FastGreedy, MSG-VM and the Louvain method on the LFR benchmark networks with N = 5000 (for each method: Qavg, Qmax).

Network    meta-LPAm+      FastGreedy      MSG-VM          Louvain
μ = 0.1    0.889, 0.890    0.875, 0.879    0.877, 0.879    0.889, 0.890
μ = 0.2    0.790, 0.791    0.745, 0.757    0.776, 0.780    0.790, 0.791
μ = 0.3    0.690, 0.692    0.626, 0.644    0.676, 0.680    0.690, 0.692
μ = 0.4    0.591, 0.593    0.514, 0.530    0.577, 0.581    0.591, 0.593
μ = 0.5    0.493, 0.495    0.403, 0.418    0.478, 0.481    0.493, 0.495
μ = 0.6    0.394, 0.396    0.297, 0.312    0.379, 0.383    0.394, 0.396
μ = 0.7    0.294, 0.298    0.220, 0.233    0.268, 0.277    0.291, 0.297
μ = 0.8    0.228, 0.232    0.188, 0.197    0.192, 0.197    0.206, 0.211

Table 8. Comparison of the performance of meta-LPAm+, FastGreedy, MSG-VM and the Louvain method on the real-world networks (for each method: Qavg, Qmax).

Network           meta-LPAm+      FastGreedy      MSG-VM          Louvain
Karate            0.415, 0.420    0.381, 0.381    0.401, 0.420    0.419, 0.419
Dolphins          0.525, 0.529    0.495, 0.495    0.522, 0.522    0.519, 0.519
Political books   0.527, 0.527    0.502, 0.502    0.520, 0.521    0.520, 0.520
Football          0.604, 0.605    0.550, 0.550    0.596, 0.596    0.605, 0.605
Jazz              0.445, 0.445    0.439, 0.439    0.444, 0.445    0.443, 0.443
C. elegans        0.445, 0.453    0.404, 0.404    0.439, 0.440    0.441, 0.441
Email             0.578, 0.582    0.500, 0.500    0.565, 0.556    0.543, 0.543
PGP               0.883, 0.885    0.853, 0.583    0.875, 0.875    0.883, 0.883

As can be seen from Tables 6 and 7, meta-LPAm+ detects communities with higher modularity than FastGreedy and MSG-VM on all the benchmark networks. Meta-LPAm+ results in almost the same modularity as the Louvain method on the LFR benchmark networks with μ < 0.7, where the network modularity is well above 0.3. However, meta-LPAm+ performs better than the Louvain method on the LFR benchmark networks with μ ≥ 0.7, where the network modularity is below 0.3. As for the real-world networks, meta-LPAm+ performs better than FastGreedy and MSG-VM on all of them. Meta-LPAm+ performs notably better than the Louvain method on the Dolphins network, the Political books network and the Email network, while the network modularity values found by meta-LPAm+ and the Louvain method are almost the same on the other real-world networks. It seems that, after the first round of node label propagation, meta-LPAm+ and the Louvain method produce more misplaced nodes in the Dolphins network, the Political books network and the Email network than in the other networks. However, meta-LPAm+ is able to further adjust the misplaced nodes to improve network modularity at the node level, while the Louvain method can only further improve network modularity at the community level.

5 Conclusions

The advanced modularity-specialized label propagation algorithm (LPAm+) addresses the local maxima issue of LPAm, a time-efficient modularity optimization method for detecting communities in complex networks, by iteratively combining LPAm with community merging. In this paper, we first demonstrate that LPAm+ achieves excellent performance on networks with significant community structure but can easily get trapped in poor local maxima with many misplaced nodes on networks with less significant community structure. We then present an improved modularity-specialized label propagation algorithm, named meta-LPAm+, which employs a meta-heuristic based label propagation procedure before merging communities to correct misplaced nodes in community solutions. The meta-heuristic based label propagation procedure is inspired by the Record-to-Record Travel algorithm, chosen for its balance between performance and complexity. The experimental results show that meta-LPAm+ finds community divisions with higher modularity than LPAm+ on networks with less significant community structure while producing almost the same community detection solutions as LPAm+ on networks with significant community structure. Meta-LPAm+ also outperforms the other modularity optimization methods, including FastGreedy, MSG-VM and the Louvain method, on networks with less significant community structure.

Meta-LPAm+, as a basic optimization method, can be applied to optimize other quality metrics of community structure such as the map equation [28]. Extending meta-LPAm+ to detect hierarchical community structures and overlapping communities in complex networks could be interesting future directions for research in community detection.

Acknowledgments. The authors would like to thank the maintainers and contributors of the igraph packages used in this research.
References

1. Barber, M.J., Clark, J.W.: Detecting network communities by propagating labels under constraints. Phys. Rev. E 80(2), 026129 (2009)
2. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008 (2008)
3. Blum, C., Roli, A.: Metaheuristics in combinatorial optimization: overview and conceptual comparison. ACM Comput. Surv. (CSUR) 35(3), 268–308 (2003)
4. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70(6), 066111 (2004)
5. Csardi, G., Nepusz, T.: The igraph software package for complex network research. InterJournal Complex Syst. 1695(5), 1–9 (2006)
6. Dueck, G.: New optimization heuristics: the great deluge algorithm and the record-to-record travel. J. Comput. Phys. 104(1), 86–92 (1993)
7. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
8. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
9. Gleiser, P.M., Danon, L.: Community structure in jazz. Adv. Complex Syst. 6(4), 565–573 (2003)
10. Gregory, S.: Finding overlapping communities in networks by label propagation. New J. Phys. 12(10), 103018 (2010)
11. Guimera, R., Amaral, L.A.N.: Cartography of complex networks: modules and universal roles. J. Stat. Mech. Theory Exp. 2005(2), P02001 (2005)
12. Guimera, R., Danon, L., Diaz-Guilera, A., Giralt, F., Arenas, A.: Self-similar community structure in a network of human interactions. Phys. Rev. E 68(6), 065103 (2003)
13. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabási, A.L.: The large-scale organization of metabolic networks. Nature 407(6804), 651–654 (2000)
14. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P., et al.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
15. Krebs, V.: A network of co-purchased books about US politics sold by the online bookseller amazon.com (2008). http://www.orgnet.com/
16. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
17. Leung, I.X., Hui, P., Lio, P., Crowcroft, J.: Towards real-time community detection in large networks. Phys. Rev. E 79(6), 066107 (2009)
18. Liu, X., Murata, T.: Advanced modularity-specialized label propagation algorithm for detecting communities in networks. Physica A Stat. Mech. Appl. 389(7), 1493–1500 (2010)
19. Lusseau, D., Schneider, K., Boisseau, O.J., Haase, P., Slooten, E., Dawson, S.M.: The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav. Ecol. Sociobiol. 54(4), 396–405 (2003)
20. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. 98(2), 404–409 (2001)
21. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69(6), 066133 (2004)
22. Newman, M.E.J.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
23. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
24. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)
25. Newman, M.E., Girvan, M.: Mixing patterns and community structure in networks. In: Pastor-Satorras, R., Rubi, M., Diaz-Guilera, A. (eds.) Statistical Mechanics of Complex Networks, pp. 66–87. Springer, Heidelberg (2003)
26. Porter, M.A., Onnela, J.P., Mucha, P.J.: Communities in networks. Not. AMS 56(9), 1082–1097 (2009)
27. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76(3), 036106 (2007)
28. Rosvall, M., Axelsson, D., Bergstrom, C.T.: The map equation. Eur. Phys. J. Spec. Top. 178(1), 13–23 (2009)
29. Schuetz, P., Caflisch, A.: Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement. Phys. Rev. E 77(4), 046112 (2008)
30. Šubelj, L., Bajec, M.: Robust network community detection using balanced propagation. Eur. Phys. J. B Condens. Matter Complex Syst. 81(3), 353–362 (2011)
31. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473 (1977)
32. Zanetti, M.S., Schweitzer, F.: A network perspective on software modularity. In: ARCS Workshops (ARCS) 2012, pp. 1–8. IEEE (2012)

Prediction-Based, Prioritized Market-Share Insight Extraction

Renato Keshet, Alina Maor, and George Kour
Hewlett Packard Labs, Guthwirth Park, Technion, 32000 Haifa, Israel
{renato.keshet,george.kour}@hpe.com, alina.maor@gmail.com
http://www.labs.hpe.com/

Abstract. We present an approach for Business Intelligence (BI) in which market-share changes are tracked, evaluated, and prioritized dynamically and interactively. Out of all the hundreds or thousands of possible combinations of sub-markets and players, the system brings to the user those combinations where the most significant changes have happened, grouped into related insights. Time-series prediction and user interaction enable the system to learn what "significant" means to the user, and to adapt the results accordingly. The proposed approach captures key insights that are missed by current top-down aggregative BI systems and that are hard for humans to spot (e.g., Cisco's US market disruption in 2010).

1 Introduction

Business Intelligence (BI) is about providing decision makers with the critical information for running their businesses. Traditional BI systems typically provide reports, which list main summaries and business insights, and/or dashboards, which show aggregated indicators about the business "health." These summaries and aggregations are key for informed decisions.

There are often hundreds or thousands of ways one can aggregate events to look for significant indicators. For instance, if a sales table contains columns such as product, country, region, category, and vendor, then one could aggregate products by countries, or categories by regions, or regions by products AND categories, etc., all combinations being possible. In this paper, we refer to each such aggregation as a "player" in a "market"; e.g., if one checks the aggregation of a product "milk" in "Canada", we call milk the "player" and Canada the "market". Notice that "Canada" could also be a player in the "milk" market.

What analysts usually do is analyze players in a pre-selected, small number of markets. Each widget in a dashboard is typically a graph of the performance of players in a particular market. The pre-selected players and markets are typically high-level (e.g., products "world-wide", in "Americas", or in "Europe", etc.; or regions for "Meat", or for "Milk", etc.).
At each point in time, the user visually inspects one or a few markets, and if they see something interesting (strange, anomalous) there, they dig down into more specific markets, until they discover the root cause. This approach is known as Online Analytical Processing (OLAP) [1]. It is semi-manual in the sense that the aggregations are done automatically by the OLAP engine, but the choices of players and markets are made by the analyst. The space of all possible attribute combinations is called the OLAP cube, and it can be explored with the aid of operations such as "slice", "dice", "drill-down", "roll-up", and "pivot" [2]. BI vendors usually provide various ranges of dashboarding, reporting and OLAP capabilities.

One of the difficulties with traditional OLAP is that its top-down, semi-manual nature often leads to analysts missing critical insights, because key events may be diluted or canceled out when aggregated. As a consequence, these events "fly under the radar" and are missed. This usually leads to costly, inaccurate decisions. Another difficulty is that the analyst needs to manually explore the different markets, by actively slicing and dicing, and thus navigating the maze of markets and sub-markets. It is not feasible to manually check all possible players in all possible markets for information. Some works propose the use of analytics to help the user perform OLAP exploration. In [3,4], bottom-up approaches provide visual clues to help the analyst choose where to drill down. However, the solution still requires active guidance from the user.

What we propose is an approach and a computer system for mining market-share¹ data and providing the major changes/trends that are relevant to the user. Our analysis engine automatically performs all possible slicing and dicing of the OLAP cube, analyzes the market-share behavior in each of its "faces", and returns a prioritized list of anomalous players in specific markets. The main challenge is in computing a "significance score" for each player in each market that can be used to return relevant insights to the analyst. The process is completely automatic, so the user does not have to explore the OLAP cube, as the engine digs deep into its faces and mines the important information. The system is nevertheless interactive, enabling the user to choose the attributes for analysis and to tune relevance weights for chosen markets and/or players.

Differently from previous solutions, our approach analyzes player "shares" (rather than "values") in each market, and performs the analysis based on time-series predictors. The use of market shares is advantageous because it (1) is often what interests the analyst, and (2) neutralizes player oscillations in oscillating markets. The time-series analysis enables us to provide the user with significant, unpredicted player changes period over period. Analyzing market-share series requires non-standard modeling, as their samples are not normally distributed.

Our paper is organized as follows. In Sect. 2, time-series predictors are briefly reviewed, and in Sect. 3 our solution is described. In Sect. 4, we dive into the analytics of the prediction-based "surprise factor" that is used in the computation of the "significance score". Section 5 is devoted to the experiments performed and their results, and Sect.
6 provides the conclusion.

¹ The share of a player in a market is the ratio between the volume of that player in that market and the total volume of that market.

2 Review of Time-Series Predictors

One can find many approaches for sequence prediction in the literature. In financial data analysis, the simplest predictor of a certain value is the zero-order hold predictor, in which the prediction of a sample is equal to the previous one. More advanced, but still fairly simple, techniques, which work especially well for short time series, are linear regression, geometric, and exponential predictors. In all those cases, the forecast is a combination of weighted past samples. The weight coefficients themselves are either fixed (e.g., the predicted value is twice the last seen value minus the value seen a time-unit before) or learned from the time series' past. With linear regression, one assumes that the series has a context-related, fixed slope, which may then be learned empirically. With geometric prediction, the weight coefficients are based on the average growth rate, whereas in the exponential predictor case the prediction is a fixed linear combination of the previous sample and the previous prediction.

The exponential predictor can be seen as a simplified version of the autoregressive (AR) predictor, which constructs the forecast of the next values as a combination of a few past values (context), learning the length of the context and the weights that should be given to each context sample from the entire sequence of past values. In turn, the AR model is a component of a more generalized model: the Auto-Regressive Integrated Moving Average [6,7] (ARIMA). The ARIMA(p, d, q) model constructs the forecast of the next value as a linear combination of the last p input values and the last q prediction errors, operating on d recurrent differences between the series values. The model topology (the p, q, and d parameters) is selected for each input series according to a given model selection criterion, the most common being the Akaike Information Criterion (see [8]). ARIMA-based schemes are considered fairly robust and strong for financial analysis, particularly for quarterly forecasts [9,10]. However, ARIMA may be ineffective for short sequences, due to the training time and additional preprocessing effort, the length of the minimal required training context, the variety of optimization techniques, and mismatched criteria for selecting the best model.

3 Our Solution

Our solution is based on two elements: (1) a prediction-based analytics engine that mines the data, and (2) an interactive interface that empowers the user to determine what is relevant. We have integrated these into an interactive market-share analytics tool, which checks all market events, in all existing markets, detects the most significant market-share changes, prioritizes them, groups related insights together, and returns the corresponding visual graphs.

3.1 Analytics Engine

Consider a table of sales; each row is a deal, and each column is an attribute associated with that deal. Suppose at least one of those attributes is a "time period" in which the deal occurred (e.g., week, month, quarter, etc.) – let A_t be that attribute, and T = {t_1, t_2, . . .} be the different data values for A_t. At least one additional attribute is a "deal value" which is used to measure market shares (e.g., revenue, number of units, etc.) – let A_v be that attribute.
Let 𝒜 be the collection of all the attributes that are neither the time period nor the value; we call those the "market attributes". If A ∈ 𝒜 is a market attribute, then Val(A) is the set of all values that attribute takes in the data. E.g., if A is "country", then "USA" could be an element of Val(A). Let a "market type" M = {A_1, A_2, . . .} be any subset of 𝒜, and a "market" m be any instance {A_1 = a_1, A_2 = a_2, . . .}, where a_i ∈ Val(A_i), ∀i. Let P be a "player type" in market type M if M ⊂ P ⊆ 𝒜, and a player p in market m is of the form {A_1 = a_1, A_2 = a_2, . . .}, where m ⊂ p. E.g., m = {"country" = "USA"} and p = {"country" = "USA", "OS" = "Windows"}. In practice, we say that p is the player "Windows" in the market "USA". Notice that sometimes "USA" can be a market, sometimes a player in some market, all combinations being possible.

Now, at a time t ∈ T, the volume of market m, denoted V_t(m), is the sum of all the values in column A_v for the deals in that market; the player volume is defined similarly. We call the ratio z_t(p|m) = V_t(p)/V_t(m) the market share of p in m at time t.

For every possible market m, every possible player p within m, and every time t, we compute a "significance" score ρ_t(p, m) for the associated market-share change. This significance score is of the form

$$\rho_t(p, m) = S_t(p|m)\cdot R_t(p|m)\cdot W_{p,m}, \qquad (1)$$

where:

– S_t(p|m) is a "surprise factor". This indicates how much the engine "is surprised" by the occurrence of the market share z_t(p|m), given all the knowledge it has about previous values of that player's share in that market, and of other players' shares in the same market. For instance, if a value z_t(p|m) is significantly different from all the values in {z_t(p|m)}t