Outlier Detection from Time-series Data Stream by Leveraging Change Detection

Published: 2021-06-28 01:05:04

Category: Computer sciences

Type of paper: Essay

Motivation:

Outlier detection is one of the most active areas of data mining. Its applications include intrusion detection, medical anomaly detection, and sensor anomaly detection. Detecting outliers is especially challenging in newer data types such as data streams, spatio-temporal data, and time-series data, and effective and efficient methods are needed to meet these challenges.

Identifying and analyzing outliers in a given time series is important in many applications, because peaks are useful topological features of a time series. In power distribution data, peaks indicate sudden high demand. In server CPU utilization data, peaks indicate a sharp increase in workload. In network data, peaks correspond to bursts in traffic. In financial data, peaks indicate an abrupt rise in price or volume.

Outlier detection has long been used to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behavior, fraudulent behavior, instrument error, and similar causes. In this paper, we propose a method to identify such errors and remove their contaminating effect on the data set, thereby purifying the data for processing. Earlier outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. We also survey contemporary techniques for outlier detection, identify their respective motivations, and distinguish their advantages and disadvantages in a comparative review.

Outlier or anomaly detection is a general challenge for computer science, and it raises problems that are hard to solve. In large systems, outlier detection is very important and strongly affects the system as a whole. There are some effective algorithms that detect anomalies without causing any change in the system, but there is still a need for faster and more cost-effective algorithms for this problem.
So, I think that developing a system that can detect anomalies or outliers effectively and correctly in a short period of time will be very helpful for the field of data mining. I find this topic not only challenging but also extremely interesting and useful as a subject for my research.

Research Proposal:

Introduction:

Outlier detection is one of the most interesting areas in the context of data mining and knowledge discovery. It is also referred to as anomaly detection, event detection, novelty detection, deviant discovery, fault detection, intrusion detection, or misuse detection [GGAH14]. A subtle difference between the definitions of outlier and anomaly is noted in [Agg13b, p. 4]: "outlier refers to a data point, which could either be considered an abnormality or noise, whereas an anomaly refers to a special kind of outlier, which is of interest to an analyst."

Figure 1: The spectrum from normal data to outliers [Agg13b]

Here, we use the terms outlier and anomaly interchangeably. Some well-established definitions of outliers are:

"An outlying observation or 'outlier' is one that appears to deviate markedly from other members of the sample in which it occurs." [Gru69]

"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." [Haw80]

"an observation (or a set of observations) which appears to be inconsistent with the remainder of that set of data" [BL94]

These seemingly vague definitions cover a broad spectrum of outliers, which leaves room to define outliers differently in different application domains. Outlier detection is therefore the process of effectively detecting outliers according to the particular definition adopted, and a general-purpose outlier detection technique is highly unlikely to exist.

Several books provide an extensive overview of this field. [HKP11, Ch. 12] and [Agg15, Ch. 9-10] provide a broad overview of outlier detection.
The most comprehensive book on outlier detection, however, is [Agg13b]. There are also several excellent surveys in the literature, such as [HA04, CBK09, KKZ09]. Some surveys focus on a particular domain. [ZMH10, MSME15] cover outlier detection methods for wireless sensor networks (WSN).

Figure 2: Taxonomy of outlier detection in WSN [ZMH10]

[CBK12] covers topics related to discrete sequences. [SG14] presents the research issues of outlier detection for data streams. For temporal and time-series data, [Fu11, EA12, GGAH14] provide a detailed overview of the topic.

Figure 3: Taxonomy of outlier detection in temporal data [GGAH14]

Moreover, [Gam10, Ch. 11] and [Agg13b, Ch. 8] provide an overview of outlier detection for time-series data streams.

In general, outlier detection techniques can be categorized into several groups: (i) statistical methods; (ii) nearest neighbor methods; (iii) classification methods; (iv) clustering methods; (v) information theoretic methods; and (vi) spectral decomposition methods [CBK09, ZMH10]. On the other hand, [KKZ09] categorizes outlier detection techniques into (i) statistical tests; (ii) depth-based methods; (iii) deviation-based methods; (iv) distance-based methods; (v) density-based methods; and (vi) high-dimensional methods. Each method has its strengths and weaknesses, and the choice of method largely depends on the application domain.

An anomaly detection problem has been identified as having four main aspects [CBK09]. Firstly, the nature of the data, such as univariate vs. multivariate or discrete vs. continuous. Secondly, depending on the availability of data labels, the problem can be treated with a supervised, semi-supervised, or unsupervised method. Thirdly, anomalies are divided into three types: point, contextual, and collective. Recently, a new type called the contextual collective anomaly has been proposed in [JZXL14]. Finally, the output of an anomaly detection method is generated as scores or labels.
Recently, the research direction of outlier detection has been moving towards "Outlier Ensembles", following the influential paper of the same title by Charu Aggarwal [Agg13c]. Moreover, [ZCS14] has extended the research issues for outlier ensembles with a focus on unsupervised methods. [MMA14] emphasizes using techniques from both supervised and unsupervised approaches to leverage the idea of outlier ensembles.

Literature Review:

Data Stream vs. Time-series:

Data streams have brought a new kind of setting to computing: processing a stream of data as opposed to static, multiple-access data. Data streams are temporally ordered, fast changing, and potentially infinite. Sources of data streams include wireless sensor network traffic, telecommunications, on-line transactions in the financial market or retail industry, web click streams, video surveillance, and weather or environment monitoring. Because such data are potentially unbounded and cannot be stored in their entirety in a data repository, effective and efficient management and online analysis of data streams bring new challenges. Knowledge discovery from data streams is a broad topic covered in several books, such as [Agg07, Gam10], [LRU14, Ch. 4], and [Agg15, Ch. 12]. As sensor data is one of the sources of data streams, extensive analysis from this perspective can be found in [GGO+08, Agg13a].

In many application domains, a data stream includes a temporal attribute, where each data point carries either an implicit or an explicit timestamp. Real-time sensor data, medical data, and mechanical system diagnosis data are examples. These are also examples of time-series data. Traditionally, it is assumed that time-series data can be stored easily and that established offline analysis and mining methods can be applied. In a streaming setting, however, the focus shifts towards online data mining, and this requirement makes offline algorithms infeasible. In [Agg13b, p.
260], it is noted that the problems of outlier detection in streaming time-series data and in multidimensional data streams are very different. The former requires the analysis of each series as a unit, whereas the latter requires the analysis of each multidimensional point as a unit. Outlier detection in a time series can be divided into two categories: values at specific time stamps are classified as outliers because of sudden changes (contextual anomalies), or entire time series or large subsequences within a time series are classified as outliers because of their unusual shapes (collective anomalies) [Agg13b, p. 227]. We use the terms "time-series data stream" and "streaming time-series data" interchangeably.

Time-series Data Stream:

We are motivated by three research issues raised in the context of data streams [SG14]:

"Research Issue 2- A data point has to be compared with the other data points with same temporal context (occurred within the time period which is semantically related to the timestamp of the data point)."

"Research Issue 6- An outlier detection technique for data streams should not assume any kind of fixed data distribution."

"Research Issue 14- An outlier detection technique for multiple data streams should be able to compare data points with the same or different schemas in order to detect outliers."

Change Detection in Data Stream:

Another important task in processing time-series data streams is change detection. For temporal data, change detection is closely related to anomaly detection but distinct from it: "It should be emphasized that change analysis and outlier detection (in temporal data) are very closely related areas, but not necessarily identical" [Agg13b, p. 25].

Figure 4: Different types of change [G_ZB+14]

The following modes of change have been identified in the literature: concept drift (gradual change) and concept shift (abrupt change). [Gam10, Ch. 3] and [Agg07, Ch.
5] provide separate chapters on change detection for data streams. Detecting concept drift is more difficult than detecting concept shift. [SG09, G_ZB+14] provide an extensive overview of detecting concept change.

In contrast with anomaly detection, concept drift detection compares two distributions, rather than comparing a given data point against a model prediction. A sliding window of the most recent examples is usually maintained and then compared against the learned hypothesis, performance indicators, or even just a previous time window. Much of the difference between the algorithms below lies in the way the sliding windows of recent examples are maintained and in the types of statistical tests performed (except for CVFDT), though some algorithms, notably the ADWIN family, allow different statistical tests to be used. In particular, the statistical tests range from a comparison of the means of old and new data, to order statistics [KBDG04], sequential hypothesis testing [MvdBW07], velocity density estimation [Agg03], the density test method [SWJR07], and Kullback-Leibler (KL) divergence [DKVY06]. Many of the results specifically address multidimensional data. Different tests suit different situations; [DKP11] compares the applicability of several of the above-mentioned tests.

The following is a sample of algorithms for detecting concept drift. Publicly available implementations exist for some of them: in particular, the MOA software environment for online learning from evolving data streams (http://moa.cms.waikato.ac.nz/) incorporates the ADWIN family of algorithms mentioned below.

1. CUSUM/PH test: Probably the oldest algorithm for change detection, CUSUM maintains a cumulative sum of (adjusted) examples seen so far: g_0 = 0 and g_t = max(0, g_{t-1} + (r_t - v)) in its simplest form (assuming only positive change). Here r_t is the value observed at time t, and v acts as an allowance for the magnitude of change that is tolerated. Whenever the cumulative sum g_t exceeds a given threshold, a change is detected.
A similar idea with a different cumulative variable is used in the Page-Hinkley (PH) test.

2. CVFDT: The CVFDT algorithm [HSD01] is an early algorithm that proposed an incremental approach for building and maintaining a decision tree (Hoeffding tree) in the face of changes, or concept drift, in a data stream environment. It does not need an external classifier: it checks the incoming data against the decision tree it is maintaining, and when that tree no longer describes the data adequately, it switches to an alternative tree. A number of implementations are available.

3. ADWIN: A common theme among change detection algorithms is maintaining a sliding window of new or relevant data. Bifet et al. [BG07] proposed an adaptive windowing scheme called ADWIN; a second version, ADWIN2, is now available, as well as a version with a Kalman filter. In ADWIN, the detection of change is based on statistical methods, in particular on the use of the Hoeffding bound. An implementation of ADWIN is available at http://adaptive-mining.sourceforge.net/?page_id=20; ADWIN and k-ADWIN are incorporated into http://moa.cms.waikato.ac.nz/.

4. OnePassSampler: Recently, a faster algorithm named OnePassSampler has been proposed [SPK13]. It does not perform the extensive within-window comparisons of ADWIN but instead uses a sequential hypothesis testing strategy. The statistical test involves computing sample means and using the Bernstein bound to estimate the error. It seems to perform well in terms of false positive and true positive rates; however, its detection delay is higher.

Proposed Research Methodology:

Contextual (Point) Anomaly Detection Framework:

Input: A univariate time-series data stream X = {x_1, x_2, x_3, ..., x_{t-1}, x_t, ...}, where each measurement has an explicit or implicit timestamp associated with it.

Output: Decide whether x_{t+1} is an anomaly (based on the definition of anomaly for the specific domain).
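Applied to a univariate stream of this form, the CUSUM recurrence from item 1 of the change detection list can be made concrete. The following is a minimal Python sketch, not a reference implementation: the function name `cusum_first_alarm`, the sample readings, and the values chosen for the allowance `v` and the threshold are illustrative assumptions, and only upward change is handled, as in the simplest form of the test.

```python
from typing import Iterable, Optional


def cusum_first_alarm(stream: Iterable[float], v: float,
                      threshold: float) -> Optional[int]:
    """Return the 0-based index of the first detected upward change, or None.

    v and threshold are tuning parameters (hypothetical values in the demo).
    """
    g = 0.0  # g_0 = 0
    for t, r in enumerate(stream):
        # g_t = max(0, g_{t-1} + (r_t - v))
        g = max(0.0, g + (r - v))
        if g > threshold:
            return t  # cumulative sum exceeded the threshold: change detected
    return None


if __name__ == "__main__":
    # Illustrative data: the level jumps from about 0 to about 2 at index 5.
    readings = [0.1, -0.2, 0.0, 0.3, -0.1, 2.1, 1.9, 2.2, 2.0]
    print(cusum_first_alarm(readings, v=0.5, threshold=4.0))
```

A larger `v` or threshold yields fewer false alarms at the cost of a longer detection delay, so both would need tuning for a given domain.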
Assumptions:
i) No ground truth is available, which makes supervised techniques less applicable.
ii) Near real-time anomaly detection is needed, which makes offline methods infeasible. That is, the decision on x_{t+1} must be made before the arrival of x_{t+2}.
iii) We consider domains where the data arrival rate is within a certain limit, which makes the second assumption fairly relaxed.

Contextual anomaly detection methods for the aforementioned setting are typically deviation based [Agg13b, p. 229]. We, however, are interested in using a non-parametric statistical method within a sliding window for online anomaly detection. Moreover, we are interested in using an external change detection mechanism to detect concept drift (gradual change), so that we can adapt to changes in the underlying data distribution when detecting anomalies. Unified techniques for change point and outlier detection are presented in [TY06, KS09, SZLH13], while the use of a change detection mechanism for outlier detection is presented in [BP_Z+09, PB_Z+10]. The primary motivation of that work, however, was not anomaly detection but better model prediction in the presence of concept drift (further review needed). We are interested in adapting the general framework for model prediction in [PB_Z+10] with slight modification:

Input: X = {x_1, x_2, x_3, ..., x_{t-1}, x_t, ...}.
1) Use ADWIN2 [BG07] to detect the concept change point c (issues: replacing outliers and normalization).
2) Learn the model F(x) from X_new = {x_c, ..., x_t}.
3) Possibly use different values of the confidence parameter δ for ensembles.

Our outlier detection framework will therefore be:
1) Remove obvious outliers from X_new using the Z-value test (or another suitable method) to make the next model more robust [Agg13b, p. 125].
2) Apply a non-parametric statistical method such as Kernel Density Estimation (KDE) [Sil86] to detect anomalies.
3) Use only the first window of data as the training set to model normal behavior with respect to the context (within the window).
4) The general KDE algorithm has O(n^2) computational complexity, but once the model is learned, the computational cost of outlier detection for each item is very low. A more efficient method may still be needed.

We see the following issues and questions for research.

The research questions are:
- How can the system work more accurately?
- How can the system be more efficient?
- How does the system differ from other algorithms?
- How can changes that appear over time be dealt with?

The sub-questions are:
- Would the system be user friendly?
- Would the system be cost effective?
- Would the system be able to find exactly correct results?

References:

[Agg03] Charu C Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 575-586. ACM, 2003.
[Agg07] Charu C Aggarwal. Data streams: models and algorithms, volume 31. Springer, 2007.
[Agg13a] Charu C Aggarwal. Managing and mining sensor data. Springer Science & Business Media, 2013.
[Agg13b] Charu C Aggarwal. Outlier analysis. Springer Science & Business Media, 2013.
[Agg13c] Charu C Aggarwal. Outlier ensembles: position paper. ACM SIGKDD Explorations Newsletter, 14(2):49-58, 2013.
[Agg15] Charu C Aggarwal. An introduction to data mining. In Data Mining, pages 1-26. Springer, 2015.
[BG07] Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In SDM, volume 7, page 2007. SIAM, 2007.
[BL94] Vic Barnett and Toby Lewis. Outliers in statistical data, volume 3. Wiley New York, 1994.
[BP_Z+09] Jorn Bakker, Mykola Pechenizkiy, I. Žliobaitė, Andriy Ivannikov, and Tommi Kärkkäinen. Handling outliers and concept drift in online mass flow prediction in CFB boilers. In Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data, pages 13-22. ACM, 2009.
[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey.
ACM Computing Surveys (CSUR), 41(3):15, 2009.
[CBK12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection for discrete sequences: A survey. Knowledge and Data Engineering, IEEE Transactions on, 24(5):823-839, 2012.
[DKP11] Tamraparni Dasu, Shankar Krishnan, and Gina Maria Pomann. Robustness of change detection algorithms. In Advances in Intelligent Data Analysis X, pages 125-137. Springer, 2011.
[DKVY06] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006.
[EA12] Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing Surveys (CSUR), 45(1):12, 2012.
[Fu11] Tak-chung Fu. A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1):164-181, 2011.
[Gam10] João Gama. Knowledge Discovery from Data Streams. Chapman and Hall / CRC Data Mining and Knowledge Discovery Series. CRC Press, 2010.
[GGAH14] Manish Gupta, Jing Gao, Charu Aggarwal, and Jiawei Han. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery, 5(1):1-129, 2014.
[GGO+08] Auroop R Ganguly, João Gama, Olufemi A Omitaomu, Mohamed Gaber, and Ranga Raju Vatsavai. Knowledge discovery from sensor data. CRC Press, 2008.
[Gru69] Frank E Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1-21, 1969.
[G_ZB+14] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.
[HA04] Victoria J Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85-126, 2004.
[Haw80] Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.
[HKP11] Jiawei Han, Micheline Kamber, and Jian Pei.
Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[HSD01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 97-106. ACM, 2001.
[JZXL14] Yexi Jiang, Chunqiu Zeng, Jian Xu, and Tao Li. Real time contextual collective anomaly detection over multiple data streams. 2014.
[KBDG04] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases, Volume 30, pages 180-191, 2004.
[KKZ09] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Outlier detection techniques. In Tutorial at the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009.
[KS09] Yoshinobu Kawahara and Masashi Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In SDM, volume 9, pages 389-400. SIAM, 2009.
[LRU14] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
[MMA14] Barbora Micenková, Brian McWilliams, and Ira Assent. Learning outlier ensembles: The best of both worlds, supervised and unsupervised. 2014.
[MSME15] Dylan McDonald, Stewart Sanchez, Sanjay Madria, and Fikret Ercal. A survey of methods for finding outliers in wireless sensor networks. Journal of Network and Systems Management, 23(1):163-182, 2015.
[MvdBW07] S Muthukrishnan, Eric van den Berg, and Yihua Wu. Sequential change detection on data streams. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 551-550. IEEE, 2007.
[PB_Z+10] Mykola Pechenizkiy, Jorn Bakker, I. Žliobaitė, Andriy Ivannikov, and Tommi Kärkkäinen. Online mass flow prediction in CFB boilers with explicit detection of sudden concept drift. ACM SIGKDD Explorations Newsletter, 11(2):109-116, 2010.
[SG09] Raquel Sebastião and João Gama. A study on change detection methods. In 4th Portuguese Conf. on Artificial Intelligence, Lisbon, 2009.
[SG14] Shiblee Sadik and Le Gruenwald. Research issues in outlier detection for data streams. ACM SIGKDD Explorations Newsletter, 15(1):33-40, 2014.
[Sil86] Bernard W Silverman. Density estimation for statistics and data analysis, volume 26. CRC Press, 1986.
[SPK13] Sripirakas Sakthithasan, Russel Pears, and Yun Sing Koh. One pass concept change detection for data streams. In Advances in Knowledge Discovery and Data Mining, pages 461-472. Springer, 2013.
[SWJR07] Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Statistical change detection for multi-dimensional data. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 667-676. ACM, 2007.
[SZLH13] Wei-xing Su, Yun-long Zhu, Fang Liu, and Kun-yuan Hu. On-line outlier and change point detection for time series. Journal of Central South University, 20:114-122, 2013.
[TY06] Jun-ichi Takeuchi and Kenji Yamanishi. A unifying framework for detecting outliers and change points from time series. Knowledge and Data Engineering, IEEE Transactions on, 18(4):482-492, 2006.
[ZCS14] Arthur Zimek, Ricardo JGB Campello, and Jörg Sander. Ensembles for unsupervised outlier detection: challenges and research questions a position paper. ACM SIGKDD Explorations Newsletter, 15(1):11-22, 2014.
[ZMH10] Yang Zhang, Nirvana Meratnia, and Paul Havinga. Outlier detection techniques for wireless sensor networks: A survey. Communications Surveys & Tutorials, IEEE, 12(2):159-170, 2010.
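
Illustrative sketch of the proposed outlier detection framework:

The outlier detection steps listed under the proposed methodology (Z-value filtering, KDE, and training on the first window only) can be illustrated with a short Python sketch. It is one possible reading of those steps under stated assumptions, not the proposal's implementation: the names `z_filter`, `kde_density`, and `KdeOutlierDetector`, the Gaussian kernel, the normal-reference bandwidth rule h = 1.06 * sigma * n^(-1/5), the 5% density quantile used as cutoff, and the synthetic window are all hypothetical choices. Change detection (ADWIN2) is deliberately left out.

```python
import math
import random
import statistics
from typing import List, Sequence


def z_filter(window: Sequence[float], z_max: float = 3.0) -> List[float]:
    """Step 1: drop points whose absolute Z-value exceeds z_max."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window)
    if sigma == 0.0:
        return list(window)  # constant window: nothing to remove
    return [x for x in window if abs(x - mu) / sigma <= z_max]


def kde_density(x: float, sample: Sequence[float], h: float) -> float:
    """Gaussian kernel density estimate at x; O(n) work per query."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)


class KdeOutlierDetector:
    """Models normal behavior from the first window only (step 3)."""

    def __init__(self, first_window: Sequence[float], quantile: float = 0.05):
        self.sample = z_filter(first_window)
        sigma = statistics.pstdev(self.sample)
        if sigma == 0.0:
            raise ValueError("window has no spread; bandwidth is undefined")
        # Assumption: normal-reference rule-of-thumb bandwidth.
        self.h = 1.06 * sigma * len(self.sample) ** (-0.2)
        # Scoring every training point against all others is the O(n^2) part.
        scores = sorted(kde_density(x, self.sample, self.h)
                        for x in self.sample)
        # Assumption: densities below this training quantile count as outliers.
        self.cutoff = scores[int(quantile * len(scores))]

    def is_outlier(self, x: float) -> bool:
        """Step 2: flag x when its estimated density is below the cutoff."""
        return kde_density(x, self.sample, self.h) < self.cutoff


if __name__ == "__main__":
    rng = random.Random(42)  # synthetic "normal" window, illustrative only
    window = [rng.gauss(10.0, 1.0) for _ in range(200)]
    detector = KdeOutlierDetector(window)
    print(detector.is_outlier(10.0), detector.is_outlier(25.0))
```

Once the detector is built, each new point costs one pass over the stored window, which matches the remark that per-item detection is cheap after the model is learned. In the full framework the detector would presumably be rebuilt from the data following each detected change point.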
