Smart cities, social networks, health care systems, large sensor networks, and similar sources generate massive datasets. It is non-trivial to extract knowledge from such big datasets because traditional data mining algorithms run impractically on them. Distributed systems help address this problem, but they introduce new challenges in designing scalable algorithms. The transition from traditional algorithms to ones that can run on a distributed platform should be done carefully, and researchers should design modern distributed algorithms around the problem domain. The main goal of this dissertation is to demonstrate the importance of domain-specific knowledge in developing scalable knowledge discovery algorithms on distributed systems. Data properties such as origin, type, context, and size play important roles in achieving speed, efficiency, and scalability. In this dissertation, I describe three domain-specific knowledge discovery systems in three diverse domains: a distributed algorithm to extract patterns from log messages generated by computers, a distributed algorithm to find abnormal behavior in social media, and a scalable algorithm for matching patterns in streaming time series data. I explain how to exploit the data properties in a distributed knowledge discovery system to achieve scalability and speed. The algorithms achieve horizontal scalability for any data size, and the systems are currently deployed at the University of New Mexico.

The Kohonen self-organizing map (SOM) is a type of unsupervised artificial neural network for visualizing and clustering complex data, reducing its dimensionality, and selecting influential features. Like all clustering methods, the SOM requires a measure of similarity between input data (in this work, time series).
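To make the SOM idea concrete, here is a minimal sketch of training a one-dimensional Kohonen map on fixed-length time series. It is an illustrative toy, not the implementation used in this work: the grid size, the learning-rate and neighborhood schedules, and the choice of plain Euclidean distance as the similarity measure are all assumptions.

```python
import numpy as np

def train_som(data, grid_size=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D Kohonen SOM on fixed-length time series.

    data: (n_series, series_len) array; each row is one time series.
    Similarity here is Euclidean distance; any other measure could be
    swapped in, which is exactly the design choice noted above.
    """
    rng = np.random.default_rng(seed)
    # Initialize each node's codebook vector from distinct input series.
    weights = data[rng.choice(len(data), grid_size, replace=False)].astype(float).copy()
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)          # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5  # shrinking neighborhood, small floor
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: node whose codebook is closest to x.
            bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # Pull every node toward x, weighted by its grid distance to the BMU.
            grid_dist = np.abs(np.arange(grid_size) - bmu)
            h = np.exp(-(grid_dist ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

def assign_clusters(data, weights):
    """Map each series to the index of its best-matching unit."""
    return np.array([int(np.argmin(np.linalg.norm(weights - x, axis=1)))
                     for x in data])
```

After training, series that map to the same node (or to neighboring nodes on the grid) form clusters, which is what gives the SOM its combined clustering-and-visualization character.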
In the era of new technologies, computer scientists deal with massive data, hundreds of terabytes in size. With the growth of Industry 4.0 and the Industrial Internet of Things (IIoT), processing streaming time series data from sensors with low latency and limited computing resources becomes a critical problem. To tackle real-world challenges in this area, such as equipment health monitoring by comparing an incoming data stream against known fault patterns, we formulate a new problem called "fine-grained pattern matching". It allows users to define different deviations for different segments of a given pattern, as well as fuzzy breakpoints between adjacent segments, which dramatically increases the complexity compared with the traditional pattern matching problem over streams. In this paper, we propose a novel two-phase approach to solve this problem. In the pruning phase, we propose the ELB (Equal Length Block) representation and the BSP (Block-Skipping Pruning) policy, which efficiently filter out unmatched subsequences with a guarantee of no false dismissals. In the post-processing phase, we provide an algorithm that further examines the possible matches in linear complexity. An extensive experimental evaluation on synthetic and real-world datasets illustrates that our algorithm outperforms both the brute-force method and MSM, a multi-step filter mechanism over a multi-scaled representation, by orders of magnitude.
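The two-phase structure (prune cheaply with blocks, then verify survivors exactly) can be illustrated with a deliberately simplified sketch. This is not the paper's ELB representation or BSP policy: it uses a per-point tolerance band, fixed-size blocks summarized by min/max envelopes, and a naive sliding window, all of which are illustrative assumptions. It does preserve the key invariant, though: the block filter only discards windows that provably violate the tolerance, so there are no false dismissals.

```python
import numpy as np

def match_stream(stream, pattern, eps, block=8):
    """Two-phase sketch: block-envelope pruning, then exact verification.

    A window w matches if |w[i] - pattern[i]| <= eps[i] for every i
    (per-point tolerances stand in for per-segment deviations).
    `block` and `eps` are illustrative parameters.
    """
    m = len(pattern)
    n_blocks = (m + block - 1) // block
    # Per-block envelope of the tolerated region:
    # hi[j] = max_i (pattern[i] + eps[i]) over block j, lo[j] likewise.
    hi = np.array([np.max(pattern[j*block:(j+1)*block] + eps[j*block:(j+1)*block])
                   for j in range(n_blocks)])
    lo = np.array([np.min(pattern[j*block:(j+1)*block] - eps[j*block:(j+1)*block])
                   for j in range(n_blocks)])

    matches = []
    for s in range(len(stream) - m + 1):
        w = stream[s:s+m]
        pruned = False
        # Phase 1: if any block of w leaves its envelope, some point of w
        # necessarily violates its tolerance, so skipping the window is safe.
        for j in range(n_blocks):
            wb = w[j*block:(j+1)*block]
            if wb.max() > hi[j] or wb.min() < lo[j]:
                pruned = True
                break
        # Phase 2: exact linear-time verification of the survivors.
        if not pruned and np.all(np.abs(w - pattern) <= eps):
            matches.append(s)
    return matches
```

A real streaming implementation would maintain block summaries incrementally as points arrive rather than rescanning every window, which is presumably where a block-skipping policy such as the paper's BSP pays off.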