A spark-based parallel distributed posterior decoding algorithm for big data hidden Markov models decoding problem

Published on Sep 1, 2021 in IAES International Journal of Artificial Intelligence
· DOI: 10.11591/IJAI.V10.I3.PP789-800
Imad Sassi
Estimated H-index: 4
Samir Anter
Estimated H-index: 5
Abdelkrim Bekkhoucha
Estimated H-index: 2
Hidden Markov models (HMMs) are among the machine learning algorithms that have been widely used and have proven efficient in many conventional applications. This paper proposes a modified posterior decoding algorithm that solves the HMM decoding problem using the MapReduce paradigm and Spark's resilient distributed datasets (RDDs) for large-scale data processing. The objective of this work is to improve the performance of HMMs so that they can cope with big data challenges. The proposed algorithm substantially reduces time complexity and achieves good running time, speedup, and parallelization efficiency on large amounts of data, i.e., large numbers of states and large numbers of sequences.
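For readers unfamiliar with posterior decoding, the following is a minimal single-machine sketch in plain Python. It is not the authors' modified algorithm: it shows the textbook scaled forward-backward procedure, which picks the most probable state at each position independently. The function names `posterior_decode` and `decode_all`, the parameter names `pi`, `A`, `B`, and the toy model are all assumptions made for illustration. Because each observation sequence is decoded independently, `decode_all` applies the decoder with Python's built-in `map`, which only stands in for the map step of a MapReduce-style job.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def posterior_decode(obs: Sequence[int], pi: Sequence[float],
                     A: Matrix, B: Matrix) -> List[int]:
    """Return, for each position, the state with the highest posterior probability.

    obs: observation symbols as integer indices (hypothetical encoding)
    pi:  initial state distribution
    A:   A[i][j] = P(next state j | current state i)
    B:   B[i][k] = P(symbol k | state i)
    """
    n, T = len(pi), len(obs)
    if T == 0:
        return []

    # Forward pass, rescaled at every step to avoid numerical underflow.
    alpha = [[0.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(n):
            if t == 0:
                prior = pi[j]
            else:
                prior = sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prior * B[j][obs[t]]
        scale[t] = sum(alpha[t])
        if scale[t] == 0.0:
            raise ValueError("observation sequence has zero probability")
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            ) / scale[t + 1]

    # The posterior of state i at time t is proportional to alpha * beta.
    path = []
    for t in range(T):
        gamma = [alpha[t][i] * beta[t][i] for i in range(n)]
        path.append(max(range(n), key=gamma.__getitem__))
    return path


def decode_all(sequences: Sequence[Sequence[int]], pi: Sequence[float],
               A: Matrix, B: Matrix) -> List[List[int]]:
    """Decode each sequence independently (stand-in for a distributed map step)."""
    return list(map(lambda seq: posterior_decode(seq, pi, A, B), sequences))


# Toy two-state, two-symbol model (illustrative values only).
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.1, 0.9]]
print(decode_all([[0, 0, 1, 1], [1, 1]], pi, A, B))
```

In this sketch the cost per sequence is on the order of T·n² operations for T observations and n states, which is why both the number of states and the number of sequences matter for running time. In a Spark setting, a function like `posterior_decode` could in principle be passed to an RDD's `map` transformation over a collection of sequences; how the paper actually partitions the computation is not described in the abstract above.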
Papers frequently viewed together
2010, CIT: Computer and Information Technology
12 Citations
#1 Bhavya Mor, H-Index: 2
#2 Sunita Garhwal, H-Index: 5
Last. Ajay Kumar (Thapar University), H-Index: 16
The hidden Markov models are statistical models used in many real-world applications and communities. The use of hidden Markov models has become predominant in the last decades, as evidenced by a large number of published papers. In this survey, 146 papers (101 from Journals and 45 from Conferences/Workshops) from 93 Journals and 44 Conferences/Workshops are considered. The authors evaluate the literature based on hidden Markov model variants that have been applied to various application fields....
20 Citations
#1 Hao-Chun Lu, H-Index: 9
#2 F. J. Hwang (UTS: University of Technology, Sydney), H-Index: 8
Last. Yao-Huei Huang (FJU: Fu Jen Catholic University), H-Index: 7
The genetic algorithm (GA), one of the best-known metaheuristic algorithms, has been extensively utilized in various fields of management science, operational research, and industrial engineering. The efficiency of GAs in solving large-scale optimization problems would be enhanced if the iterative processes required by the genetic operators can be implemented in a parallel and distributed computing architecture. Apache Hadoop has recently been one of the most popular systems for distrib...
8 Citations
#1 Uma Narayanan (CUSAT: Cochin University of Science and Technology), H-Index: 4
#2 Varghese Paul (Rajagiri), H-Index: 6
Last. Shelbi Joseph (CUSAT: Cochin University of Science and Technology), H-Index: 5
Data is growing exponentially in the fast-changing world of information and communications technology. Information from sensors, cell phones, social networking sites, logical information and ventures are all adding to this gigantic blast in the information. One of the best mainstream utilities available for dealing with the colossal measure of data is the Hadoop community. Enterprises are progressively depending on Hadoop for preparing their essential information. In any case, Hadoop is still de...
6 Citations
#1 Ahmed Yaseen Mjhool (University of Kufa), H-Index: 1
#2 Ahmed Hazim Alhilali (University of Kufa), H-Index: 1
Last. Salam Al-augby (University of Kufa), H-Index: 2
Nowadays, educational data have increased rapidly because of the online services provided for both students and staff. University of Kufa (UoK) generates a massive amount of data annually due to the use of e-learning web-based systems, network servers, Windows applications, and Students Information System (SIS). This data is wasted because traditional management software is not capable of analyzing it. As a result, the Big Educational Data concept rises to help education sectors by providing new...
3 Citations
#1 Imad Sassi, H-Index: 4
#2 Samir Anter, H-Index: 5
5 Citations
#1 Zryan Najat Rashid, H-Index: 4
#2 Subhi R. M. Zebari, H-Index: 1
Last. Karwan Jacksi (University of Zakho), H-Index: 7
(4 authors in total)
In this paper, we present a discussion panel of two of the hottest topics in this area, namely distributed parallel processing and distributed cloud computing. Various aspects have been discussed in this review paper, such as concentrating on whether these topics are discussed simultaneously in any previous works. Other aspects that have been reviewed in this paper include the algorithms which are simulated in both distributed parallel computing and distributed cloud computing. The goal is to process...
27 Citations
#1 Rong Gu (NU: Nanjing University), H-Index: 9
#2 Yun Tang (NU: Nanjing University), H-Index: 3
Last. Yihua Huang (NU: Nanjing University), H-Index: 15
(7 authors in total)
Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications. Thus its performance optimization is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. For existing matrix multiplication execution strategies, when the execution concurrency scales up above a threshold, their execution performance deteriorates quickly be...
13 Citations
#1 Xiangrui Meng, H-Index: 18
#2 Joseph K. Bradley, H-Index: 12
Last. Ameet Talwalkar (UCLA: University of California, Los Angeles), H-Index: 46
(16 authors in total)
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's ...
1,001 Citations
Apr 25, 2012 in NSDI (Networked Systems Design and Implementation)
#1 Matei Zaharia (University of California, Berkeley), H-Index: 56
#2 Mosharaf Chowdhury (University of California, Berkeley), H-Index: 22
Last. Ion Stoica (University of California, Berkeley), H-Index: 141
(9 authors in total)
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form...
3,196 Citations
#1 Andreas Sand (AU: Aarhus University), H-Index: 7
#2 Christian N. S. Pedersen (AU: Aarhus University), H-Index: 30
Last. Asbjorn Tolbol Brask, H-Index: 1
(4 authors in total)
We present a C++ library for constructing and analyzing general hidden Markov models. The library consists of a number of template classes and generic functions, parameterized with the precision of floating point types and different types of hardware acceleration.
14 Citations
Cited By: 0