Performance analysis of data intensive cloud systems based on data management and replication: a survey

Published on Jun 1, 2016 in Distributed and Parallel Databases (0.757)
· DOI: 10.1007/s10619-015-7173-2
Saif Ur Rehman Malik (CUI: COMSATS Institute of Information Technology), Estimated H-index: 16
Samee U. Khan (NDSU: North Dakota State University), Estimated H-index: 69
+ 14 authors, including Hongxiang Li (University of Louisville), Estimated H-index: 17
As we delve deeper into the `Digital Age', we witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data were created on a daily basis, originating from a myriad of sources and applications including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises, cameras, software logs, etc. Such a `Data Explosion' has led to one of the most challenging research issues of the current Information and Communication Technology era: how to optimally manage (e.g., store, replicate, filter, and the like) such large amounts of data and identify new ways to analyze them for unlocking information. It is clear that such large data streams cannot be managed by setting up on-premises enterprise database systems, as that leads to a large up-front cost in buying and administering the hardware and software systems. Therefore, next-generation data management systems must be deployed on the cloud. The cloud computing paradigm provides scalable and elastic resources, such as data and services, accessible over the Internet. Every Cloud Service Provider must assure that data is efficiently processed and distributed in a way that does not compromise end-users' Quality of Service (QoS) in terms of data availability, data search delay, data analysis delay, and the like. In the aforementioned perspective, data replication is used in the cloud to improve the performance (e.g., read and write delay) of applications that access data. Through replication, a data-intensive application or system can achieve high availability, better fault tolerance, and data recovery. In this paper, we survey data management and replication approaches (from 2007 to 2011) that were developed by both the industrial and research communities.
The focus of the survey is to discuss and characterize the existing approaches to data replication and management that tackle resource usage and QoS provisioning with different levels of efficiency. Moreover, we deliberate on how the two central concepts (data replication and data management) break down into different QoS attributes. Furthermore, the performance advantages and disadvantages of data replication and management approaches in cloud computing environments are analyzed. Open issues and future challenges related to data consistency, scalability, load balancing, processing, and placement are also reported.
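The read-performance benefit of replication described in the abstract can be made concrete with a minimal sketch (the replica site names and latency figures below are hypothetical, not from the survey): with an object replicated at several sites, a client serves each read from the replica with the lowest observed latency.

```python
# Illustrative sketch (not from the survey): serve each read from the
# replica site with the smallest measured latency, reducing read delay.

def choose_replica(latencies_ms):
    """Return the replica site with the smallest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)

# Hypothetical measured round-trip latencies to three replica sites.
latencies_ms = {"us-east": 42.0, "eu-west": 95.0, "ap-south": 180.0}

best = choose_replica(latencies_ms)
print(best)  # → us-east
```

A real system would refresh these measurements continuously and also weigh replica staleness and load, which is exactly the trade-off space the surveyed approaches explore.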
#1 Lizhe Wang, H-Index: 57
#2 Wei Jie, H-Index: 13
Last. Jinjun Chen (RIT: Rochester Institute of Technology)
view all 3 authors...
Identifies recent technological developments worldwide: The field of grid computing has made rapid progress in the past few years, evolving and developing in almost all areas, including concepts, philosophy, methodology, and usages. Grid Computing: Infrastructure, Service, and Applications reflects the recent advances in this field, covering the research aspects that involve infrastructure, middleware, architecture, services, and applications. Grid systems across the globe: The first section of the b...
38 Citations
Dec 1, 2013 in CloudCom (IEEE International Conference on Cloud Computing Technology and Science)
#1 Gabriel Loewen (UA: University of Alabama), H-Index: 3
#2 Jeffrey Galloway (UA: University of Alabama), H-Index: 2
Last. Susan V. Vrbsky (UA: University of Alabama), H-Index: 17
view all 5 authors...
As federal funding in many public non-profit organizations (NPOs) seems to be dwindling, it is of the utmost importance that efforts are focused on reducing the operating costs of needy organizations, such as public schools. Our approach for reducing organizational costs is through the combined benefits of a high-performance cloud architecture and low-power, thin-client devices. However, general-purpose private cloud architectures are not easily deployable by average users, or even those with some ...
1 Citation
#1 Javid Taheri (USYD: University of Sydney), H-Index: 24
#2 Albert Y. Zomaya (USYD: University of Sydney), H-Index: 81
Last. Samee U. Khan (NDSU: North Dakota State University), H-Index: 69
view all 4 authors...
This paper presents a novel heuristic approach, named JDS-HNN, to simultaneously schedule jobs and replicate data files to different entities of a grid system so that the overall makespan of executing all jobs as well as the overall delivery time of all data files to their dependent jobs is concurrently minimized. JDS-HNN is inspired by a natural distribution of a variety of stones among different jars and utilizes a Hopfield Neural Network in one of its optimization stages to achieve its goals....
15 Citations
#1 Dzmitry Kliazovich (University of Luxembourg), H-Index: 24
#2 Pascal Bouvry (University of Luxembourg), H-Index: 38
Last. Samee U. Khan (NDSU: North Dakota State University), H-Index: 69
view all 3 authors...
Cloud computing data centers are becoming increasingly popular for the provisioning of computing resources. The cost and operating expenses of data centers have skyrocketed with the increase in computing capacity. In this chapter, we survey the main techniques behind enabling energy efficiency in data centers and present a simulation environment for energy-aware cloud computing. Along with workload distribution, the focus is devoted to simulating packet-level communications in realistic setups...
19 Citations
Dec 17, 2012 in FIT (Frontiers of Information Technology)
#1 Hamed S. Kia (NDSU: North Dakota State University), H-Index: 3
#2 Samee U. Khan (NDSU: North Dakota State University), H-Index: 69
This paper studies and proposes heuristic algorithms to solve the problem of replicated server placement (RSP) with Quality of Service (QoS) constraints. Although there has been much work on RSP in multicast networks, most of it uses a simplified replication model; therefore, the proposed solutions may not be applicable to real systems. In this paper, we use a more realistic and generalized model for replica placement, which considers the latency restriction of the receivers (QoS), ba...
3 Citations
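The latency-constrained placement problem in the abstract above can be illustrated with a minimal greedy heuristic (hypothetical node names, receivers, and latencies; this is a textbook-style covering sketch, not the heuristic proposed in the paper): repeatedly place a replica at the candidate node that serves the most still-uncovered receivers within their latency bound.

```python
# Minimal greedy sketch of QoS-constrained replica placement
# (illustrative only; not the paper's algorithm).
# latency[node][receiver] = latency from candidate node to receiver.

def place_replicas(latency, bound_ms):
    """Greedily pick replica nodes until every receiver is served
    within its latency bound; returns the chosen nodes in order."""
    receivers = {r for lats in latency.values() for r in lats}
    uncovered, chosen = set(receivers), []
    while uncovered:
        # Node covering the most still-uncovered receivers within bound.
        node = max(latency, key=lambda n: sum(
            1 for r in uncovered if latency[n][r] <= bound_ms))
        covered = {r for r in uncovered if latency[node][r] <= bound_ms}
        if not covered:  # remaining receivers cannot meet the bound
            break
        chosen.append(node)
        uncovered -= covered
    return chosen

# Hypothetical 2-node, 3-receiver instance with a 50 ms QoS bound.
latency = {
    "n1": {"r1": 10, "r2": 40, "r3": 90},
    "n2": {"r1": 80, "r2": 70, "r3": 20},
}
print(place_replicas(latency, bound_ms=50))  # → ['n1', 'n2']
```

The paper's point is precisely that such simplified models omit real-system constraints (bandwidth, update traffic), which is why its generalized model is needed.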
Dec 17, 2012 in ICPADS (International Conference on Parallel and Distributed Systems)
#1 Guthemberg Silvestre (UPMC: Pierre-and-Marie-Curie University), H-Index: 6
#2 Sébastien Monnet (UPMC: Pierre-and-Marie-Curie University), H-Index: 12
Last. Pierre Sens (UPMC: Pierre-and-Marie-Curie University), H-Index: 38
view all 4 authors...
Delivering on-demand web content to end-users in order to meet strict QoS metrics is not a trivial task for globally distributed network providers. This task becomes even harder when content popularity varies over time and the SLA definitions have to include both transfer rate and latency metrics. Current worldwide content delivery approaches and datacenter infrastructures rely on cumbersome replication schemes that are agnostic to edge-network resources and damage content provision. ...
18 Citations
#2 Keren Bergman, H-Index: 67
Last. Ioannis Tomkos, H-Index: 47
view all 3 authors...
Optical Interconnects in Future Data Center Networks covers optical networks and how they can be used to provide high bandwidth, energy efficient interconnects for future data centers with increased communication bandwidth requirements. This contributed volume presents an integrated view of the future requirements of the data centers and serves as a reference work for some of the most advanced solutions that have been proposed by major universities and companies. Collecting the most recent and i...
68 Citations
#1 Samee U. Khan (NDSU: North Dakota State University), H-Index: 69
#2 Nasro Min-Allah (CUI: COMSATS Institute of Information Technology), H-Index: 16
We study the multi-objective problem of mapping independent tasks onto a set of data center machines that simultaneously minimizes the energy consumption and response time (makespan) subject to the constraints of deadlines and architectural requirements. We propose an algorithm based on goal programming that effectively converges to a compromise Pareto-optimal solution. Compared to other traditional multi-objective optimization techniques that require identification of the Pareto frontier, go...
17 Citations
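Goal programming, as mentioned in this abstract, replaces the search for a full Pareto frontier with minimizing weighted deviations from per-objective targets. A tiny brute-force sketch (the task/machine costs, goals, and weights below are hypothetical; this is the general technique, not the paper's algorithm):

```python
# Tiny goal-programming sketch (illustrative; not the paper's algorithm):
# map each task to a machine so that the weighted overshoots of the
# energy goal and the makespan goal are minimized.
from itertools import product

# Hypothetical per-machine cost of each task: (energy, time).
cost = {  # cost[machine][task] = (energy, time)
    "m1": {"t1": (3.0, 2.0), "t2": (4.0, 3.0)},
    "m2": {"t1": (5.0, 1.0), "t2": (2.0, 4.0)},
}
tasks, machines = ["t1", "t2"], ["m1", "m2"]
energy_goal, makespan_goal = 5.0, 3.0
w_energy, w_time = 1.0, 1.0

def deviation(mapping):
    """Weighted overshoot of both goals for a task->machine mapping."""
    energy = sum(cost[m][t][0] for t, m in mapping.items())
    # Makespan = finish time of the most loaded machine.
    makespan = max(sum(cost[m][t][1] for t, mm in mapping.items() if mm == m)
                   for m in machines)
    return (w_energy * max(0.0, energy - energy_goal)
            + w_time * max(0.0, makespan - makespan_goal))

best = min((dict(zip(tasks, assignment))
            for assignment in product(machines, repeat=len(tasks))),
           key=deviation)
print(best, deviation(best))  # → {'t1': 'm1', 't2': 'm2'} 1.0
```

The paper's contribution is an algorithm that reaches such a compromise solution without enumerating all mappings, which is infeasible at data center scale.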
#1 Samee U. Khan (NDSU: North Dakota State University), H-Index: 69
#2 Pascal Bouvry (University of Luxembourg), H-Index: 38
Last. Thomas Engel (University of Luxembourg), H-Index: 27
view all 3 authors...
High-Performance Computing (HPC) is a major contributor to cutting-edge research and discovery in science and technology. Several key research findings can be attributed to tests and simulations run on HPCs. Over the last decade, we have witnessed computing service providers continually upgrading their infrastructures to HPCs that can meet the increasing demands of powerful newer applications. In parallel, almost in concert, computing manufacturers have consolidated ...
10 Citations
The most critical and important aspect of disaster recovery is to protect the data from application failover, natural disasters, and infrastructure failures. Taking frequent backups of the huge volumes of data and storing them is also an integral part of the disaster recovery plan. Various scenarios of database replication strategies and techniques are provided in this survey paper, addressing the need for replication of data. A wide range of open source and commercial tools have evolved over a per...
16 Citations
Cited By 34
#1 Xianke Sun, H-Index: 1
#2 Gaoliang Wang, H-Index: 1
Last. Honglei Yuan, H-Index: 1
view all 4 authors...
Performance differentiation and optimization are major dimensions and critical activities in cloud computing systems with shared execution infrastructures. Supporting these features from the perspective of cloud architecture, related concerns and requirements are important challenges, which need more in-depth research. In this regard, this work investigates the dark dimensions of the problem toward realizing an integrated architecture scheme. Therefore, the main goals of the research are to inve...
The current study analyzes the factors that determine companies' acceptance of cloud computing (the SaaS model) and of this strategy for doing their work. The research model is designed to explore the factors that affect this use of computing, including Technology Acceptance Model (TAM) constructs and other external factors such as organizational size and technical complexity. Data compiled from 200 companies are used to test the ideas. The results of this study show what important factors need to be considered and how they relate t...
#1 Quadri Waseem, H-Index: 1
Last. Amril Nazir, H-Index: 1
view all 5 authors...
Data replication effectively copies the same data to multiple locations to accomplish the objective of zero loss of information in case of failures, without any downtime. Dynamic data replication strategies (providing run-time location of replicas) in clouds should optimize the key performance indicator parameters, like response time, reliability, availability, scalability, cost, performance, etc. To fulfill these objectives, various state-of-the-art dynamic data replica...
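One of the availability objectives listed above can be illustrated numerically: if each node is up independently with probability p, then n replicas give availability 1 - (1 - p)^n, so the smallest replica count meeting a target can be computed directly (the p and target values below are hypothetical).

```python
# Availability-driven replica count sketch (hypothetical numbers):
# with independent node availability p, n replicas yield availability
# 1 - (1 - p)**n; find the smallest n that meets a target.
from math import ceil, log

def replicas_needed(p, target):
    """Smallest replica count n with 1 - (1 - p)**n >= target."""
    return max(1, ceil(log(1 - target) / log(1 - p)))

# 95%-available nodes, "five nines" target availability.
print(replicas_needed(p=0.95, target=0.99999))  # → 4
```

Real strategies must also weigh the cost side of this trade-off, since each extra replica adds storage and consistency-maintenance overhead, which is why the dynamic strategies surveyed here adjust replica counts at run time.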
#1 Sarra Slimani (Tunis University), H-Index: 2
#2 Tarek Hamrouni (Tunis University), H-Index: 12
Last. Faouzi Ben Charrada (Tunis University), H-Index: 7
view all 3 authors...
Recent years have witnessed significant interest in migrating different applications onto cloud platforms. In this context, one of the main challenges for cloud application providers is how to ensure high availability of the delivered applications while meeting users' QoS. In this respect, replication techniques are commonly applied to efficiently handle this issue. From the literature, according to the granularity used for replication, there are two major approaches to achieve replicati...
9 Citations
#1 Shady S. Refaat, H-Index: 2
#2 Omar Ellabban, H-Index: 17
Last. Miroslav Begovic, H-Index: 43
view all 0 authors...
The smart grid (SG) allows integration of renewable energy sources, distributed generation (DG), and storage systems. This chapter builds on the concepts of data management and analytics in SG to build the foundation needed for data analytics to transform Big Data for high-value action. Big Data sources in SG generally fall into two main categories: electric utility data sources and supplementary data sources. The Big Data system will store, process, and mine information in an efficient manner to...
2 Citations
#1 Panagiotis Moutafis (UTH: University of Thessaly), H-Index: 2
Last. Luis Iribarne (UAL: University of Almería), H-Index: 13
view all 6 authors...
Given two datasets of points (called Query and Training), the Group K Nearest-Neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied during recent years, and several performance-improving techniques and pruning heuristics have been proposed. In previous work, we presented the first MapReduce algorithm, consisting of alternating local and parallel phases, which can be used to effectively p...
1 Citation
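The GKNN query defined in the abstract above can be made concrete with a brute-force sketch (toy 2-D coordinates of our own choosing; the paper's contribution is a pruned MapReduce algorithm, not this naive version):

```python
# Naive GKNN sketch (illustrative; the cited paper proposes a pruned
# MapReduce version): return the K training points with the smallest
# summed Euclidean distance to all query points.
from math import dist

def gknn(query, training, k):
    """Brute-force Group K Nearest-Neighbor query."""
    return sorted(training,
                  key=lambda p: sum(dist(p, q) for q in query))[:k]

# Toy 2-D instance.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
print(gknn(query, training, k=2))  # → [(1.0, 0.0), (1.0, 1.0)]
```

This naive version costs O(|Training| x |Query|) distance computations, which is exactly what the pruning heuristics and the local/parallel MapReduce phases mentioned in the abstract aim to reduce.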
We witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. Such a “data explosion” is mostly caused by mobile devices, sensors, actuators, and social networks. Despite significant technological advancements, the users' quality of experience is barely met. The ever-increasing demands of users related to high storage, fast computation, and processing have led toward the innovation of 5G technology. In this article, we conducted a study to highlight t...
3 Citations
#2 Belabbas Yagoubi, H-Index: 1
Last. Fatima Zohra Bellounar, H-Index: 1
view all 3 authors...
Cloud Computing provides on-demand resources for customers and enterprises to outsource their online activities efficiently and less expensively. However, the cloud environment is heterogeneous and very dynamic; storage node failures and increasing demands on data can lead to data unavailability situations, leading to a decrease in quality of service. Cloud service providers face the challenge of ensuring maximum data availability and reliability. Replication of data to different nodes in the clo...
1 Citation