Welcome!

Apache Authors: Liz McMillan, Pat Romanski, Elizabeth White, Christopher Harrold, Janakiram MSV

Related Topics: @BigDataExpo, @CloudExpo, Apache

@BigDataExpo: Blog Feed Post

Big Data and Analytics By @MElRefaey | @BigDataExpo #BigData

Hadoop is a framework that simplifies the processing of data sets distributed across clusters of servers

This post is the first in a series of blog posts that will explore and exploit the Big Data and analytics tools. I will walk through easy steps to start working with such tools like Apache Hadoop, Pig, Mahout and solve some problems related to analytics and learning in the large scale by exploiting such tools, and shed the light on some of the challenges we face while working with these tools.

1. Apache Hadoop
1.1 Overview

Hadoop is a framework that simplifies the processing of data sets distributed across clusters of servers. Two of the main components of Hadoop are HDFS and MapReduce.HDFS is the file system that is used by Hadoop to store all the data. This file system spans across all the nodes that are being used by Hadoop. These nodes could be on a single server or they can be spread across a large number of servers.In this section, we will go through the instruction of how to get the Hadoop up and running with the configurations needed to make it useful for other components/frameworks that integrate or depends on Hadoop (e.g. Hive, Pig, HBase etc.).

Note: The installation will be a Pseudo distribution.

1.2 Tools and Versions
I've used the following tools and versions throughout this installation:

  • Ubuntu 14.04 LTS
  • Java 1.7.0_65 (java-7-openjdk-amd64)
  • Hadoop 2.5.1

1.3    Installation and Configurations

1. Install Java using the following command:

apt-get update apt-get install default-jdk

2. Create Security Keys using the following commands:

ssh-keygen -t rsa -P ' ' cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. Download Hadoop tar file using:

wget http://www.webhostingreviewjam.com/mirror/apache/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz

4. Extract the tar file using:

tar -xzvf hadoop-2.5.1.tar.gz

5. Move the extracted files into a location you can easily recognize, and easily change the version used without much modifications using:

mv hadoop-2.5.1/ /usr/local/hadoop

6. Configure the following environment variables in the bashrc file (to make sure every time they are set with the machine sartup):

#HADOOP VARIABLES START export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib" #HADOOP VARIABLES END

7. Source the bashrc file after changes, for the system to recognize the changes using the following command:

source ~/.bashrc

8. Edit the Hadoop-env.sh using vim:

vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

The hadoop-env.sh file should look like this:


That will make the value of the JAVA_HOME always available to Hadoop whenever it starts.

9-      Edit the core-site.xml file using vim as well:

vim /usr/local/hadoop/etc/hadoop/core-site.xml
The file will look like:

10-   Edit the YARN file yarn-site.xml as follows:

vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
The file will look like:

11. Create and edit the mapred-site.xml file:

vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
The file will contains the following property, that specify which framework will be used for MapReduce:


12. Edit the hdfs-site.xml file, in order to specify the directories that will be used as datanode and namenode on that server.

vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Create the two directories:  mkdir -p /usr/local/hadoop_store/hdfs/namenode                mkdir -p /usr/local/hadoop_store/hdfs/datanode
after editing the file, it will contains the following  properties:


13.Forma t the new Hadoop file system using the following command:

hdfs namenode -format
Note: This operation needs to be done once before we start using Hadoop. If it is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop filesystem.

14. Now, all configurations are done, we can start using Hadoop, we should first run the following shell scripts:

start-dfs.sh                     start-yarn.sh
And to make sure everything is okay, and the right process is running, run the command jps and see the following:



15-   We can run MapReduce examples that exist in Hadoop bundle, but we need to run the following:

We should create the HDFS directories required to execute MapReduce jobs:
hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mohamed
and copy the input files to be processed into the distributed filesystem:
hdfs dfs -put {here is the path to the files to be copied} input

16-   We can check the web console for the resource manager, HDFS nodes and running jobs as shown in the following screens:







Issues and problems:

I've experienced some issues related to: Ø  Formatting the HDFS, and I resolved it by changing permissions and ownership of the user who can format the namenode and datanode. Ø  Problem connecting to the resource manager, with the following error: ipc.Client: Retrying connect to server:

0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); maxRetries=45
INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 And I resolved it by: adding a few properties to yarn-site.xml :



We reached to the end of our first post on big data and analytics, hope you enjoyed reading it and experiminting with Hadoop installation and configuration. next post will be about Apache Pig.

Read the original blog entry...

More Stories By Mohamed El-Refaey

Work as head of research and development at EDC (Egypt Development Center) a member of NTG. previously worked for Qlayer, Acquired by (Sun Microsystems), when my passion about cloud computing domain started. with more than 10 years of experience in software design and development in e-commerce, BPM, EAI, Web 2.0, Banking applications, financial market, Java and J2EE. HIPAA, SOX, BPEL and SOA, and late two year focusing on virtualization technology and cloud computing in studies, technical papers and researches, and international group participation and events. I've been awarded in recognition of innovation and thought leadership while working as IT Specialist at EDS (an HP Company). Also a member of the Cloud Computing Interoperability Forum (CCIF) and member of the UCI (Unified Cloud Interface) open source project, in which he contributed with the project architecture.

@ThingsExpo Stories
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), will provide an overview of various initiatives to certifiy the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldw...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
SYS-CON Events announced today that HTBase will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. HTBase (Gartner 2016 Cool Vendor) delivers a Composable IT infrastructure solution architected for agility and increased efficiency. It turns compute, storage, and fabric into fluid pools of resources that are easily composed and re-composed to meet each application’s needs. With HTBase, companies can quickly prov...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
There are 66 million network cameras capturing terabytes of data. How did factories in Japan improve physical security at the facilities and improve employee productivity? Edge Computing reduces possible kilobytes of data collected per second to only a few kilobytes of data transmitted to the public cloud every day. Data is aggregated and analyzed close to sensors so only intelligent results need to be transmitted to the cloud. Non-essential data is recycled to optimize storage.
"I think that everyone recognizes that for IoT to really realize its full potential and value that it is about creating ecosystems and marketplaces and that no single vendor is able to support what is required," explained Esmeralda Swartz, VP, Marketing Enterprise and Cloud at Ericsson, in this SYS-CON.tv interview at @ThingsExpo, held June 7-9, 2016, at the Javits Center in New York City, NY.
SYS-CON Events announced today that Outlyer, a monitoring service for DevOps and operations teams, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Outlyer is a monitoring service for DevOps and Operations teams running Cloud, SaaS, Microservices and IoT deployments. Designed for today's dynamic environments that need beyond cloud-scale monitoring, we make monitoring effortless so you ...
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor - all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.
What sort of WebRTC based applications can we expect to see over the next year and beyond? One way to predict development trends is to see what sorts of applications startups are building. In his session at @ThingsExpo, Arin Sime, founder of WebRTC.ventures, will discuss the current and likely future trends in WebRTC application development based on real requests for custom applications from real customers, as well as other public sources of information,
China Unicom exhibit at the 19th International Cloud Expo, which took place at the Santa Clara Convention Center in Santa Clara, CA, in November 2016. China United Network Communications Group Co. Ltd ("China Unicom") was officially established in 2009 on the basis of the merger of former China Netcom and former China Unicom. China Unicom mainly operates a full range of telecommunications services including mobile broadband (GSM, WCDMA, LTE FDD, TD-LTE), fixed-line broadband, ICT, data communica...
With the introduction of IoT and Smart Living in every aspect of our lives, one question has become relevant: What are the security implications? To answer this, first we have to look and explore the security models of the technologies that IoT is founded upon. In his session at @ThingsExpo, Nevi Kaja, a Research Engineer at Ford Motor Company, will discuss some of the security challenges of the IoT infrastructure and relate how these aspects impact Smart Living. The material will be delivered i...
Apache Hadoop is emerging as a distributed platform for handling large and fast incoming streams of data. Predictive maintenance, supply chain optimization, and Internet-of-Things analysis are examples where Hadoop provides the scalable storage, processing, and analytics platform to gain meaningful insights from granular data that is typically only valuable from a large-scale, aggregate view. One architecture useful for capturing and analyzing streaming data is the Lambda Architecture, represent...
As organizations realize the scope of the Internet of Things, gaining key insights from Big Data, through the use of advanced analytics, becomes crucial. However, IoT also creates the need for petabyte scale storage of data from millions of devices. A new type of Storage is required which seamlessly integrates robust data analytics with massive scale. These storage systems will act as “smart systems” provide in-place analytics that speed discovery and enable businesses to quickly derive meaningf...
Your homes and cars can be automated and self-serviced. Why can't your storage? From simply asking questions to analyze and troubleshoot your infrastructure, to provisioning storage with snapshots, recovery and replication, your wildest sci-fi dream has come true. In his session at @DevOpsSummit at 20th Cloud Expo, Dan Florea, Director of Product Management at Tintri, will provide a ChatOps demo where you can talk to your storage and manage it from anywhere, through Slack and similar services ...