Apache Authors: Pat Romanski, Liz McMillan, Elizabeth White, Christopher Harrold, Janakiram MSV

Related Topics: @BigDataExpo, @CloudExpo, Apache

@BigDataExpo: Blog Feed Post

Big Data and Analytics By @MElRefaey | @BigDataExpo #BigData

Hadoop is a framework that simplifies the processing of data sets distributed across clusters of servers

This post is the first in a series of blog posts that will explore and exploit the Big Data and analytics tools. I will walk through easy steps to start working with such tools like Apache Hadoop, Pig, Mahout and solve some problems related to analytics and learning in the large scale by exploiting such tools, and shed the light on some of the challenges we face while working with these tools.

1. Apache Hadoop
1.1 Overview

Hadoop is a framework that simplifies the processing of data sets distributed across clusters of servers. Two of the main components of Hadoop are HDFS and MapReduce.HDFS is the file system that is used by Hadoop to store all the data. This file system spans across all the nodes that are being used by Hadoop. These nodes could be on a single server or they can be spread across a large number of servers.In this section, we will go through the instruction of how to get the Hadoop up and running with the configurations needed to make it useful for other components/frameworks that integrate or depends on Hadoop (e.g. Hive, Pig, HBase etc.).

Note: The installation will be a Pseudo distribution.

1.2 Tools and Versions
I've used the following tools and versions throughout this installation:

  • Ubuntu 14.04 LTS
  • Java 1.7.0_65 (java-7-openjdk-amd64)
  • Hadoop 2.5.1

1.3    Installation and Configurations

1. Install Java using the following command:

apt-get update apt-get install default-jdk

2. Create Security Keys using the following commands:

ssh-keygen -t rsa -P ' ' cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. Download Hadoop tar file using:

wget http://www.webhostingreviewjam.com/mirror/apache/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz

4. Extract the tar file using:

tar -xzvf hadoop-2.5.1.tar.gz

5. Move the extracted files into a location you can easily recognize, and easily change the version used without much modifications using:

mv hadoop-2.5.1/ /usr/local/hadoop

6. Configure the following environment variables in the bashrc file (to make sure every time they are set with the machine sartup):


7. Source the bashrc file after changes, for the system to recognize the changes using the following command:

source ~/.bashrc

8. Edit the Hadoop-env.sh using vim:

vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

The hadoop-env.sh file should look like this:

That will make the value of the JAVA_HOME always available to Hadoop whenever it starts.

9-      Edit the core-site.xml file using vim as well:

vim /usr/local/hadoop/etc/hadoop/core-site.xml
The file will look like:

10-   Edit the YARN file yarn-site.xml as follows:

vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
The file will look like:

11. Create and edit the mapred-site.xml file:

vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
The file will contains the following property, that specify which framework will be used for MapReduce:

12. Edit the hdfs-site.xml file, in order to specify the directories that will be used as datanode and namenode on that server.

vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Create the two directories:  mkdir -p /usr/local/hadoop_store/hdfs/namenode                mkdir -p /usr/local/hadoop_store/hdfs/datanode
after editing the file, it will contains the following  properties:

13.Forma t the new Hadoop file system using the following command:

hdfs namenode -format
Note: This operation needs to be done once before we start using Hadoop. If it is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop filesystem.

14. Now, all configurations are done, we can start using Hadoop, we should first run the following shell scripts:

start-dfs.sh                     start-yarn.sh
And to make sure everything is okay, and the right process is running, run the command jps and see the following:

15-   We can run MapReduce examples that exist in Hadoop bundle, but we need to run the following:

We should create the HDFS directories required to execute MapReduce jobs:
hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mohamed
and copy the input files to be processed into the distributed filesystem:
hdfs dfs -put {here is the path to the files to be copied} input

16-   We can check the web console for the resource manager, HDFS nodes and running jobs as shown in the following screens:

Issues and problems:

I've experienced some issues related to: Ø  Formatting the HDFS, and I resolved it by changing permissions and ownership of the user who can format the namenode and datanode. Ø  Problem connecting to the resource manager, with the following error: ipc.Client: Retrying connect to server: Already tried 0 time(s); maxRetries=45
INFO client.RMProxy: Connecting to ResourceManager at / And I resolved it by: adding a few properties to yarn-site.xml :

We reached to the end of our first post on big data and analytics, hope you enjoyed reading it and experiminting with Hadoop installation and configuration. next post will be about Apache Pig.

Read the original blog entry...

More Stories By Mohamed El-Refaey

Work as head of research and development at EDC (Egypt Development Center) a member of NTG. previously worked for Qlayer, Acquired by (Sun Microsystems), when my passion about cloud computing domain started. with more than 10 years of experience in software design and development in e-commerce, BPM, EAI, Web 2.0, Banking applications, financial market, Java and J2EE. HIPAA, SOX, BPEL and SOA, and late two year focusing on virtualization technology and cloud computing in studies, technical papers and researches, and international group participation and events. I've been awarded in recognition of innovation and thought leadership while working as IT Specialist at EDS (an HP Company). Also a member of the Cloud Computing Interoperability Forum (CCIF) and member of the UCI (Unified Cloud Interface) open source project, in which he contributed with the project architecture.

@ThingsExpo Stories
SYS-CON Events announced today that Secure Channels, a cybersecurity firm, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Secure Channels, Inc. offers several products and solutions to its many clients, helping them protect critical data from being compromised and access to computer networks from the unauthorized. The company develops comprehensive data encryption security strategie...
SYS-CON Events announced today that App2Cloud will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. App2Cloud is an online Platform, specializing in migrating legacy applications to any Cloud Providers (AWS, Azure, Google Cloud).
WebRTC is the future of browser-to-browser communications, and continues to make inroads into the traditional, difficult, plug-in web communications world. The 6th WebRTC Summit continues our tradition of delivering the latest and greatest presentations within the world of WebRTC. Topics include voice calling, video chat, P2P file sharing, and use cases that have already leveraged the power and convenience of WebRTC.
Internet-of-Things discussions can end up either going down the consumer gadget rabbit hole or focused on the sort of data logging that industrial manufacturers have been doing forever. However, in fact, companies today are already using IoT data both to optimize their operational technology and to improve the experience of customer interactions in novel ways. In his session at @ThingsExpo, Gordon Haff, Red Hat Technology Evangelist, shared examples from a wide range of industries – including en...
Detecting internal user threats in the Big Data eco-system is challenging and cumbersome. Many organizations monitor internal usage of the Big Data eco-system using a set of alerts. This is not a scalable process given the increase in the number of alerts with the accelerating growth in data volume and user base. Organizations are increasingly leveraging machine learning to monitor only those data elements that are sensitive and critical, autonomously establish monitoring policies, and to detect...
To get the most out of their data, successful companies are not focusing on queries and data lakes, they are actively integrating analytics into their operations with a data-first application development approach. Real-time adjustments to improve revenues, reduce costs, or mitigate risk rely on applications that minimize latency on a variety of data sources. Jack Norris reviews best practices to show how companies develop, deploy, and dynamically update these applications and how this data-first...
Intelligent Automation is now one of the key business imperatives for CIOs and CISOs impacting all areas of business today. In his session at 21st Cloud Expo, Brian Boeggeman, VP Alliances & Partnerships at Ayehu, will talk about how business value is created and delivered through intelligent automation to today’s enterprises. The open ecosystem platform approach toward Intelligent Automation that Ayehu delivers to the market is core to enabling the creation of the self-driving enterprise.
"We're a cybersecurity firm that specializes in engineering security solutions both at the software and hardware level. Security cannot be an after-the-fact afterthought, which is what it's become," stated Richard Blech, Chief Executive Officer at Secure Channels, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Consumers increasingly expect their electronic "things" to be connected to smart phones, tablets and the Internet. When that thing happens to be a medical device, the risks and benefits of connectivity must be carefully weighed. Once the decision is made that connecting the device is beneficial, medical device manufacturers must design their products to maintain patient safety and prevent compromised personal health information in the face of cybersecurity threats. In his session at @ThingsExpo...
SYS-CON Events announced today that Massive Networks will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Massive Networks mission is simple. To help your business operate seamlessly with fast, reliable, and secure internet and network solutions. Improve your customer's experience with outstanding connections to your cloud.
The question before companies today is not whether to become intelligent, it’s a question of how and how fast. The key is to adopt and deploy an intelligent application strategy while simultaneously preparing to scale that intelligence. In her session at 21st Cloud Expo, Sangeeta Chakraborty, Chief Customer Officer at Ayasdi, will provide a tactical framework to become a truly intelligent enterprise, including how to identify the right applications for AI, how to build a Center of Excellence to ...
From 2013, NTT Communications has been providing cPaaS service, SkyWay. Its customer’s expectations for leveraging WebRTC technology are not only typical real-time communication use cases such as Web conference, remote education, but also IoT use cases such as remote camera monitoring, smart-glass, and robotic. Because of this, NTT Communications has numerous IoT business use-cases that its customers are developing on top of PaaS. WebRTC will lead IoT businesses to be more innovative and address...
Everything run by electricity will eventually be connected to the Internet. Get ahead of the Internet of Things revolution and join Akvelon expert and IoT industry leader, Sergey Grebnov, in his session at @ThingsExpo, for an educational dive into the world of managing your home, workplace and all the devices they contain with the power of machine-based AI and intelligent Bot services for a completely streamlined experience.
Because IoT devices are deployed in mission-critical environments more than ever before, it’s increasingly imperative they be truly smart. IoT sensors simply stockpiling data isn’t useful. IoT must be artificially and naturally intelligent in order to provide more value In his session at @ThingsExpo, John Crupi, Vice President and Engineering System Architect at Greenwave Systems, will discuss how IoT artificial intelligence (AI) can be carried out via edge analytics and machine learning techn...
SYS-CON Events announced today that GrapeUp, the leading provider of rapid product development at the speed of business, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Grape Up is a software company, specialized in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market acr...
SYS-CON Events announced today that Datera, that offers a radically new data management architecture, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera is transforming the traditional datacenter model through modern cloud simplicity. The technology industry is at another major inflection point. The rise of mobile, the Internet of Things, data storage and Big...
In his opening keynote at 20th Cloud Expo, Michael Maximilien, Research Scientist, Architect, and Engineer at IBM, discussed the full potential of the cloud and social data requires artificial intelligence. By mixing Cloud Foundry and the rich set of Watson services, IBM's Bluemix is the best cloud operating system for enterprises today, providing rapid development and deployment of applications that can take advantage of the rich catalog of Watson services to help drive insights from the vast t...
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business - from apparel to energy - is being rewritten by software. From planning to development to management to security, CA creates software that fuels transformation for companies in the applic...
Recently, IoT seems emerging as a solution vehicle for data analytics on real-world scenarios from setting a room temperature setting to predicting a component failure of an aircraft. Compared with developing an application or deploying a cloud service, is an IoT solution unique? If so, how? How does a typical IoT solution architecture consist? And what are the essential components and how are they relevant to each other? How does the security play out? What are the best practices in formulating...
In his session at @ThingsExpo, Arvind Radhakrishnen discussed how IoT offers new business models in banking and financial services organizations with the capability to revolutionize products, payments, channels, business processes and asset management built on strong architectural foundation. The following topics were covered: How IoT stands to impact various business parameters including customer experience, cost and risk management within BFS organizations.