Apache Hadoop

Apache Hadoop – Apache Hadoop® is an open-source platform that enables the distributed processing of enormous data sets using straightforward programming models. Because Hadoop runs on clusters of commodity computers, massive amounts of structured, semi-structured, and unstructured data can be stored and processed cost-effectively and without regard to format. This makes Hadoop well suited to building data lakes that support big data analytics initiatives.

Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. It works by splitting big data and analytics jobs into smaller workloads, distributing them across the nodes of a computing cluster, and running them in parallel. Hadoop handles both structured and unstructured data and can scale reliably from a single server to thousands of machines.

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common – Provides common Java libraries that can be used across all modules.

Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. A classic Hadoop 1 deployment runs five daemons; the NameNode, Secondary NameNode, and DataNode belong to HDFS itself, while the JobTracker and TaskTracker belong to the original MapReduce engine:

  • NameNode
  • Secondary NameNode
  • JobTracker
  • DataNode
  • TaskTracker

Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.

Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

Hadoop Ozone – an object store for Hadoop, introduced in 2020.

Commercial applications of Hadoop include:

  • Log or clickstream analysis.
  • Marketing analytics.
  • Machine learning and data mining.
  • Image processing.
  • XML message processing.
  • Web crawling.
  • Archival work for compliance, including relational and tabular data.

Difference between Hadoop 1 and Hadoop 2 (YARN)

The main distinction between Hadoop 1 and Hadoop 2 is the addition of YARN (Yet Another Resource Negotiator), which took over cluster resource management from the MapReduce engine of the first Hadoop version. YARN aims to distribute resources effectively among multiple applications. It splits the responsibilities of the old JobTracker across two components: the ResourceManager, a cluster-wide daemon that allocates resources to applications and tracks jobs, and a per-application ApplicationMaster, which monitors the progress of its own job's execution.

What is Hadoop Programming?

In the Hadoop framework, code is mostly written in Java, though some native code is written in C, and command-line utilities are typically shell scripts. Java is the most common choice for Hadoop MapReduce, but through a module such as Hadoop Streaming, users can implement the map and reduce functions in the programming language of their choice.
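
As a concrete illustration, here is a minimal sketch of the classic word-count job written against the Java MapReduce API; the input and output HDFS paths are supplied as command-line arguments:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its slice of the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```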

How Hadoop Works

Hadoop makes it simpler to use the full storage and processing capacity of cluster servers and to run distributed processes against enormous amounts of data. Other services and applications can be built on top of Hadoop's building blocks.

Applications that gather data in various formats add it to the Hadoop cluster by connecting to the NameNode through an API operation. The NameNode tracks the file directory structure and the placement of each file's "chunks", which are replicated across DataNodes. To run a job that queries the data, you supply a MapReduce job made up of many map and reduce tasks that run against the data spread across the DataNodes in HDFS. Each node runs a map task against its share of the input files, and reducers then aggregate and arrange the output.
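
To make the ingest step concrete, the following sketch writes a small file into HDFS through Hadoop's FileSystem API; the NameNode address and target path are placeholder assumptions:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points the client at the NameNode; host and port are placeholders.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/data/ingest/events.log"))) {
      // The client streams the bytes to DataNodes; the NameNode records only
      // the file's metadata and block locations.
      out.write("first record\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```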

The extensibility of the Hadoop ecosystem has allowed it to grow significantly over time. The ecosystem now includes numerous tools and applications that can be used to gather, store, process, analyse, and manage large amounts of data. The most popular include:

Presto – an open-source, low-latency distributed SQL query engine designed for ad-hoc data analysis. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from numerous sources, including the Hadoop Distributed File System (HDFS) and Amazon S3.
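
A sketch of how an application might submit an ad-hoc query to Presto over JDBC; the coordinator host, the hive/default catalog path, and the clicks table are all hypothetical, and the Presto JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoAdHocQuery {
  public static void main(String[] args) throws Exception {
    // jdbc:presto://<coordinator>:<port>/<catalog>/<schema>; all placeholders.
    String url = "jdbc:presto://presto.example.com:8080/hive/default";
    // Null password: an unsecured test cluster is assumed here.
    try (Connection conn = DriverManager.getConnection(url, "analyst", null);
         Statement stmt = conn.createStatement();
         // An ANSI SQL aggregation over a hypothetical clicks table.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM clicks "
                 + "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```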

Spark – an open-source distributed processing system frequently used for big data workloads. Apache Spark supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries. It uses in-memory caching and optimised execution for fast performance.
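
A minimal sketch of a Spark batch job in Java that counts matching lines in a file stored on HDFS; the path is a placeholder, and the job would be packaged and launched with spark-submit:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorLineCount {
  public static void main(String[] args) {
    // The master URL is supplied by spark-submit; the app name is arbitrary.
    SparkConf conf = new SparkConf().setAppName("error-line-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load a text file from HDFS (placeholder path), keep lines containing
    // "ERROR", and count them; the work is distributed across the cluster.
    long errors = sc.textFile("hdfs:///data/ingest/events.log")
        .filter(line -> line.contains("ERROR"))
        .count();

    System.out.println("error lines: " + errors);
    sc.stop();
  }
}
```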

Hive – lets users drive Hadoop MapReduce through a SQL interface, enabling distributed and fault-tolerant data warehousing in addition to large-scale analytics.
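
As a sketch, an application can reach Hive's SQL interface through the HiveServer2 JDBC driver; the connection details and the clicks table layout below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (recent releases self-register).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Host, port, and database are placeholders.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {
      // Project a SQL table over tab-separated files already sitting in HDFS.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS clicks (page STRING, ts BIGINT) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
          + "LOCATION '/data/ingest/clicks'");
      // The query compiles to a distributed job over the files in HDFS.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT page, COUNT(*) FROM clicks GROUP BY page")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```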

HBase – an open-source, versioned, non-relational database that runs on top of either the Hadoop Distributed File System (HDFS) or Amazon S3 (using EMRFS). HBase is a massively scalable, distributed big data store built for random, strictly consistent, real-time access to tables with billions of rows and millions of columns.
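
A short sketch of random, strongly consistent reads and writes with the HBase Java client; the users table and its profile column family are hypothetical and assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Cluster location (e.g. the ZooKeeper quorum) is read from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key "user42", column family "profile", qualifier "email".
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
          Bytes.toBytes("user42@example.com"));
      table.put(put);

      // Random, strictly consistent read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      byte[] email = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}
```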

Zeppelin – a web-based notebook that enables interactive data exploration.

When was Hadoop Invented?

When search engines like Yahoo and Google were just getting started, there was a need to process ever-larger volumes of big data and deliver web results more quickly. Hadoop grew out of the Apache Nutch project, which Doug Cutting and Mike Cafarella started in 2002. They were inspired by Google's MapReduce programming model, which divides an application into small portions that run on different nodes. According to a New York Times article, Doug named Hadoop after his son's toy elephant.

A few years later, Hadoop was split off from Nutch: Nutch kept the web-crawler component, while Hadoop took on distributed computing and processing. In 2008, two years after Cutting joined the company, Yahoo released Hadoop as an open-source project. The Apache Software Foundation (ASF) made Hadoop available to the general public as Apache Hadoop in November 2012.

What is Hadoop Used for?

Hadoop is used for both production and research by a wide range of businesses and organisations, and users are encouraged to add themselves to the Hadoop PoweredBy wiki page. Its potential applications span countless industries, including:

  • Finance
  • Retail
  • Healthcare
  • Security and Law Enforcement

Download Apache Hadoop

For convenience, Hadoop is distributed as source code tarballs with corresponding binary tarballs. The downloads are served through mirror sites, so each download should be verified against the official GPG signature or SHA-512 checksum to check for tampering.
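
These checks are normally run with the gpg and shasum command-line tools; purely as an illustration in this article's primary language, the sketch below computes a downloaded tarball's SHA-512 digest so it can be compared with the published .sha512 file:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Sha512Check {
  public static void main(String[] args) throws Exception {
    // args[0] is the path to the downloaded tarball.
    MessageDigest md = MessageDigest.getInstance("SHA-512");
    try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n); // feed the file through the digest in chunks
      }
    }
    // Print the digest as lowercase hex for comparison with the .sha512 file.
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex);
  }
}
```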

See Also: Linux OS
