Sunday 25 December 2016

Hadoop Solution for Big Data

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of unstructured and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting.

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform — which was inspired by the systems Google built.

Generally, in large companies, maintaining a data center requires very expensive hardware and networking in order to keep the impact of failures low. If you have a small company (basically a start-up), you cannot go for extortionately expensive hardware, because the cost of the data machines alone could eat your entire start-up budget. Since we are dealing with big data, we need machines with a lot of storage, and we want that storage installed at very low cost. What's the solution? Simple. Use Hadoop on less expensive machines instead of going for pricier ones.

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity (low-cost) servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, the software busts that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data stored on a server that goes offline or dies can be automatically replicated from a known good copy. We store data on multiple systems because, since we are using commodity hardware, we need a safety net to protect our data.
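The split-and-spread idea above can be sketched in a few lines of Python. This is a toy simulation, nothing like the real HDFS NameNode code: the block size, the round-robin placement, and the server names are all illustrative assumptions (HDFS actually defaults to 128 MB blocks and a replication factor of 3).

```python
import itertools

BLOCK_SIZE = 16   # bytes per block here; HDFS defaults to 128 MB
REPLICATION = 3   # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Bust the data into fixed-size pieces, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, servers, replication=REPLICATION):
    """Spread each block across `replication` different servers (round robin)."""
    placement = {}
    ring = itertools.cycle(range(len(servers)))
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [servers[next(ring)] for _ in range(replication)]
    return placement

servers = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"all of your organization's data goes here")
placement = place_blocks(blocks, servers)
# Every block now lives on 3 of the 4 nodes, so losing any single
# node still leaves at least two known-good copies of each block.
```

There is no single place holding the whole file — only the placement map knows where each piece lives, which is roughly the bookkeeping role the NameNode plays in real HDFS.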

In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster (a distributed system), every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set. Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out.
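The map-then-reduce flow above can be sketched in plain Python. This is a toy single-process simulation of the idea — real Hadoop runs the map loop in parallel across servers through its Java API — and the word-count task is just the classic illustrative example:

```python
from collections import defaultdict

def mapper(chunk):
    """Each server runs this on its own little piece of the data."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Reduce the per-key values back into a single result."""
    return (key, sum(values))

def map_reduce(chunks):
    # Map phase: "send the code" out to every chunk of data.
    intermediate = defaultdict(list)
    for chunk in chunks:               # in Hadoop, this loop runs in parallel
        for key, value in mapper(chunk):
            intermediate[key].append(value)
    # Reduce phase: combine the mapped results into one result set.
    return dict(reducer(k, v) for k, v in intermediate.items())

chunks = ["big data big", "data big"]  # each string stands for one server's piece
print(map_reduce(chunks))              # {'big': 3, 'data': 2}
```

Notice that the mapper never sees the whole data set — it only touches its own chunk, which is exactly why the work parallelizes so well.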

To put it in a bit more detail: traditionally, we write the code, bring the data into the code by performing file operations, execute it, and send the result back. In the Hadoop architecture, we write the code and then send the code to the places where the data is stored across the distributed system. This is the main advantage of Hadoop. How is it done? I'll cover that in further posts. Please stay tuned.

Is Hadoop the only thing available in the market to deal with big data?
The answer to this question is NO. There are many alternative frameworks available today for handling big data. Some names you will hear are Cloudera, Hortonworks, MapR, Spark, Octopy, Sphere, etc. Strictly speaking, Cloudera, Hortonworks, and MapR are commercial distributions built on top of Hadoop rather than replacements for it, while Spark and the others are separate processing frameworks. We cannot rank every vendor against every other, since each has special features that are not present in another.

Advantages of Hadoop:


1. Scalable


  • Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database systems (RDBMS) that can’t scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data.

2. Cost effective

  • Hadoop also offers a cost effective storage solution for businesses’ exploding data sets. The problem with traditional relational database management systems is that they are extremely cost-prohibitive to scale to the degree needed to process such massive volumes of data. In an effort to reduce costs, many companies in the past would have had to down-sample data and classify it based on certain assumptions as to which data was the most valuable. The raw data would be deleted, as it would be too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that when business priorities changed, the complete raw data set was not available, as it was too expensive to store.

3. Flexible

  • Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media and email conversations. Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection.

4. Fast

  • Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data wherever it is located in a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.

5. Resilient to failure

  • A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.
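That re-replication step can be sketched like so. This is a toy model that tracks block placement in a plain dict — an illustrative assumption, not how the real NameNode works internally:

```python
def handle_node_failure(placement, dead_node, live_nodes):
    """When a node dies, restore its blocks from a surviving replica.

    placement maps block_id -> list of nodes currently holding that block.
    """
    for block_id, nodes in placement.items():
        if dead_node in nodes:
            nodes.remove(dead_node)
            # Pick any live node that doesn't already hold this block,
            # then replicate to it from a known good copy.
            target = next(n for n in live_nodes if n not in nodes)
            nodes.append(target)
    return placement

placement = {0: ["node1", "node2", "node3"],
             1: ["node2", "node3", "node4"]}
live = ["node1", "node3", "node4"]
handle_node_failure(placement, "node2", live)
# Both blocks are back to three replicas, none of them on node2.
```

The cluster heals itself without any data passing through a central point: each lost block is recopied from one of its surviving replicas.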




Disadvantages of Hadoop:

As the backbone of so many implementations, Hadoop is almost synonymous with big data.

1. Security Concerns

Just managing complex applications such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever is managing the platform lacks the know-how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, which is a major concern for government agencies and others that prefer to keep their data under wraps.

2. Vulnerable By Nature

Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cyber criminals and as a result, implicated in numerous security breaches.

3. Not Fit for Small Data

While big data is not exclusively made for big businesses, not all big data platforms are suited for small data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity design, the Hadoop Distributed File System lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data.

4. Potential Stability Issues

Like all open source software, Hadoop has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or run it under a third-party vendor equipped to handle such problems.

5. General Limitations

Tools such as Apache Flume, MillWheel, and Google’s own Cloud Dataflow are often put forward as possible solutions. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point is that companies could be missing out on big benefits by using Hadoop alone.

Stay tuned for more posts. Thank you for reading and spending your time here.

2 comments:

  1. Do hadoop and cloudera belong to the same kind of applications?
    I think hadoop and cloudera have intersection = NULL on their domains of problem set.
