If you have more data, you can often beat better algorithms that have less data to work with, because more data tends to give more accurate results. Big Data is here. So what is the problem? The bad news is that we are struggling to store and analyze it.
The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds have not kept up. A typical drive from around 1990 had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, 1-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had
100 drives, each holding one hundredth of the data. Working in parallel, we could read
the data in under two minutes.
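To make the arithmetic concrete, here is a small back-of-the-envelope calculation in Java. The 1-terabyte dataset and the 100 MB/s transfer speed come from the figures above; the assumption that the data splits perfectly evenly across 100 drives, with no coordination overhead, is an idealization.

    // Back-of-the-envelope read times for a 1 TB dataset at ~100 MB/s per drive.
    public class ReadTime {
        public static void main(String[] args) {
            double dataMb = 1_000_000;   // 1 TB expressed in megabytes (decimal units)
            double speedMbPerSec = 100;  // transfer speed of one modern drive

            double oneDiskMinutes = dataMb / speedMbPerSec / 60;             // whole dataset on one drive
            double hundredDiskMinutes = (dataMb / 100) / speedMbPerSec / 60; // each drive holds 1/100th

            System.out.printf("One disk:  %.0f minutes (about 2.8 hours)%n", oneDiskMinutes); // ~167
            System.out.printf("100 disks: %.1f minutes%n", hundredDiskMinutes);               // ~1.7
        }
    }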
Using only one hundredth of a disk may seem wasteful. But we can store 100 data sets,
each of which is 1 terabyte, and provide shared access to them. We can imagine that the
users of such a system would be happy to share access in return for shorter analysis
times, and statistically, that their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much. There is more to it than being able to read and write data in parallel to or from multiple disks, though.
The first problem to solve is hardware failure: as soon as you start using many pieces of
hardware, the chance that one will fail is fairly high. A common way of avoiding data
loss is through replication: redundant copies of the data are kept by the system so that
in the event of failure, there is another copy available. This is how RAID works, for
instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS),
takes a slightly different approach, as you shall see later.
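As a concrete illustration of replication, the sketch below uses Hadoop's org.apache.hadoop.fs.FileSystem API to read and change the replication factor of a single HDFS file. The file path /user/demo/data.txt and the target factor of 3 are made-up example values, not anything from the text.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: inspecting and adjusting how many redundant copies HDFS keeps of one file.
    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();    // reads core-site.xml / hdfs-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/data.txt"); // hypothetical example file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication factor: " + status.getReplication());

            // Ask HDFS to keep three copies of this file's blocks across the cluster.
            fs.setReplication(file, (short) 3);
        }
    }

Because each block of the file is stored on several different machines, losing one disk (or one whole node) does not lose the data; HDFS re-replicates the affected blocks elsewhere.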
The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it is the interface between the two where the "mixing" occurs. Like HDFS, MapReduce has built-in reliability.
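To make the two parts of the computation concrete, here is the canonical word-count job written against Hadoop's Java MapReduce API; it is a generic sketch rather than an example taken from the text. The map phase emits a (word, 1) pair for every word it sees, the framework groups the pairs by key between the two phases, and the reduce phase sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: for each input line, emit (word, 1) for every word on the line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts that the framework has grouped under each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A job like this is typically packaged into a JAR and submitted with something like hadoop jar wordcount.jar WordCount <input> <output>; HDFS provides the input splits, and MapReduce takes care of the parallelism, the shuffle between map and reduce, and retrying failed tasks.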
In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and
analysis. What’s more, because it runs on commodity hardware and is open source,
Hadoop is affordable.