THE NEXT BIG THING?
Hot on the heels of Web 2.0 and cloud computing, Big Data is without doubt one of the Next Big Things in the IT world. Whereas Web 2.0 linked people and things online, and cloud computing involves the transition to an online computing infrastructure, Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques. This page provides an overview of Big Data characteristics, technologies and opportunities.
THE BIG DATA EXPLOSION
The quantity of computer data being generated on Planet Earth is growing exponentially for a number of related reasons. For a start, as a result of e-commerce and loyalty card schemes, retailers are starting to build up vast databases of recorded customer activity. Organizations working in the logistics, financial services, healthcare and many other sectors are also now capturing more and more data and wish to generate additional value from it. The public use of social media is also creating vast quantities of digital material that may potentially be mined and crowdsourced to generate valuable insights.
As vision recognition improves, it is additionally starting to become possible for computers to glean useful information and data relationships from still images and video. As more smart objects go online, Big Data is also being generated by an expanding Internet of Things (IoT). And finally, several areas of scientific advancement -- including rapid genome sequencing, nanotechnology, synthetic biology, and climate simulation -- are starting to generate and rely on vast quantities of data that were until very recently almost unimaginable.
Capturing, storing and generating value from Big Data raises a number of technical and conceptual challenges that go beyond the capabilities of traditional computing. To get a handle on the issues involved, most commentators describe the characteristics and challenges of Big Data using the "three Vs" of volume, velocity and variety (a model first developed by Doug Laney).
Volume is both Big Data's greatest challenge and its greatest opportunity. This is because storing, interlinking and processing vast quantities of digital information offers tremendous possibilities for a wide range of activities. These include predicting customer behaviour, diagnosing disease, planning healthcare services, and modelling our climate. However, traditional computing solutions like relational databases are increasingly not capable of handling such tasks. Most traditional computer hardware solutions are also not scalable to Big Data proportions.
Big Data velocity also raises a number of key issues. For a start, the rate at which data is flowing into most organizations is increasing beyond the capacity of their IT systems to store and process it. In addition, users increasingly want streaming data to be delivered to them in real time, and often on mobile devices. Online video, location tracking, augmented reality and many other applications now rely on large quantities of such high velocity data streams, and for many companies delivering them is proving quite a challenge.
Finally, as already highlighted, Big Data is characterised by its variety, with the types of data that many organizations are called on to process becoming increasingly diverse and dense. Gone are the days when data centres only had to process documents, financial transactions, stock records, and personnel files. Today, photographs, audio, video, 3D models, complex simulations and location data are all being piled into many a corporate data silo. Many of these Big Data sources are also almost entirely unstructured, and hence not easy to categorize, let alone process, with traditional computing techniques. All of this means that Big Data is in reality messy data, with a great deal of effort required in complex pre-processing and data cleansing before any meaningful analysis can be carried out.
Due to the challenges of volume, velocity and variety, many organizations at present have little choice but to ignore or rapidly excrete very large quantities of potentially very valuable information. Indeed, if we think of organizations as creatures that process data, then most are currently rather primitive forms of life. Their sensors and IT systems are simply not up to the job of scanning and interpreting the vast oceans of data in which they swim. As a consequence, most of the data that surrounds organizations today is ignored. A large proportion of the data that they gather is then not processed, with a significant quantity of useful information passing straight through them as "data exhaust".
For example, until very recently the vast majority of the data captured via retailer loyalty card systems was not processed in any way. And still today, almost all video data captured by hospitals during surgery is deleted within weeks. This is almost scandalous given that interlinking and intelligently mining these image streams could improve both individual patient outcomes and wider healthcare planning.
Due to the issues raised by its volume, velocity and variety, Big Data requires new technology solutions. Currently leading the field is an open-source project from Apache called Hadoop. The project develops a software library for reliable, scalable, distributed computing systems capable of handling the Big Data deluge, and provides the first viable platform for Big Data analytics. Hadoop is already used by most Big Data pioneers. For example, LinkedIn currently uses Hadoop to generate over 100 billion personalized recommendations every week.
What Hadoop does is to distribute the storage and processing of large data sets across groups or "clusters" of server computers using a simple programming model. The number of servers in a cluster can also be scaled easily as requirements dictate, from maybe 50 machines to perhaps 2000 or more. Whereas traditional large-scale computing solutions rely on expensive server hardware with a high fault tolerance, Hadoop detects and compensates for hardware failures or other system problems at the application level. This allows a high level of service continuity to be delivered from clusters of individual server computers, each of which may be prone to failure. Processing vast quantities of data across large, lower-cost distributed computing infrastructures therefore becomes a viable proposition.
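To make the idea of handling failure at the application level a little more concrete, the following is a minimal Python sketch. It is not Hadoop's actual code. It assumes that copies of each block of data are held on several servers, and it simply retries a task on the next server when one is unavailable. The node names, the FAILED_NODES set and the count_records function are all hypothetical and exist purely for illustration.

```python
# Hypothetical set of servers that are currently down (illustration only).
FAILED_NODES = {"node-a"}


def count_records(node, block_id):
    """Pretend task: count the records in one block stored on one node."""
    if node in FAILED_NODES:
        raise RuntimeError(f"{node} is unreachable")
    return 1000  # made-up record count for the sketch


def run_with_failover(task, block_id, replicas):
    """Run the task on the first healthy node that holds a copy of the block."""
    errors = []
    for node in replicas:
        try:
            return node, task(node, block_id)
        except RuntimeError as exc:
            # Record the failure in software and move on to the next copy,
            # rather than relying on fault-tolerant hardware.
            errors.append((node, str(exc)))
    raise RuntimeError(f"all replicas failed for {block_id}: {errors}")


node, result = run_with_failover(
    count_records, "block-7", ["node-a", "node-b", "node-c"]
)
print(node, result)
```

Here the first server is treated as failed, so the work quietly moves to the second. A real cluster also has to detect failures, re-create lost copies and reschedule work automatically, none of which this sketch attempts.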
Technically, Hadoop consists of two key elements. The first is the Hadoop Distributed File System (HDFS), which permits the high-bandwidth, cluster-based storage essential for Big Data computing. The second part of Hadoop is a data processing framework called MapReduce. Based on a programming model first published by Google, this distributes or "maps" large data sets across multiple servers. Each of these servers then performs processing on the part of the overall data set it has been allocated, and from this creates a summary. The summaries created on each server are then aggregated in the so-termed "Reduce" stage. This approach allows extremely large raw data sets to be rapidly pre-processed and distilled before more traditional data analysis tools are applied.
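The map-then-reduce pattern described above can be sketched in a few lines of Python. This is a single-process illustration of the idea only, and does not use the Hadoop API. Each inner list stands in for the portion of the data held on one server, and word counting is assumed as the example task. The function names and the sample text are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map stage: one 'server' summarizes its own portion as word counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce stage: merge two partial summaries into one."""
    return left + right


def word_count(partitions):
    """Summarize every partition, then aggregate the summaries."""
    summaries = [map_partition(lines) for lines in partitions]
    return reduce(reduce_counts, summaries, Counter())


# Each inner list plays the role of the data allocated to one server.
partitions = [
    ["big data is big", "data is messy"],
    ["hadoop processes big data"],
]
totals = word_count(partitions)
print(totals.most_common(3))
```

On a real cluster the map calls would run in parallel on different machines, and only the much smaller summaries would need to travel across the network to be combined.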
At present, many Big Data pioneers are deploying a Hadoop ecosystem alongside their legacy IT systems in order to allow them to combine old and new data in new ways. However, in time, Hadoop may be destined to replace many traditional data warehouse and rigidly-structured relational database technologies and to become the dominant platform for many kinds of data processing. More information on Hadoop can be found in this excellent blog post by Ravi Kalakota.
Many organizations are unlikely to have the resources and expertise to implement their own Hadoop solutions. Fortunately, they do not have to, as cloud solutions are already available. Offered by providers including Amazon, NetApp and Google, these allow organizations of all sizes to start benefiting from the potential of Big Data processing. Where public Big Data sets need to be utilized, running everything in the cloud also makes a lot of sense, as the data does not have to be downloaded to an organization's own system. For example, Amazon Web Services already hosts many public data sets. These include US and Japanese Census data, and many genomic and other medical and scientific Big Data repositories.
Looking further ahead, Big Data will progress in leaps and bounds as artificial intelligence advances, and as new types of computer processing power become available. For example, quantum computing may in the future greatly improve Big Data processing. Quantum computers store and process data using quantum mechanical states, and will in theory excel at the massively parallel processing of unstructured data. (For more information, please see the quantum computing section.)
While mining datasets measured in the terabytes, petabytes and even exabytes is technically challenging, it also offers significant opportunities. In fact, not that many years hence, Big Data techniques and technologies are likely to allow some kind of additional, secondary value to be generated from almost every piece of digital information that ever gets stored. As IBM explain, "Big Data . . . is an opportunity to find insight in new and emerging types of data, to make [businesses] more agile, and to answer questions that, in the past, were beyond reach". Or as Oracle put it in their "Integrate for Insight" white paper, Big Data "holds the promise of giving enterprises deeper insight into their customers, partners and business" -- including answers to questions "they may not have even thought to ask".
More specifically, and as O'Reilly Radar argues, Big Data has the potential to improve analytical insight, as well as allowing the creation of new products and services that were previously impossible. Pioneers such as Google, Amazon and Facebook have already demonstrated how Big Data can permit the delivery of highly personalised search results, advertising, and product recommendations. In time, Big Data may also help farmers to accurately forecast bad weather and crop failures. Governments may use Big Data to predict and plan for civil unrest or pandemics.
As NetApp explain, Big Data developments are fundamentally about creating new IT systems that are "systems of engagement" rather than just silos for data storage. For too long we have got used to putting data into computer systems for relatively little return. But by amalgamating and analyzing increasingly big datasets, we may start to get more value out of computer systems than we put in.
For example, by using Big Data techniques to analyze the 12 terabytes of tweets written every day, it is already becoming possible to conduct real-time sentiment analysis to find out how the world feels about things. Such a service is indeed already offered for free by Sentiment140.com. But this really is just the beginning, with Big Data offering all sorts of possibilities to augment and improve the services that organizations deliver to their customers.
In a recent report on Big Data, the McKinsey Global Institute estimated that the US healthcare sector could achieve $300 billion in efficiency and quality savings every year by leveraging Big Data, so cutting healthcare expenditures by about 8 per cent. Across Europe, they also estimate that using Big Data could save at least $149 billion in government administration costs per year. More broadly, in manufacturing firms, integrating Big Data across R&D, engineering and production may significantly reduce time to market and improve product quality.
While Big Data may undoubtedly provide all manner of organizations with a data stalking ability that many may fear, the positive implications of Big Data are likely to outweigh the negative possibilities. For example, Big Data may increase sustainability by improving traffic management in cities and permitting the smarter operation of electricity generation infrastructures.
In effect, using Big Data, we could start to run the world and allocate resources based on what we really need, not what we blindly guess people may in the near future demand. Or in other words, the more we can know and learn about human activities, the less we will need to go on producing and transporting goods to fill up retail outlets with things that people may not actually want. Although it is currently a corporate computing development, the Big Data movement may therefore find that it has a great many advocates in the years and decades ahead . . .