What is Data?
Data is the quantities, characters, or symbols on which operations are performed by a computer. It may be stored and transmitted in the form of electrical signals, and recorded on magnetic, optical, or mechanical recording media.
What is Big Data?
Big Data is also data, but of enormous size. It is a term used to describe collections of data that are huge in volume and yet growing exponentially with time. In short, such data is so large and complex that no traditional data management tool can store or process it efficiently.
Examples Of Big Data:
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day.
This data is generated mainly through photo and video uploads, message exchanges, comments, etc. Facebook's systems process 2.5 billion pieces of content and 500+ terabytes of data each day, pull in 2.7 billion Like actions and 300 million photos per day, and scan roughly 105 terabytes of data every half hour.
Characteristics Of Big Data:
(i) Volume — The name Big Data itself relates to enormous size. The size of data plays a crucial role in determining its value, and whether a particular dataset can be considered Big Data at all depends on its volume. Hence, ‘Volume’ is one characteristic that must be considered when dealing with Big Data.
(ii) Variety — The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and to the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources most applications considered. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses issues for storing, mining, and analyzing data.
(iii) Velocity — The term ‘velocity’ refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines its real potential. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Value — Unless the big data we have can be transformed into something valuable, it is useless. It is important to weigh the cost of the resources and effort invested in collecting big data against how much value it provides at the end of data processing. Value matters because it is what runs the business, shaping business decisions and providing a competitive advantage.
(v) Veracity — Veracity refers to the quality of data. Because data comes from so many different sources, it is difficult to link, match, cleanse, and transform it across systems. Businesses need to connect and correlate relationships, hierarchies, and multiple data linkages; otherwise, their data can quickly spiral out of control.
Problems faced:
1. When do we find Volume as a problem:
A quick web search reveals that a decent 10 TB hard drive runs at least $300. A petabyte needs 100 such drives, so that’s 100 x $300 = $30,000 USD. Maybe you’ll get a discount, but even at 50% off you’re still paying $15,000 USD in storage costs alone. And if you want to keep a redundant copy of the data for disaster recovery, you need even more disk space. Hence, volume becomes a problem when data grows beyond normal limits and storing it on local storage devices becomes inefficient and costly.
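The back-of-the-envelope arithmetic above can be sketched as a quick calculation, using the example figures from the text (drive price and capacity are illustrative, not current market prices):

```python
# Rough cost of storing one petabyte on commodity 10 TB drives,
# using the article's example figure of $300 per drive.
DRIVE_CAPACITY_TB = 10
DRIVE_PRICE_USD = 300
PETABYTE_TB = 1000  # 1 PB = 1000 TB

drives_needed = PETABYTE_TB // DRIVE_CAPACITY_TB   # 100 drives
base_cost = drives_needed * DRIVE_PRICE_USD        # $30,000
discounted = base_cost * 0.5                       # $15,000 even at 50% off

# A redundant copy for disaster recovery doubles the bill.
with_backup = base_cost * 2

print(drives_needed, base_cost, discounted, with_backup)
```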
Amazon Redshift, a managed cloud data warehouse service from AWS, is one popular option for storage.
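A typical Redshift workflow stages files in Amazon S3 and then issues a COPY command to load them into a table. The sketch below just builds such a statement; the bucket, table, and IAM role names are hypothetical placeholders:

```python
# Sketch: a Redshift load usually means staging files in S3 and running COPY.
# All resource names below are made-up examples, not real AWS resources.
def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for gzipped JSON data staged in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS JSON 'auto' GZIP;"
    )

sql = build_copy_statement(
    "trades",
    "s3://example-bucket/trades/2024/",
    "arn:aws:iam::123456789012:role/RedshiftLoadRole",
)
print(sql)
```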
2. When do we find Velocity as a problem:
High-velocity data sounds great because velocity x time = volume, volume leads to insights, and insights lead to money. However, this path to growing revenue is not without its costs. How do you inspect every packet of data that comes through your firewall for maliciousness? How do you process such high-frequency structured and unstructured data on the fly? Moreover, high-velocity data almost always means large swings in the amount of data processed every second; tweets on Twitter are far more active during the Super Bowl than on an average Tuesday. How do you handle that?
Fortunately, “streaming data” solutions have cropped up to the rescue. The Apache Software Foundation offers popular options such as Spark and Kafka: Spark handles both batch processing and stream processing, while Kafka runs on a publish/subscribe mechanism.
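To make the publish/subscribe idea concrete, here is a toy in-memory broker illustrating the mechanism Kafka is built on: producers publish messages to named topics, and every subscriber of a topic receives them. This is a sketch of the concept only; real Kafka adds partitioning, persistence, and consumer groups:

```python
# Toy publish/subscribe broker: producers publish to named topics,
# subscribers receive every message for the topics they follow.
from collections import defaultdict

class MiniBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of this topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = MiniBroker()
received = []
broker.subscribe("tweets", received.append)
broker.publish("tweets", {"user": "a", "text": "big game!"})
broker.publish("logs", {"level": "info"})  # no subscriber: dropped
print(received)  # only the "tweets" message arrives
```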
3. When do we find Variety as a problem:
When consuming a high volume of data, the records can arrive in different formats (JSON, YAML, XML) that must be massaged into a uniform type before being stored in a data warehouse. Processing becomes even more painful when data columns or keys are not guaranteed to exist forever, as when an API renames, introduces, or deprecates keys. So not only are you squeezing a variety of formats into one uniform type, but the formats themselves can change over time.
One way to deal with a variety of data types is to record every transformation milestone applied along your data processing pipeline. First, store the raw data as-is in a data lake (a hyper-flexible repository of data collected and kept in its rawest form, such as Amazon S3 file storage). Then transform the raw data, in all its differing formats, into an aggregated and refined state.
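A minimal sketch of this "store raw, then refine" pattern: each record is kept untouched (the data-lake stage), then normalized into one schema even when sources disagree on key names. The record shapes and key names here are hypothetical examples:

```python
# Sketch of a two-stage pipeline: ingest raw, then refine to a uniform schema.
import json

raw_lake = []  # stands in for object storage such as Amazon S3

def ingest(raw_record: str):
    """Stage 1: append the untouched raw record to the lake."""
    raw_lake.append(raw_record)

def refine(raw_record: str) -> dict:
    """Stage 2: normalize differently named keys into one schema."""
    data = json.loads(raw_record)
    return {
        "user": data.get("user") or data.get("username") or "unknown",
        "text": data.get("text") or data.get("message") or "",
    }

ingest('{"user": "alice", "text": "hello"}')
ingest('{"username": "bob", "message": "hi"}')  # older API: renamed keys
refined = [refine(r) for r in raw_lake]
print(refined)
```

Because the raw records survive in the lake, the `refine` step can be rerun whenever the target schema changes.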
4. When do we find Veracity as a problem:
Consider the case of tweets on Twitter, with their hashtags, abbreviations, typos, and colloquial speech. Such data carries a lot of messiness, or noise, and as the volume of data increases the noise grows with it, sometimes exponentially. The noise reduces overall data quality, affecting data processing and, later on, management of the processed data.
If the data is not sufficiently trustworthy, it becomes important to extract only the high-value data; it doesn’t always make sense to collect all the data you can, because doing so is expensive and takes extra effort. Filter out noise as early as possible in the data processing pipeline, during data extraction. This leaves only the required, trustworthy data, which can then be transformed and loaded for data analytics.
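An early-filtering step might look like the sketch below, which drops records that fail simple quality checks. The thresholds and rules are illustrative assumptions, not a production cleaning policy:

```python
# Sketch: filter noisy records early in the pipeline, keeping only text
# that survives simple quality checks (thresholds are illustrative).
def is_high_value(text: str, min_words: int = 3) -> bool:
    """Reject records that are too short or mostly hashtags/mentions."""
    words = text.split()
    if len(words) < min_words:
        return False
    noise = sum(1 for w in words if w.startswith(("#", "@")))
    return noise / len(words) < 0.5  # more signal than noise

tweets = [
    "#wow #win #goal",                        # pure hashtag noise
    "gr8",                                    # too short to be useful
    "The home team scored in overtime #win",  # keeps useful text
]
clean = [t for t in tweets if is_high_value(t)]
print(clean)  # only the third tweet survives
```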
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
· Determining root causes of failures, issues and defects in near-real time.
· Generating coupons at the point of sale based on the customer’s buying habits.
· Recalculating entire risk portfolios in minutes.
· Detecting fraudulent behavior before it affects your organization.
Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its nodes. The framework is managed by the Apache Software Foundation and is licensed under the Apache License 2.0.
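The MapReduce model mentioned above can be illustrated with the classic word-count example: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a single-process sketch of the model only; Hadoop distributes these phases across cluster nodes:

```python
# Single-process word count in the MapReduce style.
from collections import defaultdict

def map_phase(document: str):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate each key's values into a final count.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insight", "data at scale"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'at': 1, 'scale': 1}
```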
Why is Hadoop important?
· Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
· Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
· Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
· Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
· Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
· Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
In today’s age, constant streams of high-volume real-time data flow from devices such as smartphones, IoT devices, and laptops. These streams form Big Data, and the five V’s are the characteristics that help you identify what to consider as the data influx scales. Big data plays an instrumental role in fields like artificial intelligence, business intelligence, data science, and machine learning, where data processing (extraction, transformation, loading) leads to new insights, innovation, and better decision-making. Breaking big data down also gives those who analyze data before making decisions a competitive advantage over those who run their business on traditional data alone. Solutions like Amazon Redshift can provide an edge over relational databases for data warehousing, while Spark and Kafka are promising solutions for continuously streaming data into data warehouses.
That’s all folks!!