De-jargonizing the big buzz on Big Data
“Big Data” is the buzzword of the moment and one of the most searched phrases on Google. People from many walks of life, from young graduates to industry veterans, are trying to get a grip on this phenomenon called “Big Data”. The number of white papers, articles and opinion pieces going online on this topic is increasing rapidly. Most of these are written by renowned scholars in the field of information management, and hence they are pitched far above what the average reader can decipher. As a result, the common crowd misses out on a clear understanding of the Big Data phenomenon. This writing aims to fix exactly that: to present a simple and comprehensible view of Big Data for readers who find it difficult to dissect the scholarly articles on the subject.
Being a budding practitioner in the field of data management and analytics, I have been trying to get a holistic picture of the Big Data developments for quite some time. I read and re-read, pondered and pondered over long articles and white papers on Big Data. This writing is a culmination of whatever little I could take away from all those papers, presented in a simple and understandable format. While it should help you gain a starting, holistic understanding of the Big Data phenomenon, it is in no way intended to make you a master of the game.
What was the past?
With the advent of the information era, individuals and institutions started moving away from paper-based records to customized technology systems that capture data and make computations easy and fast. This was similar to the shift we made from paper mail to electronic mail (email) – be it Rediffmail or Yahoo Mail. This was the first shift.
As time progressed, the number of systems increased so much that organizations found it difficult to get a holistic picture across them. Such a picture was essential for top-management decision making, and hence came the era of data warehousing. It started back in the 1980s and became popular in the late ’90s, as institutions adopted data warehousing and business intelligence tools. Many of us attempted a kind of data warehousing too. We had multiple email IDs across Rediff, Yahoo, Hotmail and Gmail. Logging into all these email servers was time-consuming and tedious, so we started using tools like Thunderbird, POP3 clients or auto-forwards – which brought all the different email accounts to the same place.
With the blistering pace of e-commerce and technology, data grew from MBs to GBs to TBs. Similarly, our mailboxes grew from MBs to GBs. The type of data also started evolving: in addition to organized data, the advent of social networking, blogs and tweets gave rise to unstructured data. Streaming data also arrived with the advent of electronic trading, in the form of market updates, videos and so on. Facebook, Twitter, YouTube and the like started claiming most of our time – and hence most of our data generation. The accelerated growth of data at this scale made storage and maintenance of data very costly.
What is the Present?
In addition to the costs of maintaining data, the biggest question for institutions was ‘What do we do with all this stored data?’. When the Googles and IBMs of the world started thinking about getting some returns out of the large data in hand, at lesser cost – there started the era of Big Data.
First of all, what is Big Data? Is it the name of a piece of software or hardware used for handling data at big scale? Absolutely not. Big Data is a combination of hardware, firmware and software that can effectively handle massive amounts of data, in various forms, generated at very high speed. Hence our scholars define Big Data mainly by the 3Vs – Volume (of data), Variety (of data) and Velocity (of data). It is precisely these three characteristics that set Big Data apart from the traditional data handling tools, and hence we call it BIG DATA.
The first ‘V’, Volume, is self-explanatory and relates to the data that has been piling up in our institutions for the past decade or so. This build-up started with the information technology advancement of the late ’90s, and it has now grown to such a mammoth scale that institutions are looking for some way to utilize the data – and Big Data is seen as the right solution.

Next comes Variety, which refers to the types of data that Big Data is expected to handle. An easy way to decipher this is to remember three things: the format of our bank account statement, well labelled and organised into deposits and withdrawals – called structured data; the posts we type onto our Facebook wall, containing anything from links to poems to stories to hangout plans – called unstructured data (accounting for nearly 80% of generated data); and finally the stock prices that scroll along the bottom of the CNN-IBN channel – called streaming data.

Last of the 3Vs, Velocity, refers to the speed at which all the above types of data are generated. Every day almost 2,700,000,000 likes are registered on Facebook and 175,000,000 tweets go out on the web. Add to that the bloggers, photographers, searchers, e-shoppers, e-mailers and so on, who generate enormous amounts of data in no time. The traditional data warehouse falls under the structured type of data, and hence Big Data in no way eliminates the need for traditional data warehousing – it simply includes it in a bigger data set and takes it to the next level.
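To put the Velocity figures in perspective, a quick back-of-the-envelope calculation converts the daily totals quoted above into per-second rates. The daily figures come from the text; everything else here is simple arithmetic.

```python
# Convert the quoted daily totals into per-second rates.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

likes_per_day = 2_700_000_000   # Facebook likes per day (figure quoted above)
tweets_per_day = 175_000_000    # tweets per day (figure quoted above)

likes_per_second = likes_per_day / SECONDS_PER_DAY
tweets_per_second = tweets_per_day / SECONDS_PER_DAY

print(f"Facebook likes per second: {likes_per_second:,.0f}")   # ~31,250
print(f"Tweets per second:         {tweets_per_second:,.0f}")  # ~2,025
```

Over 31,000 likes and 2,000 tweets every single second – a rate at which storing first and analyzing later quickly becomes impractical.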
So far we have discussed the characteristics of data. Now, what do we do with these massive, humongous chunks of data? That is where analytics comes in. Analytics literally means analyzing the data and deriving insights. These insights could be statistics about the past, trends in the present or predictions about the future. Depending on the insights derived, analytics is classified into different types, viz. text analytics, content analytics, data mining, web analytics, social analytics, sentiment analytics, predictive analytics and so on. Stream computing is another advancing type of analytics where, because of the need for real-time insights, streams of data are analyzed before they are even stored, and recommendations are made on the fly.
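The stream-computing idea can be sketched in a few lines: each update is analyzed the moment it arrives, instead of being stored first and queried later. This is a toy illustration only – the ticker symbol, prices and alert threshold are made up, and real stream-computing platforms do this at massive scale.

```python
# A minimal sketch of stream computing: analyze each update as it
# arrives, before (or instead of) storing it. This toy version keeps
# a running average per stock symbol and flags unusually large moves.
from collections import defaultdict

class PriceStreamAnalyzer:
    def __init__(self, alert_threshold=0.05):
        self.totals = defaultdict(float)        # running sum of prices per symbol
        self.counts = defaultdict(int)          # number of updates per symbol
        self.alert_threshold = alert_threshold  # deviation that triggers an alert

    def on_update(self, symbol, price):
        """Process one tick of the stream and return an insight immediately."""
        self.totals[symbol] += price
        self.counts[symbol] += 1
        avg = self.totals[symbol] / self.counts[symbol]
        deviation = (price - avg) / avg if avg else 0.0
        if abs(deviation) > self.alert_threshold:
            return f"ALERT {symbol}: {deviation:+.1%} vs running average"
        return f"{symbol}: running average {avg:.2f}"

# Hypothetical ticks, for illustration only
analyzer = PriceStreamAnalyzer()
for symbol, price in [("ACME", 100.0), ("ACME", 101.0), ("ACME", 130.0)]:
    print(analyzer.on_update(symbol, price))
```

The key design point is that nothing is written to a database at all: the insight (the alert) is produced from state kept in memory, which is what distinguishes stream computing from store-then-analyze approaches.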
To link Big Data analytics back to our personal level, imagine yourself running a company that sells products online. You already have three email accounts where customers write their feedback, you have a Facebook and Twitter presence in social media, and you have a data warehouse that stores the data related to leads, sales and promotions for the past 5 years. You want to understand the customer value to company (CVC) based on past purchases, product affinity and the sentiments expressed through social media and emails. This CVC score will be used to generate customized promotions for customers, thereby cross-selling and up-selling the products. This is a clear case for Big Data analytics, applying the processes we discussed above.
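The CVC calculation described above could be sketched as a weighted combination of the three signals. To be clear, the weights, the normalization scales and the sample customers below are all hypothetical – a real CVC model would be fitted to the company's own data.

```python
# A toy sketch of the CVC (customer value to company) score described
# above. The three inputs mirror the text -- past purchases, product
# affinity, and sentiment from social media/email -- but the weights,
# scales and sample data are all hypothetical.
def cvc_score(total_purchases, affinity, sentiment,
              w_purchases=0.5, w_affinity=0.3, w_sentiment=0.2):
    """Combine three signals (affinity/sentiment in 0..1) into a 0..100 score."""
    purchase_signal = min(total_purchases / 10_000, 1.0)  # cap at $10k of spend
    combined = (w_purchases * purchase_signal
                + w_affinity * affinity
                + w_sentiment * sentiment)
    return round(100 * combined, 1)

# Hypothetical customers: (past spend, affinity 0..1, sentiment 0..1)
customers = {
    "alice": (8_000, 0.9, 0.8),
    "bob":   (1_500, 0.4, 0.3),
}
for name, signals in customers.items():
    print(name, cvc_score(*signals))  # alice 83.0, bob 25.5
```

The Big Data part is not this formula, of course, but producing its inputs: the affinity and sentiment numbers would come from mining the unstructured email and social media text, while the purchase total comes from the structured warehouse.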
This development in Big Data analytics did not happen without parallel advancement in software and hardware; both developed side by side, complementing Big Data. Software catered to the type of analytics – like SPSS for predictive analytics and InfoSphere Streams for stream computing (both IBM products). Hardware development focused on computation power and fine-tuning for bulk data handling – Netezza for structured data analytics and the Hadoop Distributed File System (HDFS) for unstructured data. Even as you read this, more and more research is being done in this area.
What is the Future?
Future developments in analytics attempt to mimic the human role in the data-driven decision-making process. Cognitive computing is the advancement that companies like IBM are aiming towards – machines capable of judgment, reasoning and decisions. IBM Watson is one attempt in this direction. The successful implementation of these advancements will determine how quickly these technologies break through and spread across different industries. It won’t be long before we see computers changing from computational devices into intelligent machines, as we see in Hollywood movies.