Streaming Data Technology and Big Data Application Analysis

Preface

[March 05, 2015] We live in a digital world. Information and communication technology (ICT) is closely tied to our lives and affects them in important ways. Looking more closely at this digital world, we find that, because of the Internet, the variety, volume, and velocity of data are all growing explosively. IDC predicts that the amount of data generated in the digital world will surge to 8 zettabytes (see Note 1) in 2015. This data explosion has been described as the Big Data Problem. The real difficulty is how to collect all of this large and complex data, how to filter and aggregate it, and finally how to transmit it to a Big Data processing platform. Given this growth trend, existing application systems that read and analyze Big Data will soon run into difficulty as the Big Data Problem itself becomes even bigger.

At present, government and business organizations face the information technology challenge of handling data volumes measured in terabytes and petabytes. IDC estimates that the amount of information in the world will double every two years. In 10 years, the number of servers worldwide will have grown to 10 times the current figure, the information stored in enterprise data centers will have grown to 50 times today's amount, and the processing of data files will have grown to 75 times today's level. These data come mainly from Internet search, social media, mobile devices, the Internet of Things (IoT), commercial transactions, scientific experiments, and content dissemination. Given the diversity of data sources, the changing speed of data generation, and the sheer amount of data, the need for dynamic, real-time collection and integration of streaming data will keep growing, especially considering the killer application we call IoT. In this scenario, the processing of streaming data is a critically important subject.

In the Big Data field, streaming big data consists of messages and events created by various real-time data sources that generate data continuously. This continuous accumulation produces today's Big Data Problem. The challenge is how to collect and process this huge volume of data continuously, yet efficiently and cost-effectively. It is made greater by the fact that data is collected from many sources and must then be transmitted rapidly to stable processing platforms. Further, these platforms must support a mixture of technologies. Solutions for processing streaming big data need to meet the requirements listed in the next section.

Streaming Big Data

Streaming Big Data consists of messages and events that different real-time data sources produce continuously, accumulating into a huge volume of data. Collecting and integrating these messages and events as streaming data is the better approach, but the main challenge is how to process such a large amount of continuously arriving data, and how to collect data from different sources efficiently and cost-effectively and transmit it to fast, stable processing platforms built on different technologies. A solution for processing Streaming Big Data needs to meet the following requirements.

High performance and real-time processing capability, which can simultaneously handle large numbers of data streams.
Dramatic scalability, which can meet future demand for the expected explosive growth of big data.
Data collection that handles various data formats and supports a variety of data sources and destination platforms.
High availability, where an exception automatically triggers near-zero-latency failover so that data delivery is guaranteed.
High transmission efficiency, where data in transit (in flight) avoids unnecessary copying and storage.
Support for simple settings, deployment, management, and monitoring mechanisms, which are configurable with a graphical interface.

Present Day Big Data Streaming Solutions

Apache Chukwa

Chukwa is a Hadoop subproject that focuses on large-scale log collection and analysis. It uses a WAL (Write Ahead Log) architecture and is built on HDFS (Hadoop Distributed File System) and the MapReduce framework, so it inherits Hadoop's scalability and stability. To make the best use of the collected data, Chukwa also includes powerful tools for data analysis and monitoring.
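The write-ahead idea itself can be illustrated with a short sketch. This is not Chukwa's code: the class `WriteAheadLog`, the file name `collector.wal`, and the sample records are all hypothetical, and a local file stands in for HDFS. The sketch only shows the principle that a record is made durable before anything acts on it, so that it can be replayed after a restart.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log: a record is made durable before it is used."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Write the record as one JSON line and force it to disk
        # before the caller is allowed to act on it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # After a crash or restart, re-read every logged record in order.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def demo_wal():
    path = os.path.join(tempfile.mkdtemp(), "collector.wal")
    wal = WriteAheadLog(path)
    wal.append({"host": "web-1", "msg": "GET /index.html 200"})
    wal.append({"host": "web-2", "msg": "GET /missing 404"})
    # Simulate a restart: a fresh object recovers the records from disk.
    return WriteAheadLog(path).replay()


recovered = demo_wal()
```

Because every record reaches the disk before it is processed, a collector built this way can lose its in-memory state and still resume from the log.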

Apache Flume

Flume collects and aggregates large amounts of log data from various data sources and moves that data into a single data store. Flume uses a distributed WAL (Write Ahead Log) architecture that provides high reliability and high availability.

Apache Kafka

Originally developed at LinkedIn, Kafka is now an Apache project. It is a distributed, highly scalable system with a broker-based, store-and-forward design. It supports the publish/subscribe mechanism, provides high throughput, supports multiple subscribers, and automatically balances load across consumers. Kafka also supports persistence: data is stored on disk, which in turn allows batch operations on the stored data.
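To make the broker-based publish/subscribe pattern concrete, here is a minimal in-memory sketch. It does not use the Kafka client API: `MiniBroker`, the topic name, and the subscriber names are hypothetical, and a Python list stands in for on-disk storage. It shows the broker storing each message once while every subscriber reads forward from its own position, in batches.

```python
from collections import defaultdict


class MiniBroker:
    """Toy broker (an assumption, not Kafka): messages are stored per topic
    and each subscriber reads forward from its own offset."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> stored messages
        self.offsets = defaultdict(int)   # (topic, subscriber) -> next index

    def publish(self, topic, message):
        # Store first; delivery happens later, when a subscriber polls.
        self.topics[topic].append(message)

    def poll(self, topic, subscriber, max_messages=10):
        # Return the next batch for this subscriber and advance its offset.
        start = self.offsets[(topic, subscriber)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(topic, subscriber)] = start + len(batch)
        return batch


broker = MiniBroker()
for i in range(3):
    broker.publish("logs", f"event-{i}")

first_for_a = broker.poll("logs", "subscriber-a", max_messages=2)
all_for_b = broker.poll("logs", "subscriber-b")
rest_for_a = broker.poll("logs", "subscriber-a")
```

Keeping an offset per subscriber is what lets several subscribers consume the same stored stream independently and at different speeds.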

Facebook Scribe

Scribe aggregates, in real time, large volumes of server log data as it is generated. It uses a highly scalable, broker-based, store-and-forward design, so scaling up does not affect the client side. Its high-availability capabilities also keep it from being affected by network or equipment failures.
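A store-and-forward node can be sketched in a few lines. This is an illustration of the general pattern, not Scribe's implementation: `StoreAndForward`, `send_to_central`, and the simulated network flag are all hypothetical. The node buffers messages locally while the next hop is unreachable and forwards them in order once it comes back.

```python
from collections import deque


class StoreAndForward:
    """Hypothetical node: buffer locally, forward when the next hop is reachable."""

    def __init__(self, send):
        self.send = send        # callable that raises ConnectionError on failure
        self.buffer = deque()

    def log(self, message):
        self.buffer.append(message)
        self.flush()

    def flush(self):
        # Forward in order; stop at the first failure and keep the rest buffered.
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return False
            self.buffer.popleft()
        return True


delivered = []
network = {"up": False}   # simulated link state, for illustration only


def send_to_central(message):
    # Hypothetical next hop; fails while the simulated network is down.
    if not network["up"]:
        raise ConnectionError("central server unreachable")
    delivered.append(message)


node = StoreAndForward(send_to_central)
node.log("line 1")        # buffered: the network is down
node.log("line 2")        # buffered as well
network["up"] = True
node.log("line 3")        # network is back: all buffered lines are forwarded in order
```

The client only ever calls `log`, which is why an outage further along the path does not have to be visible on the client side.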

Fluentd

Fluentd is an open-source data collector that collects and processes log data across an organization. Fluentd handles log data as JSON (JavaScript Object Notation), and its primary feature is that users can customize and modify its functions according to their needs.
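As a rough illustration of treating log data as JSON, the sketch below turns a raw log line into a structured event made of a tag, a time, and a record. It is plain Python rather than a Fluentd plugin, and the log layout, `LINE_PATTERN`, `to_event`, and the tag `web.access` are assumptions made for the example.

```python
import json
import re

# Assumed access-log layout, for illustration only: host, method, path, status.
LINE_PATTERN = re.compile(
    r"(?P<host>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)


def to_event(tag, timestamp, line):
    """Turn one raw log line into a tag/time/record event that serializes to JSON."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        # Keep unparsed lines as a plain message rather than dropping them.
        record = {"message": line.strip()}
    else:
        record = match.groupdict()
        record["status"] = int(record["status"])
    return {"tag": tag, "time": timestamp, "record": record}


event = to_event("web.access", 1425513600, "10.0.0.7 GET /index.html 200")
encoded = json.dumps(event, sort_keys=True)
```

Once every source emits records of this shape, downstream filters and outputs can work on named fields instead of re-parsing text.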

In the commercial software industry, vendors including IBM, Informatica, SAP, Splunk, Tibco, and others have for many years focused on streaming data solutions in the field of big data. By comparison, the solutions described above (such as Kafka, Flume, and Scribe) focus more on handling log data, whereas commercial software products support a more extensive range of sources and formats and offer more powerful functions and higher transmission efficiency. In particular, Informatica VDS is based on the Ultra Messaging products, whose low-latency information exchange gives it significant advantages in performance and usability.

Conclusions

Solutions built on streaming data technology give us the most current, real-time data and let us respond immediately, so that we can take measures to maximize operational efficiency and reduce operational risk.

There are many fine examples. In the financial industry, firms place orders and control risk based on value-added real-time market data. In the transportation industry, streaming data technology shows the current efficiency of the transport network and any abnormal events, so that resources can be allocated and action taken appropriately. Airlines, knowing the operating state and position of their aircraft in flight, can issue preventive warnings in real time and repair and maintain each aircraft as soon as it lands. The telecommunications industry obtains operational information about customers, accounts, networks, and services in real time. The manufacturing industry deploys remotely monitored factory equipment to maintain optimal production efficiency. The energy industry uses smart meters to monitor electricity usage trends, anomalies, and alert patterns in real time.

(Note 1) 1 zettabyte = 1 billion terabytes.

1 zettabyte = 1,000 exabytes; 1 exabyte = 1,000 petabytes; 1 petabyte = 1,000 terabytes.
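The unit relations in this note can be checked with a few lines of arithmetic. The constant names below are chosen for this sketch, and decimal (SI) units are assumed, as in the note.

```python
# Decimal (SI) storage units expressed in bytes.
TERABYTE = 10**12
PETABYTE = 1_000 * TERABYTE
EXABYTE = 1_000 * PETABYTE
ZETTABYTE = 1_000 * EXABYTE

# 1 zettabyte expressed in terabytes: 1,000 * 1,000 * 1,000 = 1 billion.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# The 8 zettabytes predicted for 2015, expressed in terabytes.
predicted_2015_terabytes = 8 * terabytes_per_zettabyte
```

On these definitions, the 8 zettabytes cited in the preface correspond to 8 billion terabytes.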

References

1. R. Ranjan, “Streaming Big Data Processing in Datacenter Clouds”. IEEE Cloud Computing 1(1): 78-83 (2014).

2. 2011 IDC Digital Universe Study

3. http://rajivranjan.net/research-directory/big-data-management-and-processing-data-centre-clouds/

4. https://chukwa.apache.org/

5. http://flume.apache.org/

6. http://www.fluentd.org/architecture

7. http://kafka.apache.org/

8. http://en.wikipedia.org/wiki/Scribe_(log_server)

9. http://www.informatica.com