Big Data Analytics (BDA): Technology And Application Review


Introduction:

This is a STAR paper on Big Data Analytics (BDA) – i.e. a critical and reflective analysis of the current state of the art in the Big Data area. The core components of this paper are a technological review of the cutting-edge in BDA, as well as an application review that focuses on the application of BDA to one particular sector.

Big Data is a vague and loosely defined term; many academics and organisations have varying understandings of what actually constitutes ‘big’ data. However, it is commonly accepted that Big Data possesses, at minimum, a core suite of three traits. These traits, detailed by Doug Laney in 2001, are often referred to as ‘the 3 Vs’ (Kitchin & McArdle, 2016):


  • Volume: consisting of a large quantity of data
  • Velocity: generated at a high speed
  • Variety: comprised of many data types

Since then, others have ascribed additional traits to the definition of Big Data, with varying rates of adoption into the general consensus, including but not limited to (Kitchin & McArdle, 2016):

  • Value: including meaningful information
  • Veracity: varying in clarity/noise
  • Variability: shifting in meaning
  • Visualisation: can be visualised

There are no specific criteria that can categorically define Big Data as ‘big’ and not just regular data. To determine whether data is ‘big’, it must be considered relative to the means by which meaningful information can be extracted from it. For example, a petabyte-scale (PB) dataset from which meaningful information can be extracted in 2 hours may be considered less ‘big’ than a terabyte-scale (TB) dataset that takes longer to process.

The extraction of useful information from Big Data can be split into 5 stages, which themselves form 2 main sub-processes: Big Data Management and Big Data Analytics. Big Data Management comprises the stages which procure, extract and represent Big Data in preparation for analysis. Big Data Analytics comprises the processes that perform modelling and analysis, ultimately obtaining useful intelligence from the data. (Gandomi & Haider, 2015)

Technology Review:

Data comes in various formats, such as textual data, statistical data, multimedia and more. It is generated by a variety of sources, such as computers, mobile devices, social media, sensor networks, the Internet of Things, etc. All data falls into one of three main categories: structured, semi-structured or unstructured. It is widely estimated that at least 80% of the world's data is unstructured.

In 2018, the International Data Corporation (IDC) estimated that the total amount of data in the world was around 33 zettabytes (ZB). They predicted that by 2025, that amount would grow to 175 ZB. (Seagate, 2018)
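The scale of the IDC forecast can be made concrete by computing the growth rate it implies. A short sketch, using only the two figures quoted above (33 ZB in 2018, 175 ZB in 2025):

```python
# Implied compound annual growth rate (CAGR) from the IDC figures
# cited above: 33 ZB in 2018 growing to a forecast 175 ZB in 2025.

def cagr(start: float, end: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction."""
    return (end / start) ** (1 / years) - 1

rate = cagr(33, 175, 2025 - 2018)
print(f"Implied annual growth: {rate:.1%}")  # roughly 27% per year
```

At roughly 27% compound growth per year, the volume of data more than doubles approximately every three years, which illustrates why scalability dominates the technology review that follows.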

With so much complex data being generated seemingly exponentially, advanced BDA solutions must employ efficient, scalable and flexible technologies in order to tackle the ever-growing challenges that utilising Big Data provides. (Sivarajah, Kamal, Irani & Weerakkody, 2017)

Following is a breakdown of methods and technologies currently used within each stage of the Big Data processes, as outlined within Figure 1:

1. Acquiring Big Data for analysis first requires an accessible, distributed data source. This data source can be comprised of data of any structural type. Following the identification of a suitable data source, a data collection framework can be utilised in conjunction with an appropriate data collection protocol. The suitability of data collection protocols will depend on the structural type of the data to be acquired. Finally, a persistent data storage technology is required to record and store the newly acquired data. (Lyko, Nitzschke & Ngonga Ngomo, 2016)

Examples of data collection frameworks include Storm, Kafka and Flume. Examples of data collection protocols include Advanced Message Queuing Protocol (AMQP) and Java Message Service (JMS). Examples of persistent data storage technologies include cloud-based NoSQL databases and Hadoop Distributed File System (HDFS). (Lyko et al., 2016)
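The acquisition pipeline described above (distributed source → collection framework → persistent store) can be sketched in miniature. This is a toy, single-process model in which Python's standard library stands in for the real technologies: the queue plays the role a framework like Kafka or Flume would play, and a plain list stands in for HDFS or a NoSQL store; none of the names below are real APIs of those systems.

```python
import json
import queue

buffer = queue.Queue()  # stand-in for a collection framework (e.g. Kafka)
store = []              # stand-in for persistent storage (e.g. HDFS)

def collect(event: dict) -> None:
    """Accept an event from any source and enqueue it for storage,
    serialised the way a wire protocol such as AMQP would frame it."""
    buffer.put(json.dumps(event))

def persist_all() -> int:
    """Drain the buffer into the persistent store; return records written."""
    written = 0
    while not buffer.empty():
        store.append(json.loads(buffer.get()))
        written += 1
    return written

collect({"sensor": "s1", "value": 21.5})
collect({"sensor": "s2", "value": 19.8})
print(persist_all())  # 2
```

The decoupling shown here, where producers never talk to storage directly, is the key property the real frameworks provide at distributed scale.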

2. Data, especially Big Data, often includes ‘dirty’ data; i.e. poor-quality or erroneous data which distorts the results of data analysis algorithms. Data cleaning is the process of identifying dirty data within extracted data and attempting to correct it. Common approaches include: (Tang, 2014)

  • Heuristic-Based Cleaning: Compares and amends unclean data with a similar but consistent secondary dataset. Fast but inaccurate with a low rate of error repair.
  • User-Guided Cleaning: As ‘heuristic-based’ but with human validation. Higher accuracy and rate of error repair but slower and more expensive.
  • Confidence Values: Similar to ‘heuristic-based’ but with pre-defined confidence values and a confidence threshold dictating the error amendments. Moderate accuracy, rate of error repair and cost.
  • Fixing Rules: As ‘confidence values’ but with a threshold determined by a complex algorithm. Faster with higher accuracy and rate of error repair but much more expensive.
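The ‘confidence values’ approach above can be sketched as follows. This is a minimal, hypothetical illustration: each candidate repair carries a confidence score (however it was derived), and only repairs that clear a pre-defined threshold are applied. The records, repairs and scores are invented for the example.

```python
# Hypothetical sketch of confidence-value cleaning: apply a proposed
# repair only if its confidence clears a pre-defined threshold.

THRESHOLD = 0.8  # assumption: chosen in advance by the data owner

def clean(records: list, repairs: list) -> list:
    """Apply each proposed repair whose confidence >= THRESHOLD.

    repairs: list of (index, field, new_value, confidence) tuples.
    """
    for idx, field, new_value, confidence in repairs:
        if confidence >= THRESHOLD:
            records[idx][field] = new_value
    return records

records = [{"city": "Lndon"}, {"city": "Paris"}]
repairs = [(0, "city", "London", 0.95),   # high confidence: applied
           (1, "city", "Berlin", 0.40)]   # low confidence: rejected
clean(records, repairs)
print(records)  # [{'city': 'London'}, {'city': 'Paris'}]
```

The threshold is what trades accuracy against repair rate: raising it rejects more repairs (fewer false fixes, more residual errors), which is exactly the moderate-cost compromise described above.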

3. To represent Big Data in a consumable form, data must be aggregated; i.e. be reduced in volume by combining data similarities and removing redundancies. This results in a compact and summarised version of the original data which eases visualisation and analysis. Common methods include: (ur Rehman et al., 2016)

  • Network (Graph) Theory: Evaluates links between data. Reduces unstructured data into structured data of lower dimensionality.
  • Compression: Reduces the volume of data whilst maintaining its integrity and completeness. Compression methods include gzip, AST.
  • Data Deduplication: Removes redundant and duplicated data.
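Two of the aggregation methods above, deduplication and compression, can be sketched with the standard library alone. The sample records are invented; gzip here illustrates the lossless property claimed above (the compressed form decompresses back to identical bytes):

```python
import gzip
import json

# Invented sample records; note the exact duplicate of id 1.
records = [{"id": 1, "v": 10}, {"id": 1, "v": 10}, {"id": 2, "v": 12}]

# Data deduplication: keep only the first occurrence of each record,
# using a canonical serialisation as the identity key.
seen = set()
deduped = []
for r in records:
    key = json.dumps(r, sort_keys=True)
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Compression: reduce stored volume without losing information.
raw = json.dumps(deduped).encode()
packed = gzip.compress(raw)
assert gzip.decompress(packed) == raw  # integrity and completeness kept

print(len(deduped))  # 2
```

Together the two steps produce the compact, summarised form of the data that the aggregation stage exists to deliver.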

Following suitable aggregation, the data may be visualised. Data visualisation tools can be static, dynamic or interactive; however, Wang, Wang & Alexander (2015) found that interactive tools perform better than their static counterparts. Wang et al. (2015) also found that traditional data visualisation tools perform inadequately in comparison to dedicated Big Data visualisation tools. Examples of common Big Data visualisation tools include Hadoop Platfora, IBM Many Eyes and Tableau. Additionally, VR is becoming a powerful method of data visualisation, a method which Wang et al. (2015) claim will ‘facilitate Big Data visualisation greatly’.

4. To obtain valuable information from the data, it must undergo analysis. Various methods of data analysis can be performed, depending on the type of information desired: (Sivarajah et al., 2017)

  • Descriptive Analytics: data scrutiny to define the current state of a situation.
  • Inquisitive Analytics: data probing to rationalise a decision.
  • Predictive Analytics: data modelling to determine future possibilities.
  • Prescriptive Analytics: data testing for optimisation.
  • Pre-emptive Analytics: data analysis for precautionary measures.
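Predictive analytics, in its simplest form, is data modelling of exactly the kind sketched below: fit a model to past observations and extrapolate. This toy example uses ordinary least squares on invented data; real BDA deployments apply the same idea with far richer models at scale.

```python
# Toy instance of predictive analytics: fit a least-squares line to
# past observations and forecast one step ahead. Data is invented.

def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

history = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.0)]
a, b = fit_line([x for x, _ in history], [y for _, y in history])
forecast = a * 5 + b
print(forecast)  # approximately 9.95
```

Descriptive analytics would instead summarise `history` as it stands; the distinguishing feature of the predictive category is that the model is evaluated at a point not yet observed.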

5. Following analysis, analytical specialists may be required to interpret and translate the findings into layman’s terms for a non-specialised target audience. A challenge here is the shortage of specialists available to fulfil this role.

Application Review:

With almost every business sector in the world generating huge volumes of varied data at high speed, BDA is very much multidisciplinary. In this section, however, BDA will be reviewed with the medicine and healthcare industry as the focal area of application.

The medical industry generates a huge amount of heterogeneous data, from areas such as genomics, epigenomics, transcriptomics, etc. These areas produce raw data about complex biochemical and medical processes and are comprised of various data types. The nature of this data makes it difficult to manage and analyse with traditional methods, software and hardware. (Ristevski & Chen, 2018)

Big Data platforms, such as Apache Hadoop’s MapReduce module can be used to handle ‘big’ medical data through Massive Parallel Processing (MPP). This enables the application of data mining techniques such as anomaly detection, clustering, classification, etc. to analyse the complex heterogeneous data; potentially leading to the revelation of hidden patterns or other new knowledge. (Ristevski & Chen, 2018)
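The MapReduce pattern named above can be illustrated on a single machine. This is a deliberately minimal sketch of the pattern Hadoop distributes across many nodes, not Hadoop's actual API: map emits key/value pairs, a shuffle step groups them by key, and reduce combines each group. The patient records are invented.

```python
from collections import defaultdict

# Single-machine sketch of the MapReduce pattern: counting diagnosis
# codes across (invented) patient records.

def map_phase(records):
    """Map: emit one (key, value) pair per record."""
    for record in records:
        yield record["diagnosis"], 1

def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

records = [{"diagnosis": "flu"}, {"diagnosis": "asthma"},
           {"diagnosis": "flu"}]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'flu': 2, 'asthma': 1}
```

Because each map call touches one record and each reduce call touches one key, both phases parallelise naturally, which is what makes the pattern suitable for the Massive Parallel Processing of heterogeneous medical data described above.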

The application of BDA to medical data could be used to detect spreading diseases faster, generate insight into disease mechanisms, reveal health-based trends, improve treatment methods, etc. Additionally, the integration of processed medical data with data from other fields could lead to further discoveries. (Ristevski & Chen, 2018)

Some challenges of BDA within the medical industry include: (Ristevski & Chen, 2018)

  • The cost of Big Data and BDA implementations.
  • The difficulty of working with Big Data due to its inherent properties.
  • Variability in medical data due to its biological nature and variability in its environment.
  • Human error within data influenced by user entry, such as patient records.
  • A lack of enforced industry-wide standardisation.

Medical data is extremely sensitive, so privacy and security are paramount in any BDA implementation. Advanced encryption methods must be employed, and pseudonymisation applied to personal data. International laws may also restrict or outright prevent the use of some patient data. Any network used for BDA must be secure, and all access to the data must be authenticated. (Ristevski & Chen, 2018)
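One common form of pseudonymisation can be sketched with a keyed hash: the direct identifier is replaced by an HMAC digest, so records remain linkable for analysis but the identity cannot be recovered without the secret key. The key, identifier format and field names below are all invented for illustration, and a real deployment would pair this with proper key management plus encryption at rest and in transit, as required above.

```python
import hashlib
import hmac

# Assumption: a managed secret key exists; this literal is a placeholder.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed (HMAC-SHA256) digest."""
    return hmac.new(SECRET_KEY, identifier.encode(),
                    hashlib.sha256).hexdigest()

record = {"patient_id": "NHS-1234567", "result": "negative"}
safe_record = {"patient_ref": pseudonymise(record["patient_id"]),
               "result": record["result"]}

assert "patient_id" not in safe_record                 # identifier removed
assert pseudonymise("NHS-1234567") == safe_record["patient_ref"]  # linkable
print(safe_record["result"])  # negative
```

Using a keyed HMAC rather than a plain hash matters: without the key, an attacker cannot simply hash a list of known patient IDs and match them against the pseudonyms.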

A study by Wang, Kung, Wang & Cegielski (2018) found that the medical institutions which benefit the most from BDA are top research hospitals or top medical schools with high-profit affiliations. Their study did not find any small medical organisations that could afford the necessary Big Data technologies to enjoy the benefits of BDA. A point made by Wang et al. (2018) was that new IT adoption in the medical industry lags behind most other industries; likely influencing the number of medical organisations that can implement BDA at this stage.

Conclusion and Recommendations:

Since 2001, Big Data has been characterised by the 3Vs definition, with more attributes being associated with it as time goes on and the Big Data boom continues. The Internet of Things has further accelerated the need for BDA and continues to do so, forcing BDA technologies to scale rapidly to accommodate the Data Deluge.

Massive parallel processing platforms such as Apache Hadoop and Google BigQuery are essential in modern and future BDA applications. Artificial Intelligence, Machine Learning and Data Mining all play key parts in the extraction and aggregation of Big Data, whilst technologies such as Virtual Reality help represent the increasingly complex data visualisations.

A continued exponential increase in data generation is inevitable; it is essential to continue adapting and scaling the technologies at our disposal, or we risk losing the valuable intelligence extractable from the data we generate.

References:

  1. Kitchin, R. & McArdle, G. (2016). What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society, 3(1), DOI: 10.1177/2053951716631130
  2. Gandomi, A. & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), DOI: 10.1016/j.ijinfomgt.2014.10.007
  3. Seagate. (2018). The Digitization of the World From Edge to Core. Retrieved from https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
  4. Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research, 70, DOI: 10.1016/j.jbusres.2016.08.001
  5. Lyko, K., Nitzschke, M., & Ngonga Ngomo, A. C. (2016). Big Data Acquisition. New Horizons for a Data-Driven Economy, 1, 39-61. DOI: 10.1007/978-3-319-21569-3_4
  6. Tang, N. (2014). Big Data Cleaning. Asia-Pacific Web Conference, Changsha, China, 1, 13-24. DOI: 10.1007/978-3-319-11116-2_2
  7. ur Rehman, M. H., Liew, C. S., Abbas, A., Jayaraman, P. P., Wah, T. Y., & Khan, S. U. (2016). Big Data Reduction Methods: A Survey. Data Science and Engineering, 1(4), 265-284. DOI: 10.1007/s41019-016-0022-0
  8. Wang, L., Wang, G., & Alexander, C. A. (2015). Big Data and Visualization: Methods, Challenges and Technology Progress. Digital Technologies, 1(1), 33-38. Retrieved from https://pdfs.semanticscholar.org/2975/4e4295a9ce4d51937c0712d6482634474628.pdf
  9. Ristevski, B. & Chen, M. (2018). Big Data Analytics in Medicine and Healthcare. Journal of Integrative Bioinformatics, 15(3), DOI: 10.1515/jib-2017-0030
  10. Wang, Y., Kung, L., Wang, W. Y. C., & Cegielski, C. G. (2018). An integrated big data analytics-enabled transformation model: Application to health care. Information & Management, 55(1), 64-79. DOI: 10.1016/j.im.2017.04.001


