As you may remember, I’m writing these notes to share my progress and thoughts about the Microsoft Professional Certification Program on Big Data with you. In this post, I would like to tell you more about the Big Data architecture style. So let’s get started!
Basically, a big data architecture is designed to manage the processing and analysis of data flows that are too large or complex to be handled by traditional database systems. As you may know, most big data architectures include some or all of the following elements:
- Data sources such as relational databases, server log files, and data from IoT devices.
- Distributed data stores, often called data lakes, that can hold high volumes of large files in various formats for batch processing operations.
- Batch processing, which can be performed by running U-SQL jobs in Azure Data Lake Analytics or by using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster.
- Real-time message ingestion stores that act as a buffer for messages to support scale-out processing, reliable delivery, and other message queuing semantics. Options here include Azure Event Hubs, Azure IoT Hub, and Kafka.
- Stream processing of the ingested messages. Azure Stream Analytics provides this as a managed service based on constantly running SQL queries that operate on unbounded streams, but you can also use open source Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.
- Analytical data stores that hold historical data on business metrics so it can be queried with analytical tools. The analytical data store that serves these queries can be a Kimball-style relational data warehouse. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database. Azure SQL Data Warehouse, for example, provides a managed service for large-scale, cloud-based enterprise data warehousing.
- Analysis and reporting, which can be supported by a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. Analysis and reporting can also take the form of interactive data exploration with the help of the many Azure services that support analytical notebooks, such as Jupyter. For large-scale data analysis, you can use Microsoft R Server, either standalone or with Spark.
- Orchestration of repeated data processing operations can be performed by using Azure Data Factory or Apache Oozie and Sqoop.
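To make the stream processing element a bit more concrete, here is a minimal Python sketch of a tumbling-window average, the kind of aggregation a constantly running query computes over a stream. It is only an illustration and not how Azure Stream Analytics, Storm, or Spark Streaming work internally; the event shape and the name `tumbling_window_avg` are my own assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp_seconds, device_id, temperature)
Event = Tuple[int, str, float]


def tumbling_window_avg(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window_start, {device_id: average}) each time a window closes.

    Assumes events arrive in timestamp order; late or out-of-order
    events are not handled in this sketch.
    """
    current_start = None
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)

    for ts, device, value in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = ts - ts % window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, {d: sums[d] / counts[d] for d in sums}
            sums.clear()
            counts.clear()
            current_start = window_start
        sums[device] += value
        counts[device] += 1

    # Flush the last window; this is only reached when the input is finite.
    if current_start is not None:
        yield current_start, {d: sums[d] / counts[d] for d in sums}
```

Because the function is a generator, it emits each window's result as soon as the window closes instead of waiting for the whole input, which is the main difference from a batch job over the same data.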
When you should consider using this architecture.
You can think of applying this architecture style when you need to:
- Store and process data in volumes that exceed traditional database capabilities.
- Transform data to make it structured for further analysis and reporting.
- Capture, process, and analyze data flows in real time.
- Use Azure Machine Learning or Microsoft Cognitive Services.
Benefits you can get using this architecture.
- You can make technology choices depending on the project goals.
- You can enable high-performance solutions that scale to large volumes of data.
- You can adjust your solution to small or large workloads, and pay only for the resources that you use.
- You can create an integrated solution across data workloads.
Challenges you can face when working with this architecture.
- Building, testing, and troubleshooting big data processes can be quite challenging.
- Many big data technologies are highly specialized, and use frameworks and languages that are not typical of more general application architectures.
- Many technologies that are used in big data constantly evolve and introduce extensive changes and enhancements with each new release.
- Ensuring secure access to all the data in a centralized data store can be challenging, especially when the data must be accessed and processed by multiple applications and platforms.
Best practices you should follow while using this architecture.
- Leverage parallelism by storing data in splittable formats. This will help you optimize performance and reduce overall work time.
- Partition data to simplify data ingestion and job scheduling, which also makes it easier to troubleshoot failures.
- Apply schema-on-read semantics to build flexibility into the solution and prevent bottlenecks during data ingestion. Such bottlenecks are typically caused by data validation and type checking.
- Process data in place within the distributed data store to transform it to the required structure before moving it into an analytical data store.
- Balance utilization and time costs to make sure your resources are used in the most efficient way.
- Separate cluster resources by type of workload to achieve better performance.
- Orchestrate data ingestion to achieve results in a predictable and centrally manageable fashion.
- Scrub sensitive data early to avoid storing it in the data lake.
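Two of the practices above, partitioning and schema-on-read, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the folder layout, the `SCHEMA` mapping, and the function names `partition_path` and `read_with_schema` are hypothetical and do not come from any Azure SDK.

```python
import json
from datetime import datetime
from typing import Any, Callable, Dict, Iterable, Iterator


def partition_path(root: str, source: str, ts: datetime) -> str:
    """Build a date-partitioned folder path, e.g. root/source/2018/03/07."""
    return f"{root}/{source}/{ts:%Y/%m/%d}"


# Hypothetical schema: field name -> converter applied when reading.
SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "device_id": str,
    "temperature": float,
}


def read_with_schema(
    lines: Iterable[str], schema: Dict[str, Callable[[Any], Any]]
) -> Iterator[Dict[str, Any]]:
    """Apply the schema at read time to raw JSON lines.

    The raw lines are assumed to have been stored without validation;
    a field that is missing or cannot be converted becomes None.
    """
    for line in lines:
        raw = json.loads(line)
        record: Dict[str, Any] = {}
        for field, convert in schema.items():
            try:
                record[field] = convert(raw[field])
            except (KeyError, TypeError, ValueError):
                record[field] = None
        yield record
```

The idea is that ingestion only has to drop raw files into the right dated folder, so a failed job can be rerun for a single day, while type checking is deferred to the jobs that actually read the data.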
Thank you for your attention, guys, and stay tuned!