Azure Data Factory, a Hybrid Data Integration Service

Azure Data Factory is a cloud-managed, hybrid data integration service. It was built for complex hybrid integration projects and lets developers create data pipelines that manage extract-load-transform (ELT) and extract-transform-load (ETL) operations. The platform orchestrates and automates data movement and data transformation. With Azure Data Factory, developers can create and schedule pipelines: data-driven workflows that ingest data from disparate data stores. Azure Data Factory uses services such as Azure Machine Learning, Azure Data Lake Analytics, and Azure HDInsight (Hadoop and Spark) to process and transform the data.
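
As a concrete illustration, here is a minimal sketch that uses the azure-mgmt-datafactory Python SDK to authenticate and create an empty data factory. The subscription, resource group, factory name, and service principal credentials are all placeholders, not values from this article.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder identifiers -- substitute your own values.
subscription_id = "<subscription-id>"
rg_name = "my-resource-group"
df_name = "my-data-factory"

# Authenticate with a service principal (assumed to already exist).
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the data factory itself; pipelines are added to it later.
df_resource = Factory(location="eastus")
adf_client.factories.create_or_update(rg_name, df_name, df_resource)
```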


In addition, users can publish output data to Azure SQL Data Warehouse for business intelligence (BI) applications to consume. Azure Data Factory helps organize raw data into meaningful data stores and data lakes that support better business decisions.


The pipelines usually perform the following operations:


  • Connect and collect. The first step in building an information production system is to collect data from all sources and move it to a centralized location where it can be processed. Developers can use the Copy Activity to move data from cloud and on-premises data stores to a central data store in the cloud for analysis (see the first sketch after this list).
  • Transform and enrich. At this step, the collected data is processed and transformed with the help of compute services such as HDInsight Hadoop and Spark, Machine Learning, and Data Lake Analytics.
  • Publish. The business-ready data is loaded into Azure Cosmos DB, Azure SQL Database, Azure SQL Data Warehouse, or another analytics engine.
  • Monitor. Developers need to monitor the pipelines and scheduled activities to determine success and failure rates. They can use the monitoring features of the Microsoft Operations Management Suite, PowerShell, the REST API, and Azure Monitor (the second sketch after this list shows run monitoring through the SDK).
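
To make the connect-and-collect step concrete, the sketch below builds a minimal pipeline around a single Copy Activity that copies a blob from an input folder to an output folder. It reuses the adf_client, rg_name, and df_name from the earlier sketch; the storage connection string, dataset names, and paths are placeholders.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureStorageLinkedService, BlobSink, BlobSource,
    CopyActivity, DatasetReference, DatasetResource, LinkedServiceReference,
    LinkedServiceResource, PipelineResource, SecureString,
)

# Linked service: the connection information for the data store (placeholder key).
storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))
adf_client.linked_services.create_or_update(rg_name, df_name, "StorageLS", storage_ls)

# Datasets: named views over the input and output blob folders.
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="StorageLS")
ds_in = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="container/input", file_name="data.txt"))
ds_out = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="container/output"))
adf_client.datasets.create_or_update(rg_name, df_name, "BlobIn", ds_in)
adf_client.datasets.create_or_update(rg_name, df_name, "BlobOut", ds_out)

# Pipeline: one Copy Activity that reads the input dataset and writes the output.
copy = CopyActivity(
    name="CopyBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="BlobIn")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="BlobOut")],
    source=BlobSource(),
    sink=BlobSink(),
)
adf_client.pipelines.create_or_update(
    rg_name, df_name, "CopyPipeline", PipelineResource(activities=[copy]))
```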
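
For the monitor step, the same client can trigger an on-demand run of that pipeline and poll its status until it succeeds or fails; again, the pipeline name is the placeholder used above.

```python
import time

# Trigger an on-demand run of the pipeline defined above.
run_response = adf_client.pipelines.create_run(
    rg_name, df_name, "CopyPipeline", parameters={})

# Poll the run until it leaves the queued/in-progress states, then report.
while True:
    pipeline_run = adf_client.pipeline_runs.get(
        rg_name, df_name, run_response.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print(f"Pipeline run finished with status: {pipeline_run.status}")
```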


One Azure subscription can contain several data factories. Each data factory is made up of the following components that work together: pipelines, activities, datasets, triggers, linked services, parameters, pipeline runs, and control flow.
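
To show how some of these components fit together, the following sketch attaches a schedule trigger to the placeholder pipeline from the earlier sketches so that a pipeline run is created every hour; the trigger name and recurrence are illustrative assumptions.

```python
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

# A trigger determines when pipeline runs are kicked off; this schedule
# trigger fires every hour, starting shortly after creation (placeholder values).
recurrence = ScheduleTriggerRecurrence(
    frequency="Hour",
    interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=5),
)
trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="CopyPipeline"),
        parameters={},
    )],
))
adf_client.triggers.create_or_update(rg_name, df_name, "HourlyTrigger", trigger)

# Triggers are created in a stopped state; start one to begin scheduling runs.
adf_client.triggers.begin_start(rg_name, df_name, "HourlyTrigger").result()
```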