Azure terms
language code identifier (LCID) The LCID identifies the language that the data uses.
underutilization An organization will be charged for a service that a cloud administrator provisions but doesn't use. This scenario is called underutilization.
Azure Cosmos DB emulator Developers can use tools like the Azure Cosmos DB emulator or the Azure Storage emulator to develop and test cloud applications without incurring production costs.
lift and shift When moving to the cloud, many customers migrate from physical or virtualized on-premises servers to Azure Virtual Machines. This strategy is known as lift and shift. Server administrators lift and shift an application from a physical environment to Azure Virtual Machines without re-architecting the application.
SQL Server professionals vs Data engineers SQL Server professionals generally work only with relational database systems. Data engineers also work with unstructured data and a wide variety of new data types, such as streaming data.
ETL vs ELT In extract, transform, and load (ETL), data is extracted from a structured or unstructured data pool and migrated to a staging data repository. You might have to transform the data from the source schema to the destination schema and then load the data into a data warehouse. The disadvantage of ETL is that the transformation stage can take time and can potentially tie up source system resources. In extract, load, and transform (ELT), data is immediately extracted and loaded into a large data repository such as Azure Cosmos DB or Azure Data Lake Storage. This change in the order of operations reduces resource contention on the source systems.
ETL starts with extraction. During the extraction process, data engineers define the data source: the resource group, subscription, and identity information such as a key or secret. They also define the data to be extracted, for example by using a database query, a set of files, or an Azure Blob storage container name.
Data transformation operations can include splitting, combining, deriving, adding, removing, or pivoting columns, and mapping fields between the data source and the data destination. You might also need to aggregate or merge data. The load phase involves defining the destination, starting the job, and monitoring the job. A related process is extract, load, transform, and load (ELTL); the difference is that ELTL adds a final load into a destination system.
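As a rough illustration of the transformation operations listed above (the sample data and column names are made up), a small pandas sketch showing splitting, deriving, and pivoting columns:

```python
import pandas as pd

# Hypothetical sales extract, used only to illustrate common transform steps.
df = pd.DataFrame({
    "customer": ["Ana Silva", "Bo Chen"],
    "region": ["EU", "APAC"],
    "amount": [120.0, 80.0],
})

# Splitting: break one column into two.
df[["first_name", "last_name"]] = df["customer"].str.split(" ", n=1, expand=True)

# Deriving: compute a new column from existing ones.
df["amount_with_tax"] = df["amount"] * 1.2

# Pivoting: reshape rows into columns, aggregating amounts per region.
pivot = df.pivot_table(values="amount", index="last_name", columns="region", aggfunc="sum")
print(pivot)
```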
Structured vs Nonstructured data In relational database systems like Microsoft SQL Server, Azure SQL Database, and Azure SQL Data Warehouse, data structure is defined at design time. Data structure is designed in the form of tables. Relational systems typically use a querying language such as Transact-SQL (T-SQL). Examples of nonstructured data include binary, audio, and image files. Nonstructured data is stored in nonrelational systems, commonly called unstructured or NoSQL systems. In nonrelational systems, the data structure isn't defined at design time, and data is typically loaded in its raw format. The data structure is defined only when the data is read. Nonrelational systems can also support semistructured data such as JSON file formats.
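A minimal sketch of the difference, with made-up names: the relational table's shape is fixed up front (schema-on-write) and queried with T-SQL, while the JSON document is stored raw and its structure is interpreted only when read (schema-on-read):

```python
import json

# Structured: the schema is defined at design time and queried with T-SQL.
create_table_tsql = """
CREATE TABLE dbo.Customer (
    CustomerId INT PRIMARY KEY,
    Name       NVARCHAR(100),
    Country    NVARCHAR(50)
);
"""

# Semistructured: the document is stored as-is; its structure is only
# interpreted when the data is read.
raw_document = '{"customerId": 1, "name": "Ana Silva", "orders": [{"id": 42, "amount": 120.0}]}'
customer = json.loads(raw_document)
print(customer["orders"][0]["amount"])
```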
Holistic data engineering Organizations are changing their analysis types to incorporate predictive and preemptive analytics. Because of these changes, as a data engineer you should look at data projects holistically. Data professionals used to focus on ETL, but developments in data platform technologies lend themselves to an ELT approach.
Design data projects in phases that reflect the ELT approach:
- Source: Identify the source systems to extract from.
- Ingest: Identify the technology and method to load the data.
- Prepare: Identify the technology and method to transform or prepare the data.
- Analyze: Identify the technology and method to analyze the data.
- Consume: Identify the technology and method to consume and present the data.
Azure Storage Azure Storage accounts are the base storage type within Azure. Azure Storage can also provide a messaging store for reliable messaging, or it can act as a NoSQL store. You can use Azure Storage as the storage basis when you're provisioning a data platform technology such as Azure Data Lake Storage or HDInsight.
Azure Storage offers four configuration options:
- Azure Blob
- Azure Files
- Azure Queue
- Azure Table
Data ingestion To ingest data into your system, use Azure Data Factory, Storage Explorer, the AzCopy tool, PowerShell, or Visual Studio. To import files larger than 2 GB (the limit of the File Upload feature), use PowerShell or Visual Studio instead. AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB.
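Besides those tools, you can also ingest programmatically with the Azure SDK. A minimal sketch with the azure-storage-blob package (the connection string, container, and file names are hypothetical):

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string; in practice keep it in a secret store.
CONN_STR = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(CONN_STR)
container = service.get_container_client("raw-landing-zone")

# Upload the extracted file as-is; very large transfers are better served by AzCopy.
with open("sales_extract.csv", "rb") as data:
    container.upload_blob(name="sales/2024/sales_extract.csv", data=data, overwrite=True)
```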
Cosmos DB A planet-scale database and the evolved successor to DocumentDB.
- Planet scale
- NoSQL
- JSON document store
- Multiple APIs
- Strong consistency: read clients aren't allowed to read data that hasn't been committed to all geographic regions.
- Bounded staleness: reads can lag behind writes by at most a configured window, for example 2 hours; setting the window to 0 makes it effectively strong consistency.
- Session consistency: clients in the same session (typically the writer) read their own fresh writes; other clients see eventual consistency.
- Consistent prefix: reads in every region see writes in the same order they were committed, never out of sequence.
- Eventual consistency: no ordering or freshness guarantees; replicas converge over time.
- SQL API
- Mongo API
- Graph API
- Table API
- Cassandra
- Database
- Collection
- Documents (JSON rows)
Failovers Manual or automatic. If the write region fails, a read region can be made the write region; when there are multiple read regions, you can assign failover priorities.
Data ingestion To ingest data into Azure Cosmos DB, use Azure Data Factory, create an application that writes data into Azure Cosmos DB through its API, upload JSON documents, or directly edit the document.
Security Azure Cosmos DB supports data encryption, IP firewall configurations, and access from virtual networks. Data is encrypted automatically. User authentication is based on tokens, and Azure Active Directory provides role-based security.
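A minimal sketch of the "write through its API" option, using the azure-cosmos Python SDK (the endpoint, key, and names are hypothetical):

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")

database = client.create_database_if_not_exists(id="retail")
container = database.create_container_if_not_exists(
    id="orders", partition_key=PartitionKey(path="/customerId")
)

# Documents are JSON; upsert_item inserts the document or replaces it by id.
container.upsert_item({
    "id": "order-42",
    "customerId": "cust-7",
    "amount": 120.0,
})
```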
Azure Synapse Analytics is a cloud-based data platform that brings together enterprise data warehousing and Big Data analytics. It can process massive amounts of data and answer complex business questions with limitless scale. It's used for exploratory data analysis to identify initial patterns or meaning in the data, and for predictive analytics such as forecasting or segmenting data.
SQL pools use massively parallel processing (MPP) to quickly run queries across petabytes of data. Because storage is separated from the compute nodes, you can scale the compute nodes independently to meet demand at any time.
In Azure Synapse Analytics, the Data Movement Service (DMS) coordinates and transports data between compute nodes as necessary. But you can use a replicated table to reduce data movement and improve performance. Azure Synapse Analytics supports three types of distributed tables: hash, round-robin and replicated. Use these tables to tune performance.
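A minimal sketch of creating a hash-distributed table over T-SQL with pyodbc (the server, pool, and credentials are hypothetical); ROUND_ROBIN and REPLICATE are the other DISTRIBUTION options:

```python
import pyodbc

# Hypothetical connection details for a dedicated SQL pool.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<pool>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Hash-distribute the fact table on a high-cardinality key so rows spread
# evenly across distributions and joins on that key avoid data movement.
cursor.execute("""
CREATE TABLE dbo.FactSales
(
    SaleId     INT           NOT NULL,
    ProductKey INT           NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);
""")
conn.commit()
```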
Importantly, Azure Synapse Analytics can also pause and resume the compute layer. This means you pay only for the computation you use. This capability is useful in data warehousing.
Ingestion: Azure Synapse Analytics uses the extract, load, and transform (ELT) approach for bulk data. You can use PolyBase to copy data, and Transact-SQL to query the contents of Azure Synapse Analytics.
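A sketch of the PolyBase-style ELT pattern, using the same kind of connection as above; the external data source (MyLake) and file format (ParquetFormat) are assumed to already exist in the pool:

```python
import pyodbc

# Hypothetical connection to the same dedicated SQL pool as in the previous sketch.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<pool>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Expose the lake files as an external table, then bulk-load them into a
# distributed table with CREATE TABLE AS SELECT (CTAS).
cursor.execute("""
CREATE EXTERNAL TABLE dbo.StageSales_ext
(
    SaleId     INT,
    ProductKey INT,
    Amount     DECIMAL(10,2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = MyLake, FILE_FORMAT = ParquetFormat);
""")
cursor.execute("""
CREATE TABLE dbo.StagedSales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.StageSales_ext;
""")
conn.commit()
```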
Stream Analytics is useful if an organization must respond to data events in real time or analyze large batches of data in a continuous, time-bound stream. Data is ingested in real time from applications or from IoT devices and gateways into an event hub or IoT hub, which then streams the data into Stream Analytics for real-time analysis. Data ingestion sources include Azure Event Hubs, Azure IoT Hub, and Azure Blob storage. To process streaming data, set up Stream Analytics jobs with input and output pipelines: inputs are provided by Event Hubs, IoT Hub, or Azure Storage, and job output can be routed to many storage systems, including Azure Blob storage, Azure SQL Database, Azure Data Lake Storage, and Azure Cosmos DB. After storing the data, you can run batch analytics in Azure HDInsight, send the output to a service like Event Hubs for consumption, or use the Power BI streaming API to send the output to Power BI for real-time visualization.
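A minimal sketch of the ingestion side of that pipeline, sending device readings to an Event Hub that a Stream Analytics job can use as input (the connection string and hub name are hypothetical):

```python
import json
import time

from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical Event Hubs connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>",
    eventhub_name="telemetry",
)

# Send a small batch of readings; a Stream Analytics job configured with this
# hub as input can then query the stream in near real time.
batch = producer.create_batch()
batch.add(EventData(json.dumps({"deviceId": "sensor-1", "temperature": 21.5, "ts": time.time()})))
batch.add(EventData(json.dumps({"deviceId": "sensor-2", "temperature": 19.8, "ts": time.time()})))
producer.send_batch(batch)
producer.close()
```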
Databricks Azure Databricks is an Apache Spark-based analytics platform optimized for Azure, used for big data processing, data engineering, and machine learning.
Data Factory Data Factory is a cloud-integration service. It orchestrates the movement of data between various data stores.
Data Catalog Data Catalog is a fully managed cloud service. Users discover and explore data sources, and they help the organization document information about their data sources.
data wrangling A data engineer's scope of work goes well beyond looking after a database and the server where it's hosted. Data engineers must also get, ingest, transform, validate, and clean up data to meet business requirements. This process is called data wrangling.
Data Engineer Data engineers primarily provision data stores. They make sure that massive amounts of data are securely and cost-effectively extracted, loaded, and transformed.
- Design and develop data storage and data processing solutions for the enterprise.
- Set up and deploy cloud-based data services such as blob services, databases, and analytics.
- Secure the platform and the stored data. Make sure only the necessary users can access the data.
- Ensure business continuity in uncommon conditions by using techniques for high availability and disaster recovery.
- Monitor to ensure that the systems run properly and are cost-effective.