Fabric Lakehouse Skill #741
Open: tedvilutis wants to merge 21 commits into github:main from tedvilutis:main
+332 −0

Changes from all commits (21 commits):
2dcc97d  Fabric Lakehouse Skill (tedvilutis)
6181395  Update skills/fabric-lakehouse/references/pyspark.md (tedvilutis)
86d4e77  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
cf91a62  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
e00ee4d  Merge branch 'main' into main (tedvilutis)
3f9e9b0  Update pyspark.md (tedvilutis)
46f4918  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
15e245c  Update skills/fabric-lakehouse/references/pyspark.md (tedvilutis)
b1a9d7c  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
d5d303b  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
c61ffdf  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
e0c7e41  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
5217b16  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
c789c49  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
6707f34  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
c8d1718  Refine description of Fabric Lakehouse skill (tedvilutis)
41b34b1  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
4b7ad71  Update README.skills.md (tedvilutis)
178fed8  Update skills/fabric-lakehouse/references/pyspark.md (tedvilutis)
0de738c  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
3b907f7  Update skills/fabric-lakehouse/SKILL.md (tedvilutis)
@@ -0,0 +1,106 @@
---
name: fabric-lakehouse
description: 'Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices.'
metadata:
  author: tedvilutis
  version: "1.0"
---
# When to Use This Skill

Use this skill when you need to:
- Generate a document or explanation that includes definitions and context about Fabric Lakehouse and its capabilities.
- Design, build, and optimize Lakehouse solutions using best practices.
- Understand the core concepts and components of a Lakehouse in Microsoft Fabric.
- Learn how to manage tabular and non-tabular data within a Lakehouse.
# Fabric Lakehouse

## Core Concepts

### What is a Lakehouse?

A Lakehouse in Microsoft Fabric is an item that gives users a place to store their tabular data (such as tables) and non-tabular data (such as files). It combines the flexibility of a data lake with the management capabilities of a data warehouse. It provides:

- **Unified storage** in OneLake for structured and unstructured data
- **Delta Lake format** for ACID transactions, versioning, and time travel (see the sketch after this list)
- **SQL analytics endpoint** for T-SQL queries
- **Semantic model** for Power BI integration
- Support for other table formats such as CSV and Parquet
- Support for any file format
- Tools for table optimization and data management
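Because Lakehouse tables use the Delta Lake format, standard Delta history and time-travel queries work from a Fabric notebook. A minimal sketch, assuming a Delta table already exists; the table name and timestamp are hypothetical:

```python
# Inspect the version history of a Delta table
spark.sql("DESCRIBE HISTORY silver_customers").show(truncate=False)

# Query the table as of an earlier version
v1 = spark.sql("SELECT * FROM silver_customers VERSION AS OF 1")

# Or as of a point in time
snapshot = spark.sql(
    "SELECT * FROM silver_customers TIMESTAMP AS OF '2024-01-01 00:00:00'"
)
snapshot.show()
```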
### Key Components

- **Delta Tables**: Managed tables with ACID compliance and schema enforcement
- **Files**: Unstructured/semi-structured data in the Files section
- **SQL Endpoint**: Auto-generated read-only SQL interface for querying
- **Shortcuts**: Virtual links to external/internal data without copying
- **Fabric Materialized Views**: Pre-computed tables for fast query performance
### Tabular data in a Lakehouse

Tabular data, in the form of tables, is stored under the "Tables" folder. The main format for tables in a Lakehouse is Delta. A Lakehouse can also store tabular data in other formats such as CSV or Parquet, but those formats are only available for Spark querying.
Tables can be internal, where the data is stored under the "Tables" folder, or external, where only a reference to the table is stored under the "Tables" folder and the data itself lives in a referenced location. Tables are referenced through Shortcuts, which can be internal (pointing to another location in Fabric) or external (pointing to data stored outside of Fabric).
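A minimal sketch of how the two kinds of tabular data are queried from Spark; the table name and file path are hypothetical:

```python
# Delta tables registered under "Tables" can be queried by name
orders = spark.read.table("orders")
order_count = spark.sql("SELECT COUNT(*) AS n FROM orders")

# Non-Delta formats such as CSV or Parquet are read by path and are only accessible from Spark
raw_orders = (
    spark.read.format("csv")
    .option("header", "true")
    .load("Files/raw/orders/")  # hypothetical folder
)
```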
### Schemas for tables in a Lakehouse

When creating a lakehouse, users can choose to enable schemas. Schemas are used to organize Lakehouse tables. Schemas are implemented as folders under the "Tables" folder and store tables inside of those folders. The default schema is "dbo" and it can't be deleted or renamed. All other schemas are optional and can be created, renamed, or deleted. Users can reference a schema located in another lakehouse using a Schema Shortcut, thereby referencing all tables in the destination schema with a single shortcut.
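For a schema-enabled lakehouse, a minimal sketch of creating a schema and writing a table into it; the schema and table names are hypothetical:

```python
# Create a schema (a folder under "Tables") if it does not exist yet
spark.sql("CREATE SCHEMA IF NOT EXISTS sales")

# Write a DataFrame as a Delta table inside that schema
df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["order_id", "product"])
df.write.format("delta").mode("overwrite").saveAsTable("sales.orders")

# Query it with the schema-qualified name
spark.sql("SELECT * FROM sales.orders").show()
```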
### Files in a Lakehouse

Files are stored under the "Files" folder. Users can create folders and subfolders to organize their files. Any file format can be stored in a Lakehouse.
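A minimal sketch of writing and reading files under the Files section from Spark; the folder name is hypothetical:

```python
# Write a small DataFrame out as CSV under a custom folder in the Files section
report = spark.createDataFrame([(1, "2024-01-01")], ["id", "run_date"])
report.write.mode("overwrite").option("header", "true").csv("Files/exports/daily_report")

# Read it back by path
df = spark.read.option("header", "true").csv("Files/exports/daily_report")
df.show()
```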
### Fabric Materialized Views

Fabric Materialized Views are a set of pre-computed tables that are automatically updated on a schedule. They provide fast query performance for complex aggregations and joins. Materialized views are defined using PySpark or Spark SQL and stored in an associated Notebook.
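A sketch of what a materialized view definition might look like in the associated notebook. The CREATE MATERIALIZED LAKE VIEW syntax and the table names are assumptions here; verify the exact DDL against the current Fabric documentation:

```python
# Assumed DDL for a Fabric materialized lake view; confirm the syntax in the Fabric docs
spark.sql("""
    CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS gold_daily_sales
    AS
    SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM silver_orders
    GROUP BY order_date
""")
```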
### Spark Views

Spark views are logical tables defined by a SQL query. They do not store data but provide a virtual layer for querying. Views are defined using Spark SQL and stored in the Lakehouse next to Tables.
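A minimal sketch of defining and querying a Spark view; the view and table names are hypothetical:

```python
# Define a view over an existing table; no data is copied
spark.sql("""
    CREATE OR REPLACE VIEW active_customers AS
    SELECT customer_id, name, email
    FROM silver_customers
    WHERE status = 'active'
""")

# Query the view like a table
spark.sql("SELECT COUNT(*) AS active_count FROM active_customers").show()
```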
## Security

### Item access or control plane security

Users can have workspace roles (Admin, Member, Contributor, Viewer) that provide different levels of access to a Lakehouse and its contents. Users can also be granted access through the Lakehouse sharing capabilities.
### Data access or OneLake security

For data access, use the OneLake security model, which is based on Microsoft Entra ID (formerly Azure Active Directory) and role-based access control (RBAC). Lakehouse data is stored in OneLake, so access to data is controlled through OneLake permissions. In addition to object-level permissions, a Lakehouse also supports column-level and row-level security for tables, allowing fine-grained control over who can see specific columns or rows in a table.
## Lakehouse Shortcuts

Shortcuts create virtual links to data without copying it (see the sketch after the list of shortcut types below):

### Types of Shortcuts

- **Internal**: Link to other Fabric Lakehouses/tables, cross-workspace data sharing
- **ADLS Gen2**: Link to ADLS Gen2 containers in Azure
- **Amazon S3**: AWS S3 buckets, cross-cloud data access
- **Dataverse**: Microsoft Dataverse, business application data
- **Google Cloud Storage**: GCS buckets, cross-cloud data access
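Shortcuts are created through the Fabric UI or REST API; once a shortcut exists, it is read like local data. A minimal sketch, assuming a table shortcut named external_sales under Tables and a file shortcut folder named landing under Files (both hypothetical):

```python
# A table shortcut behaves like a local Delta table
sales = spark.read.table("external_sales")

# A file shortcut behaves like a local folder under Files
raw = spark.read.format("parquet").load("Files/landing/2024/")

sales.show(5)
```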
## Performance Optimization

### V-Order Optimization

For faster data reads with the semantic model, enable V-Order optimization on Delta tables. This presorts data in a way that improves query performance for common access patterns.
### Table Optimization

Tables can also be optimized using the OPTIMIZE command, which compacts small files into larger ones and can apply Z-ordering to improve query performance on specific columns. Regular optimization helps maintain performance as data is ingested and updated over time. The VACUUM command can be used to clean up old files and free up storage space, especially after updates and deletes.
## Lineage

The Lakehouse item supports lineage, which allows users to track the origin and transformations of data. Lineage information is automatically captured for tables and files in Lakehouse, showing how data flows from source to destination. This helps with debugging, auditing, and understanding data dependencies.
## PySpark Code Examples

See [PySpark code](references/pyspark.md) for details.

## Getting data into Lakehouse

See [Get data](references/getdata.md) for details.
@@ -0,0 +1,36 @@
### Data Factory Integration

Microsoft Fabric includes Data Factory for ETL/ELT orchestration:

- **180+ connectors** for data sources
- **Copy activity** for data movement
- **Dataflow Gen2** for transformations
- **Notebook activity** for Spark processing
- **Scheduling** and triggers
### Pipeline Activities

| Activity | Description |
|----------|-------------|
| Copy Data | Move data between sources and Lakehouse |
| Notebook | Execute Spark notebooks |
| Dataflow | Run Dataflow Gen2 transformations |
| Stored Procedure | Execute SQL procedures |
| ForEach | Loop over items |
| If Condition | Conditional branching |
| Get Metadata | Retrieve file/folder metadata |
| Lakehouse Maintenance | Optimize and vacuum Delta tables |
### Orchestration Patterns

```
Pipeline: Daily_ETL_Pipeline
├── Get Metadata (check for new files)
├── ForEach (process each file)
│   ├── Copy Data (bronze layer)
│   └── Notebook (silver transformation)
├── Notebook (gold aggregation)
└── Lakehouse Maintenance (optimize tables)
```
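As an illustration of what the "Notebook (silver transformation)" step in the pattern above might run, here is a minimal PySpark sketch; the table names and columns are hypothetical and not part of the pipeline definition itself:

```python
from pyspark.sql.functions import col, to_date

# Read the raw data landed by the Copy Data (bronze) step
bronze = spark.read.table("bronze_orders")

# Apply light cleanup and typing for the silver layer
silver = (
    bronze
    .filter(col("order_id").isNotNull())
    .withColumn("order_date", to_date(col("order_date")))
    .dropDuplicates(["order_id"])
)

# Write the result as a Delta table for the downstream gold aggregation
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")
```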
---
@@ -0,0 +1,189 @@
### Spark Configuration (Best Practices)

```python
# Enable Fabric optimizations
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```
### Reading Data

```python
# Read CSV file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/bronze/data.csv")

# Read JSON file
df = spark.read.format("json").load("Files/bronze/data.json")

# Read Parquet file
df = spark.read.format("parquet").load("Files/bronze/data.parquet")

# Read Delta table
df = spark.read.table("my_delta_table")

# Query a table with Spark SQL
df = spark.sql("SELECT * FROM lakehouse.my_table")
```
### Writing Delta Tables

```python
# Write DataFrame as managed Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("silver_customers")

# Write with partitioning
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("silver_transactions")

# Append to existing table
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver_events")
```
### Delta Table Operations (CRUD)

```python
# UPDATE
spark.sql("""
    UPDATE silver_customers
    SET status = 'active'
    WHERE last_login > '2024-01-01'  -- Example date, adjust as needed
""")

# DELETE
spark.sql("""
    DELETE FROM silver_customers
    WHERE is_deleted = true
""")

# MERGE (Upsert)
spark.sql("""
    MERGE INTO silver_customers AS target
    USING staging_customers AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
### Schema Definition

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DecimalType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("amount", DecimalType(18, 2), True),
    StructField("created_at", TimestampType(), True)
])

df = spark.read.format("csv") \
    .schema(schema) \
    .option("header", "true") \
    .load("Files/bronze/customers.csv")
```
### SQL Magic in Notebooks

```sql
%%sql
-- Query Delta table directly
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount
FROM gold_orders
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10
```
### V-Order Optimization

```python
# Enable V-Order for read optimization
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
```
### Table Optimization

```sql
%%sql
-- Optimize table (compact small files)
OPTIMIZE silver_transactions

-- Optimize with Z-ordering on query columns
OPTIMIZE silver_transactions ZORDER BY (customer_id, transaction_date)

-- Vacuum old files (default 7 days retention)
VACUUM silver_transactions

-- Vacuum with custom retention
VACUUM silver_transactions RETAIN 168 HOURS
```
### Incremental Load Pattern

```python
from pyspark.sql.functions import col

# Get last processed watermark
# (assumes silver_orders.processed_timestamp tracks the bronze created_at of the last load;
#  on the very first run this returns None and a default value would be needed)
last_watermark = spark.sql("""
    SELECT MAX(processed_timestamp) AS watermark
    FROM silver_orders
""").collect()[0]["watermark"]

# Load only new records
new_records = spark.read.format("delta") \
    .table("bronze_orders") \
    .filter(col("created_at") > last_watermark)

# Merge new records
new_records.createOrReplaceTempView("staging_orders")
spark.sql("""
    MERGE INTO silver_orders AS target
    USING staging_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
### SCD Type 2 Pattern

```python
# Close existing records
spark.sql("""
    UPDATE dim_customer
    SET is_current = false, end_date = current_timestamp()
    WHERE customer_id IN (SELECT customer_id FROM staging_customer)
      AND is_current = true
""")

# Insert new versions
spark.sql("""
    INSERT INTO dim_customer
    SELECT
        customer_id,
        name,
        email,
        address,
        current_timestamp() AS start_date,
        null AS end_date,
        true AS is_current
    FROM staging_customer
""")
```