Enhanced Source Configuration¶
Overview¶
As of v0.3.0, grai.build supports both simple string sources (backward compatible) and detailed source configuration for entities and relations.
Simple Source (Backward Compatible)¶
The traditional string format still works:
```yaml
entity: customer
source: analytics.customers
keys: [customer_id]
properties:
  - name: customer_id
    type: string
```
grai.build automatically infers the source type:
- `schema.table` → `type: table`
- `*.csv` → `type: csv`
- `*.json` → `type: json`
- `*.parquet` → `type: parquet`
- `http://...` → `type: api`
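The inference rules above can be sketched in plain Python. The function below is an illustrative stand-in, not grai.build's actual implementation; URL prefixes are checked first so that an endpoint containing dots is not mistaken for a `schema.table` name:

```python
def infer_source_type(source: str) -> str:
    """Illustrative sketch of the source-type inference rules."""
    if source.startswith(("http://", "https://")):
        return "api"
    if source.endswith(".csv"):
        return "csv"
    if source.endswith(".json"):
        return "json"
    if source.endswith(".parquet"):
        return "parquet"
    # Anything else (e.g. a dotted schema.table name) is treated as a table
    return "table"

print(infer_source_type("analytics.customers"))  # table
print(infer_source_type("data/products.csv"))    # csv
```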
Detailed Source Configuration¶
For more complex scenarios, use the expanded format:
```yaml
entity: customer
source:
  name: customers
  type: table
  connection: prod_analytics
  db_schema: public
  database: analytics_db
  format: delta
  metadata:
    owner: data-team
    refresh_schedule: "0 0 * * *"
keys: [customer_id]
properties:
  - name: customer_id
    type: string
  - name: name
    type: string
```
Source Types¶
Supported type values:
- `database` - Database connection
- `table` - Database table or view
- `csv` - CSV file
- `json` - JSON file
- `parquet` - Parquet file
- `api` - REST API endpoint
- `stream` - Data stream (Kafka, Kinesis, etc.)
- `other` - Custom source type
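For orientation, these values can be modeled as a string-backed enum. This is only a sketch of the shape of the type; the actual `SourceType` enum in `grai.core` may differ in detail:

```python
from enum import Enum


class SourceType(str, Enum):
    """Illustrative sketch of the supported source types."""
    DATABASE = "database"  # database connection
    TABLE = "table"        # database table or view
    CSV = "csv"            # CSV file
    JSON = "json"          # JSON file
    PARQUET = "parquet"    # Parquet file
    API = "api"            # REST API endpoint
    STREAM = "stream"      # data stream (Kafka, Kinesis, etc.)
    OTHER = "other"        # custom source type


# Raw strings from YAML round-trip cleanly through the enum
print(SourceType("csv").value)  # csv
```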
Field Descriptions¶
| Field | Type | Description | Example |
|---|---|---|---|
| `name` | string | Source identifier (required) | `customers`, `analytics.customers` |
| `type` | enum | Source type | `table`, `csv`, `api` |
| `connection` | string | Connection or data source name | `prod_db`, `s3_bucket` |
| `db_schema` | string | Database schema name | `public`, `analytics` |
| `database` | string | Database name | `analytics_db`, `warehouse` |
| `format` | string | Data format details | `delta`, `iceberg`, `parquet` |
| `metadata` | dict | Additional custom metadata | `{owner: "team", version: "v2"}` |
Examples¶
CSV File¶
```yaml
entity: product
source:
  name: products.csv
  type: csv
  format: utf-8
  metadata:
    location: s3://my-bucket/data/
keys: [product_id]
```
API Endpoint¶
```yaml
entity: order
source:
  name: /api/v1/orders
  type: api
  connection: rest_api
  metadata:
    base_url: https://api.example.com
    auth: bearer_token
keys: [order_id]
```
Database Table with Schema¶
```yaml
entity: transaction
source:
  name: transactions
  type: table
  database: financial_db
  db_schema: public
  connection: prod_postgres
  metadata:
    partitioned_by: date
keys: [transaction_id]
```
Kafka Stream¶
```yaml
entity: event
source:
  name: user-events
  type: stream
  connection: kafka_prod
  metadata:
    topic: user.events.v1
    consumer_group: grai-consumer
keys: [event_id]
```
Benefits¶
- Better Documentation - Clear indication of source type and location
- Multiple Connections - Support for different environments (dev, staging, prod)
- Metadata Tracking - Store ownership, refresh schedules, and custom info
- Type Safety - Explicit source types prevent confusion
- Tooling Integration - Easier integration with data catalogs and lineage tools
Migration Guide¶
Existing YAML files with simple string sources continue to work without changes. To migrate to the enhanced format, replace the source string with a mapping: move the string (or its table portion) into `name`, set an explicit `type`, and add any connection details or metadata you need.
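For example, the simple string source from the overview maps onto the detailed form like this (the `db_schema` split follows the `schema.table` inference rule above):

Before (simple string):

```yaml
entity: customer
source: analytics.customers
keys: [customer_id]
```

After (detailed):

```yaml
entity: customer
source:
  name: customers
  type: table
  db_schema: analytics
keys: [customer_id]
```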
Usage in Code¶
The Python API automatically handles both formats:
```python
from grai.core.models import Entity

# Simple string (auto-converted)
entity1 = Entity(
    entity="customer",
    source="analytics.customers",
    keys=["customer_id"],
    properties=[],
)

# Get source name
print(entity1.get_source_name())  # "analytics.customers"

# Get full config
config = entity1.get_source_config()
print(config.type)  # SourceType.TABLE (inferred)

# Detailed config
entity2 = Entity(
    entity="customer",
    source={
        "name": "customers",
        "type": "table",
        "db_schema": "analytics",
    },
    keys=["customer_id"],
    properties=[],
)
```
Graph IR Export¶
The enhanced source configuration is fully exported in the Graph IR (JSON):
```json
{
  "entities": [
    {
      "name": "customer",
      "source": {
        "name": "customers",
        "type": "table",
        "connection": "prod_db",
        "schema": "analytics",
        "database": "warehouse",
        "format": null,
        "metadata": {}
      },
      "keys": ["customer_id"]
    }
  ]
}
```
This makes it easy to integrate with data catalogs, lineage tools, and other metadata systems.
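As a sketch of such an integration, the exported IR can be consumed with nothing more than the standard library. The JSON below mirrors the document above; the lineage-record format it prints is purely illustrative:

```python
import json

# The Graph IR document shown above, as emitted by grai.build
graph_ir = json.loads("""
{
  "entities": [
    {
      "name": "customer",
      "source": {
        "name": "customers",
        "type": "table",
        "connection": "prod_db",
        "schema": "analytics",
        "database": "warehouse",
        "format": null,
        "metadata": {}
      },
      "keys": ["customer_id"]
    }
  ]
}
""")

# Build a simple lineage record per entity, e.g. for a data catalog
for entity in graph_ir["entities"]:
    src = entity["source"]
    # Join database.schema.name, skipping any missing parts
    location = ".".join(
        part for part in (src.get("database"), src.get("schema"), src["name"]) if part
    )
    print(f"{entity['name']} <- {src['type']}:{location}")
    # customer <- table:warehouse.analytics.customers
```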