📊 Data Loading Guide¶
Complete guide to understanding data loading in grai.build.
🎯 Overview¶
Important Philosophy: grai.build is a schema management tool, not a data loading tool.
Think of it like database migrations:
- Alembic/Flyway manage your schema (tables, columns, constraints)
- Your application/ETL manages your data
- grai.build manages your graph schema (entities, relations, constraints)
- Your data pipeline manages your graph data
What grai.build Does¶
- Schema: Define entities, relations, and properties in YAML
- Validation: Ensure schema consistency before deployment
- Generation: Create Cypher constraints and indexes
- Documentation: Auto-generate visualizations and lineage
What grai.build Does NOT Do (in Production)¶
- ETL: Extract data from source systems (use Airbyte, Fivetran, dbt)
- Data Pipelines: Schedule and orchestrate data loading (use Airflow, Prefect)
- Real-time Sync: Stream changes to your graph (use Kafka, CDC, application code)
CSV Loading is for Development Only¶
The --load-csv feature exists only for:
- ✅ Quick local testing
- ✅ Demos and tutorials
- ✅ Validating schema with sample data
In production, you need proper ETL pipelines. See strategies below.
🏗️ Schema-Only Mode (Default)¶
What it does¶
Creates only the database schema:
- Unique constraints on entity keys
- Indexes on entity properties
- No actual data nodes or relationships
When to use¶
- Getting started with a new project
- Setting up a new database
- Testing schema definitions
- CI/CD pipelines (schema validation)
How to use¶
# Schema only (default)
grai run --uri bolt://localhost:7687 --user neo4j --password secret
# Explicit flag
grai run --schema-only --uri bolt://localhost:7687 --user neo4j --password secret
What gets created¶
// Constraints for unique keys
CREATE CONSTRAINT constraint_customer_customer_id IF NOT EXISTS
FOR (n:customer) REQUIRE n.customer_id IS UNIQUE;
// Indexes for properties
CREATE INDEX index_customer_name IF NOT EXISTS
FOR (n:customer) ON (n.name);
CREATE INDEX index_customer_email IF NOT EXISTS
FOR (n:customer) ON (n.email);
📦 With Data Mode¶
What it does¶
Generates MERGE statements with row.property placeholders designed for use with LOAD CSV or parameterized queries.
When to use¶
- You have CSV files prepared
- You're using custom data loading scripts
- You need to generate templates for ETL pipelines
How to use¶
# Generate data loading statements
grai run --with-data --uri bolt://localhost:7687 --user neo4j --password secret
⚠️ Important Note¶
The --with-data flag generates Cypher like this:
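The generated statements look roughly like this (an illustrative sketch based on the customer entity used throughout this guide; the exact output may differ):
MERGE (n:customer {customer_id: row.customer_id})
SET n.name = row.name,
    n.email = row.email,
    n.created_at = datetime(row.created_at);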
This will fail if executed directly because row is undefined. You need to do one of the following:
- Wrap it in a LOAD CSV statement
- Use it as a template for parameterized queries
- Use Python/application code to supply parameters (see the sketch after this list)
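For example, one way to reuse the generated template as a parameterized query is to wrap it in UNWIND and supply the rows from application code. This is a minimal sketch using the official neo4j Python driver; the connection details and row values are placeholders:
from neo4j import GraphDatabase

# Placeholder connection details; adjust to your environment
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# UNWIND defines `row`, so the generated MERGE template can be reused as-is
query = """
UNWIND $rows AS row
MERGE (n:customer {customer_id: row.customer_id})
SET n.name = row.name,
    n.email = row.email
"""

rows = [
    {"customer_id": "C001", "name": "Alice Johnson", "email": "alice@example.com"},
    {"customer_id": "C002", "name": "Bob Smith", "email": "bob@example.com"},
]

with driver.session() as session:
    session.run(query, rows=rows)

driver.close()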
🎁 Quick Start with Sample Data¶
When you run grai init, sample CSV files and a loading script are automatically created:
your-project/
├── data/
│   ├── customers.csv       # 5 sample customers
│   ├── products.csv        # 6 sample products
│   └── purchased.csv       # 10 sample orders
└── load_data.cypher        # Ready-to-use Cypher script
To load the sample data immediately:
Option 1: One Command (Recommended)¶
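The exact invocation may vary; assuming the development-only --load-csv flag described above, it likely looks like:
# Build, validate, create the schema, and load the sample CSVs (assumed invocation)
grai run --load-csv --password yourpassword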
This will:
- Build and validate your project
- Create the schema (constraints & indexes)
- Automatically load CSV data from load_data.cypher
Option 2: Manual (Neo4j Browser)¶
- Create the schema by running grai run (the same command as in Option 3 below)
- Load the CSV data:
  - Open Neo4j Browser (http://localhost:7474)
  - Copy and paste the contents of load_data.cypher
  - Run the script
Option 3: Manual (cypher-shell)¶
# Create schema
grai run --password yourpassword
# Load data
cat load_data.cypher | cypher-shell -u neo4j -p yourpassword
That's it! Your graph is now populated with sample data.
🔄 Data Loading Strategies¶
Strategy 1: Python Scripts (Recommended)¶
Use Python to load data directly:
from grai.core.loader.neo4j_loader import connect_neo4j, execute_cypher, close_connection

URI = "bolt://localhost:7687"
USER = "neo4j"
PASSWORD = "graipassword"

# Cypher that creates two sample customer nodes
DATA = """
CREATE (c1:customer {
    customer_id: 'C001',
    name: 'Alice Johnson',
    email: 'alice@example.com',
    created_at: datetime('2024-01-15')
});
CREATE (c2:customer {
    customer_id: 'C002',
    name: 'Bob Smith',
    email: 'bob@example.com',
    created_at: datetime('2024-02-01')
});
"""

driver = connect_neo4j(uri=URI, user=USER, password=PASSWORD)
result = execute_cypher(driver, DATA)
if result.success:
    print(f"✅ Loaded {result.records_affected} records")
close_connection(driver)
Strategy 2: LOAD CSV¶
Prepare CSV files and use Neo4j's LOAD CSV:
customers.csv:
customer_id,name,email,created_at
C001,Alice Johnson,alice@example.com,2024-01-15T00:00:00Z
C002,Bob Smith,bob@example.com,2024-02-01T00:00:00Z
Load script:
LOAD CSV WITH HEADERS FROM 'file:///data/customers.csv' AS row
MERGE (n:customer {customer_id: row.customer_id})
SET n.name = row.name,
    n.email = row.email,
    n.created_at = datetime(row.created_at);
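Relationships can be loaded the same way. A sketch for the sample purchased.csv, assuming it has customer_id, product_id, order_id, and order_date columns:
LOAD CSV WITH HEADERS FROM 'file:///data/purchased.csv' AS row
MATCH (c:customer {customer_id: row.customer_id})
MATCH (p:product {product_id: row.product_id})
MERGE (c)-[r:PURCHASED {order_id: row.order_id}]->(p)
SET r.order_date = date(row.order_date);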
Strategy 3: Application Integration¶
Use the neo4j driver in your application:
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "password")
)

def create_customer(tx, customer_id, name, email):
    query = """
    MERGE (c:customer {customer_id: $customer_id})
    SET c.name = $name,
        c.email = $email
    """
    tx.run(query, customer_id=customer_id, name=name, email=email)

with driver.session() as session:
    session.write_transaction(
        create_customer,
        "C001",
        "Alice Johnson",
        "alice@example.com"
    )

driver.close()
🚀 Recommended Workflow¶
1. Create Schema¶
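Create the constraints and indexes with the same command shown earlier:
# Constraints and indexes only (default)
grai run --uri bolt://localhost:7687 --user neo4j --password secret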
2. Verify Schema¶
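Confirm the constraints and indexes exist, for example in Neo4j Browser or cypher-shell (recent Neo4j versions):
SHOW CONSTRAINTS;
SHOW INDEXES;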
3. Load Data¶
Choose your strategy:
- Small datasets: Python scripts (Strategy 1)
- Large datasets: LOAD CSV (Strategy 2)
- Production apps: Application integration (Strategy 3)
4. Verify Data¶
// Check what was loaded
MATCH (n)
RETURN labels(n) AS type, count(n) AS count;

// View sample data
MATCH (n:customer)
RETURN n
LIMIT 5;
🔧 Troubleshooting¶
Error: "Variable row not defined"¶
Cause: Trying to execute data loading Cypher without LOAD CSV context.
Solution: Use --schema-only (default) instead of --with-data:
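For example:
grai run --schema-only --uri bolt://localhost:7687 --user neo4j --password secret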
No data appears after grai run¶
Expected behavior: By default, grai run only creates the schema, not data.
Solution: Load data using Python scripts (see Strategy 1 above).
CSV loading fails¶
Common issues:
- File path: Make sure the CSV is in the Neo4j import directory (see the example after this list)
- Headers: CSV must have headers matching property names
- Encoding: Use UTF-8 encoding
- Line endings: Unix (LF) line endings preferred
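For a typical local package install, copying the project CSVs into the import directory might look like this (the path is an assumption; check your installation or Docker volume mapping):
# Default import directory on many Linux installs (adjust for your setup)
sudo cp data/*.csv /var/lib/neo4j/import/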
📚 Examples¶
Complete Example: Schema + Data¶
# Step 1: Create schema
grai run --uri bolt://localhost:7687 --user neo4j --password secret
# Step 2: Load data
cat > load_data.py << 'EOF'
from grai.core.loader.neo4j_loader import connect_neo4j, execute_cypher, close_connection
driver = connect_neo4j(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="secret"
)
data = """
CREATE (c:customer {customer_id: 'C001', name: 'Alice', email: 'alice@example.com'});
CREATE (p:product {product_id: 'P001', name: 'Laptop', price: 999.99});
MATCH (c:customer {customer_id: 'C001'})
MATCH (p:product {product_id: 'P001'})
CREATE (c)-[:PURCHASED {order_id: 'O001', order_date: date()}]->(p);
"""
result = execute_cypher(driver, data)
print(f"✅ Created {result.records_affected} records")
close_connection(driver)
EOF
python load_data.py
# Step 3: Verify
# Open Neo4j Browser and run:
# MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25;
🎯 Best Practices¶
- Always create schema first before loading data
- Use schema-only mode by default (it's the default for a reason)
- Load data separately using Python scripts or LOAD CSV
- Test with small datasets before loading production data
- Use transactions for bulk loading
- Add error handling in data loading scripts
- Validate data before loading (check for nulls, duplicates, etc.; a sketch combining the last few points follows)
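To illustrate the last few points (transactions, error handling, validation), here is a minimal sketch using the official neo4j Python driver (5.x API); the connection details, CSV path, column names, and batch size are illustrative assumptions:
import csv
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # assumed connection details
AUTH = ("neo4j", "secret")

QUERY = """
UNWIND $rows AS row
MERGE (c:customer {customer_id: row.customer_id})
SET c.name = row.name, c.email = row.email
"""

def validate(rows):
    # Basic checks: no missing or duplicate keys
    seen = set()
    for row in rows:
        if not row.get("customer_id"):
            raise ValueError(f"Missing customer_id in row: {row}")
        if row["customer_id"] in seen:
            raise ValueError(f"Duplicate customer_id: {row['customer_id']}")
        seen.add(row["customer_id"])

def load(rows, batch_size=500):
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with driver.session() as session:
            # Each batch is written in its own transaction
            for i in range(0, len(rows), batch_size):
                batch = rows[i:i + batch_size]
                session.execute_write(lambda tx, b: tx.run(QUERY, rows=b).consume(), batch)
    finally:
        driver.close()

if __name__ == "__main__":
    with open("data/customers.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    validate(rows)
    load(rows)
    print(f"✅ Loaded {len(rows)} customers")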
📖 Related Documentation¶
- Getting Started Guide - Complete beginner tutorial
- Neo4j Setup Guide - Local Neo4j installation
- CLI Usage - Complete CLI reference
- Compiler Documentation - Cypher generation details
Questions? Issues?
File an issue on GitHub: https://github.com/grai-build/grai.build/issues