📊 Data Loading Guide

Complete guide to understanding data loading in grai.build.


🎯 Overview

Important Philosophy: grai.build is a schema management tool, not a data loading tool.

Think of it like database migrations:

  • Alembic/Flyway manage your schema (tables, columns, constraints)
  • Your application/ETL manages your data
  • grai.build manages your graph schema (entities, relations, constraints)
  • Your data pipeline manages your graph data

What grai.build Does

  1. Schema: Define entities, relations, and properties in YAML
  2. Validation: Ensure schema consistency before deployment
  3. Generation: Create Cypher constraints and indexes
  4. Documentation: Auto-generate visualizations and lineage

What grai.build Does NOT Do (in Production)

  1. ETL: Extract data from source systems (use Airbyte, Fivetran, dbt)
  2. Data Pipelines: Schedule and orchestrate data loading (use Airflow, Prefect)
  3. Real-time Sync: Stream changes to your graph (use Kafka, CDC, application code)

CSV Loading is for Development Only

The --load-csv feature exists only for:

  • ✅ Quick local testing
  • ✅ Demos and tutorials
  • ✅ Validating schema with sample data

In production, you need proper ETL pipelines. See strategies below.


🏗️ Schema-Only Mode (Default)

What it does

Creates only the database schema:

  • Unique constraints on entity keys
  • Indexes on entity properties
  • No actual data nodes or relationships

When to use

  • Getting started with a new project
  • Setting up a new database
  • Testing schema definitions
  • CI/CD pipelines (schema validation)

How to use

# Schema only (default)
grai run --uri bolt://localhost:7687 --user neo4j --password secret

# Explicit flag
grai run --schema-only --uri bolt://localhost:7687 --user neo4j --password secret

What gets created

// Constraints for unique keys
CREATE CONSTRAINT constraint_customer_customer_id IF NOT EXISTS
FOR (n:customer) REQUIRE n.customer_id IS UNIQUE;

// Indexes for properties
CREATE INDEX index_customer_name IF NOT EXISTS
FOR (n:customer) ON (n.name);

CREATE INDEX index_customer_email IF NOT EXISTS
FOR (n:customer) ON (n.email);
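
To confirm the schema landed, you can list the constraints from Python as well as in the browser. A minimal sketch using the official neo4j driver; the connection details are assumptions:

from neo4j import GraphDatabase

# Assumed connection details; adjust for your environment
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Each record describes one constraint created by grai run
    for record in session.run("SHOW CONSTRAINTS"):
        print(record["name"], record["labelsOrTypes"], record["properties"])

driver.close()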

📦 With Data Mode

What it does

Generates MERGE statements with row.property placeholders designed for use with LOAD CSV or parameterized queries.

When to use

  • You have CSV files prepared
  • You're using custom data loading scripts
  • You need to generate templates for ETL pipelines

How to use

# Generate data loading statements
grai run --with-data --uri bolt://localhost:7687 --user neo4j --password secret

⚠️ Important Note

The --with-data flag generates Cypher like this:

MERGE (n:customer {customer_id: row.customer_id})
SET n.name = row.name,
    n.email = row.email;

This will fail if executed directly because row is undefined. You need to either:

  1. Wrap it in a LOAD CSV statement
  2. Use it as a template for parameterized queries
  3. Use Python/application code to supply parameters (see the sketch below)
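
As a concrete illustration of the third approach, here is a minimal sketch using the official neo4j Python driver, with the generated MERGE rewritten to bind row as a map parameter (connection details and sample rows are assumptions):

from neo4j import GraphDatabase

# Assumed connection details; adjust for your environment
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# The generated template, with row supplied as a $row map parameter
QUERY = """
MERGE (n:customer {customer_id: $row.customer_id})
SET n.name = $row.name,
    n.email = $row.email
"""

rows = [
    {"customer_id": "C001", "name": "Alice Johnson", "email": "alice@example.com"},
    {"customer_id": "C002", "name": "Bob Smith", "email": "bob@example.com"},
]

with driver.session() as session:
    for row in rows:
        session.run(QUERY, row=row)  # each call binds one row

driver.close()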

🎁 Quick Start with Sample Data

When you run grai init, sample CSV files and a loading script are automatically created:

your-project/
├── data/
│   ├── customers.csv      # 5 sample customers
│   ├── products.csv       # 6 sample products
│   └── purchased.csv      # 10 sample orders
└── load_data.cypher       # Ready-to-use Cypher script

Option 1: Automatic (--load-csv)

To load the sample data immediately:

# Create schema AND load CSV data in one command
grai run --load-csv --password yourpassword

This will:

  1. Build and validate your project
  2. Create the schema (constraints & indexes)
  3. Automatically load CSV data from load_data.cypher

Option 2: Manual (Neo4j Browser)

  1. Create the schema:

grai run --password yourpassword

  2. Load the CSV data:
  • Open Neo4j Browser (http://localhost:7474)
  • Copy and paste the contents of load_data.cypher
  • Run the script

Option 3: Manual (cypher-shell)

# Create schema
grai run --password yourpassword

# Load data
cat load_data.cypher | cypher-shell -u neo4j -p yourpassword

That's it! Your graph is now populated with sample data.


🔄 Data Loading Strategies

Strategy 1: Python Scripts

Use Python to load data directly:

from grai.core.loader.neo4j_loader import connect_neo4j, execute_cypher, close_connection

URI = "bolt://localhost:7687"
USER = "neo4j"
PASSWORD = "graipassword"

DATA = """
CREATE (c1:customer {
    customer_id: 'C001',
    name: 'Alice Johnson',
    email: 'alice@example.com',
    created_at: datetime('2024-01-15')
});

CREATE (c2:customer {
    customer_id: 'C002',
    name: 'Bob Smith',
    email: 'bob@example.com',
    created_at: datetime('2024-02-01')
});
"""

driver = connect_neo4j(uri=URI, user=USER, password=PASSWORD)
result = execute_cypher(driver, DATA)

if result.success:
    print(f"✅ Loaded {result.records_affected} records")

close_connection(driver)

Strategy 2: LOAD CSV

Prepare CSV files and use Neo4j's LOAD CSV:

customers.csv:

customer_id,name,email,created_at
C001,Alice Johnson,alice@example.com,2024-01-15T00:00:00Z
C002,Bob Smith,bob@example.com,2024-02-01T00:00:00Z

Load script:

LOAD CSV WITH HEADERS FROM 'file:///data/customers.csv' AS row
MERGE (n:customer {customer_id: row.customer_id})
SET n.name = row.name,
    n.email = row.email,
    n.created_at = datetime(row.created_at);
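
For files too large to load in a single transaction, batching keeps each transaction small. A minimal Python sketch, assuming the official neo4j driver and a local customers.csv (names and connection details are illustrative):

import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# UNWIND turns one batch of rows into many MERGEs in a single transaction
QUERY = """
UNWIND $rows AS row
MERGE (n:customer {customer_id: row.customer_id})
SET n.name = row.name,
    n.email = row.email,
    n.created_at = datetime(row.created_at)
"""

def load_in_batches(path, batch_size=1000):
    with open(path, newline="", encoding="utf-8") as f, driver.session() as session:
        batch = []
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                session.run(QUERY, rows=batch)
                batch = []
        if batch:  # flush the final partial batch
            session.run(QUERY, rows=batch)

load_in_batches("customers.csv")
driver.close()

If you prefer to stay in Cypher, Neo4j's CALL { ... } IN TRANSACTIONS achieves the same batching for LOAD CSV.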

Strategy 3: Application Integration

Use the neo4j driver in your application:

from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "password")
)

def create_customer(tx, customer_id, name, email):
    query = """
    MERGE (c:customer {customer_id: $customer_id})
    SET c.name = $name,
        c.email = $email
    """
    tx.run(query, customer_id=customer_id, name=name, email=email)

with driver.session() as session:
    # Managed write transaction (retries on transient errors)
    session.execute_write(
        create_customer,
        "C001",
        "Alice Johnson",
        "alice@example.com"
    )

driver.close()

🚀 Recommended Workflow

1. Create Schema

# First, create the schema
grai run --uri bolt://localhost:7687 --user neo4j --password secret

2. Verify Schema

// In Neo4j Browser
SHOW CONSTRAINTS;
SHOW INDEXES;

3. Load Data

Choose your strategy:

  • Small datasets: Python scripts (Strategy 1)
  • Large datasets: LOAD CSV (Strategy 2)
  • Production apps: Application integration (Strategy 3)

4. Verify Data

// Check what was loaded
MATCH (n)
RETURN labels(n) AS type, count(n) AS count;

// View sample data
MATCH (n:customer)
RETURN n
LIMIT 5;

🔧 Troubleshooting

Error: "Variable row not defined"

Cause: Trying to execute data loading Cypher without LOAD CSV context.

Solution: Use --schema-only (default) instead of --with-data:

# This works (default)
grai run

# This will fail without CSV files
grai run --with-data

No data appears after grai run

Expected behavior: By default, grai run creates only the schema; it does not load any data.

Solution: Load data using Python scripts (see Strategy 1 above).

CSV loading fails

Common issues:

  1. File path: Make sure CSV is in Neo4j import directory
  2. Headers: CSV must have headers matching property names (see the check below)
  3. Encoding: Use UTF-8 encoding
  4. Line endings: Unix (LF) line endings preferred
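
A quick pre-flight check in Python catches header and encoding problems before Neo4j does. A minimal sketch; the expected column names are assumptions based on the sample customer schema:

import csv

# Assumed property names from the sample customer schema
EXPECTED = {"customer_id", "name", "email", "created_at"}

with open("customers.csv", newline="", encoding="utf-8") as f:
    headers = set(next(csv.reader(f)))  # raises UnicodeDecodeError if not UTF-8

missing = EXPECTED - headers
extra = headers - EXPECTED
if missing:
    print(f"Missing columns: {sorted(missing)}")
if extra:
    print(f"Unexpected columns: {sorted(extra)}")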

📚 Examples

Complete Example: Schema + Data

# Step 1: Create schema
grai run --uri bolt://localhost:7687 --user neo4j --password secret

# Step 2: Load data
cat > load_data.py << 'EOF'
from grai.core.loader.neo4j_loader import connect_neo4j, execute_cypher, close_connection

driver = connect_neo4j(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="secret"
)

data = """
CREATE (c:customer {customer_id: 'C001', name: 'Alice', email: 'alice@example.com'});
CREATE (p:product {product_id: 'P001', name: 'Laptop', price: 999.99});
MATCH (c:customer {customer_id: 'C001'})
MATCH (p:product {product_id: 'P001'})
CREATE (c)-[:PURCHASED {order_id: 'O001', order_date: date()}]->(p);
"""

result = execute_cypher(driver, data)
print(f"✅ Created {result.records_affected} records")
close_connection(driver)
EOF

python load_data.py

# Step 3: Verify
# Open Neo4j Browser and run:
# MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25;

🎯 Best Practices

  1. Always create schema first before loading data
  2. Use schema-only mode by default (it's the default for a reason)
  3. Load data separately using Python scripts or LOAD CSV
  4. Test with small datasets before loading production data
  5. Use transactions for bulk loading
  6. Add error handling in data loading scripts
  7. Validate data before loading (check for nulls, duplicates, etc.; see the sketch below)
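
As an illustration of practice 7, a minimal validation sketch; the file name and key column are assumptions based on the sample data:

import csv
from collections import Counter

with open("customers.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Nulls: the MERGE key must be present and non-empty in every row
empty_keys = [i for i, r in enumerate(rows) if not r.get("customer_id")]

# Duplicates: MERGE would silently collapse rows that share a key
counts = Counter(r["customer_id"] for r in rows)
dupes = [k for k, c in counts.items() if c > 1]

if empty_keys or dupes:
    raise SystemExit(f"Rows with empty keys: {empty_keys}; duplicate keys: {dupes}")
print(f"{len(rows)} rows look clean")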


Questions? Issues?

File an issue on GitHub: https://github.com/grai-build/grai.build/issues