Incremental Data Loading

Overview

The GET /files endpoint supports incremental data loading through timestamp-based filters. This enables efficient Change Data Capture (CDC) workflows for syncing to Snowflake, BigQuery, or other data warehouses.

Instead of fetching all files on every sync, use updatedAfter or createdAfter to retrieve only what’s changed since your last sync.

Timestamp Filters

updatedAfter

Filters files updated on or after the specified timestamp. GET /files?updatedAfter=2025-01-01T00:00:00.000Z

createdAfter

Filters files created on or after the specified timestamp. GET /files?createdAfter=2025-01-01T00:00:00.000Z

Timestamp Contract

Property	Value
Format	ISO 8601
Timezone	UTC
Precision	Milliseconds
Example	`2025-01-01T00:00:00.000Z`

All timestamps in the response (createdAt, updatedAt) are returned in UTC with millisecond precision.

Watermark Semantics

Inclusive Comparison (`>=`)

The updatedAfter and createdAfter filters use inclusive comparison. This means:

A file with updatedAt = 2025-01-01T12:00:00.000Z will be returned when querying with updatedAfter=2025-01-01T12:00:00.000Z

Deduplication Requirement

Because of inclusive comparison, clients may receive duplicate records across paginated requests or subsequent syncs. You must deduplicate using fileId + updatedAt.

Example deduplication in Snowflake:

MERGE INTO target_table t
USING staging_table s
ON t.file_id = s.file_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...

Ordering & Pagination

Default Sort Order

Parameter	Default Value
`sortBy`	`createdAt`
`sortOrder`	`ASC`

Recommended Sort for CDC

When using updatedAfter for incremental loading, always sort by updatedAt in ascending order for reliable watermark tracking.

GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC

Pagination Parameters

The API uses offset-based pagination with page and limit parameters.

Parameter	Default	Maximum
`page`	1	—
`limit`	100	100

Example request:

GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC&page=1&limit=100

Pagination Stability Warning

Offset-based pagination may produce inconsistent results if records are updated during pagination. You may miss records or see duplicates.

Mitigation strategies:

Use smaller time windows for updatedAfter
Always deduplicate by fileId + updatedAt
Re-sync periodically with a larger time window to catch missed records

What Triggers `updatedAt`?

The updatedAt timestamp changes when any of these events occur:

Event	Updates `updatedAt`
File metadata changes	✅ Yes
Schema/form field updates	✅ Yes
Document status changes	✅ Yes
Form filling/reprocessing	✅ Yes
Tag modifications	✅ Yes
Contact/vendor updates	✅ Yes
Workflow step changes	✅ Yes
Classification changes	✅ Yes
OCR reprocessing	✅ Yes

Delete Handling

Important: Deleted files are excluded from the GET /files response and do not appear in updatedAfter queries. This endpoint provides insert/update only — not full CDC.

Tracking Deletions

If you need to detect deleted files:

Option A: Periodic Full Sync

Do a full sync periodically and compare with your existing data to detect missing records.

Option B: Deletion Events Endpoint

Contact the fileAI team about a dedicated deletion events endpoint (roadmap item).

Best Practices

1. Store Your Watermark

After each successful sync, store the maximum updatedAt value from the batch:

# Pseudocode
last_sync_watermark = max(record['updatedAt'] for record in batch)
save_watermark(last_sync_watermark)

2. Use Recommended Query Parameters

Always include sorting parameters for reliable CDC:

GET /files?updatedAfter={watermark}&sortBy=updatedAt&sortOrder=ASC&limit=100

3. Handle Pagination Completely

Paginate through all pages before updating your watermark:

# Pseudocode
page = 1
all_records = []

while True:
    response = get_files(updatedAfter=watermark, page=page, limit=100)
    all_records.extend(response['files'])

    if len(response['files']) < 100:
        break
    page += 1

# Only update watermark after all pages are processed
if all_records:
    new_watermark = max(r['updatedAt'] for r in all_records)
    save_watermark(new_watermark)

4. Deduplicate Before Loading

Always deduplicate records by fileId + updatedAt before inserting into your data warehouse.

5. Schedule Periodic Full Syncs

Run full syncs (e.g., weekly) to:

Catch any records missed due to pagination issues
Detect deleted records by comparing with existing data

Snowflake Integration Example

Initial Load
Incremental Load

-- First sync: load all files
COPY INTO raw_files
FROM (
  SELECT $1:fileId, $1:fileName, $1:updatedAt, ...
  FROM @fileai_stage/files.json
);

-- Create staging table for incremental data
CREATE TEMPORARY TABLE staging_files AS
SELECT * FROM raw_files WHERE 1=0;

-- Load incremental data
COPY INTO staging_files
FROM (
  SELECT $1:fileId, $1:fileName, $1:updatedAt, ...
  FROM @fileai_stage/incremental_files.json
);

-- Merge with deduplication
MERGE INTO raw_files t
USING (
  SELECT * FROM staging_files
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY file_id
    ORDER BY updated_at DESC
  ) = 1
) s
ON t.file_id = s.file_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET
    file_name = s.file_name,
    updated_at = s.updated_at,
    ...
WHEN NOT MATCHED THEN
  INSERT (file_id, file_name, updated_at, ...)
  VALUES (s.file_id, s.file_name, s.updated_at, ...);

Response Schema

{
  "files": [
    {
      "fileId": "53d6a0b1-2a8d-4ed9-9e6a-ceaef7ca3908",
      "fileName": "invoice.pdf",
      "fileType": "application/pdf",
      "fileSize": 261928,
      "fileStoragePath": "path/to/file.pdf",
      "fileHash": "be3ef5fbb21e31c1281300b23b1c918a8ba54427c799aea21865f68d5efd01b7",
      "uploadId": "f2538513-f0b9-4aa8-9c57-bc0a85c77de6",
      "status": "processed",
      "currency": "USD",
      "summary": "Invoice from Example Corp",
      "duplicateToFileId": null,
      "referenceId": "1e70a5e860",
      "isDuplicate": false,
      "isEmbedded": false,
      "schemaId": "6835aca6030a79ffaabca742",
      "fileClass": "Invoice",
      "fileContactId": "6835aca2281d9ed1bab90b11",
      "fileContactName": "Example Company",
      "createdAt": "2025-05-27T12:14:24.258Z",
      "updatedAt": "2025-05-27T14:30:00.123Z"
    }
  ],
  "count": 1,
  "currentPage": 1
}

Quick Reference

Feature	Behavior
Filter semantics	`>=` (inclusive)
Timestamp format	ISO 8601, UTC, milliseconds
Pagination	Offset-based (`page`, `limit`)
Default sort	`createdAt ASC`
Recommended sort for CDC	`updatedAt ASC`
Delete visibility	❌ Not visible (insert/update only)
Deduplication required	✅ Yes, by `fileId + updatedAt`

Need Help?

Contact Support

Reach out to the fileAI engineering team for questions about CDC implementation or to request new features.

fileAI API

Key Functions

Endpoints

Incremental Data Loading

Overview

Timestamp Filters

updatedAfter

createdAfter

Timestamp Contract

Watermark Semantics

Inclusive Comparison (`>=`)

Deduplication Requirement

Default Sort Order

Recommended Sort for CDC

What Triggers `updatedAt`?

Delete Handling

Tracking Deletions

Best Practices

Snowflake Integration Example

Response Schema

Quick Reference

Need Help?

Contact Support

fileAI API

Key Functions

Endpoints

​Overview

​Timestamp Filters

updatedAfter

createdAfter

​Timestamp Contract

​Watermark Semantics

​Inclusive Comparison (>=)

​Deduplication Requirement

​Ordering & Pagination

​Default Sort Order

​Recommended Sort for CDC

​Pagination Parameters

​Pagination Stability Warning

​What Triggers updatedAt?

​Delete Handling

​Tracking Deletions

​Best Practices

​Snowflake Integration Example

​Response Schema

​Quick Reference

​Need Help?

Contact Support

Overview

Timestamp Filters

Timestamp Contract

Watermark Semantics

Inclusive Comparison (`>=`)

Deduplication Requirement

Ordering & Pagination

Default Sort Order

Recommended Sort for CDC

Pagination Parameters

Pagination Stability Warning

What Triggers `updatedAt`?

Delete Handling

Tracking Deletions

Best Practices

Snowflake Integration Example

Response Schema

Quick Reference

Need Help?