
Overview

The GET /files endpoint supports incremental data loading through timestamp-based filters. This enables efficient Change Data Capture (CDC) workflows for syncing to Snowflake, BigQuery, or other data warehouses.
Instead of fetching all files on every sync, use updatedAfter or createdAfter to retrieve only what’s changed since your last sync.

Timestamp Filters

updatedAfter

Returns only files whose updatedAt is on or after the specified timestamp.
GET /files?updatedAfter=2025-01-01T00:00:00.000Z

createdAfter

Returns only files whose createdAt is on or after the specified timestamp.
GET /files?createdAfter=2025-01-01T00:00:00.000Z

Timestamp Contract

Property | Value
Format | ISO 8601
Timezone | UTC
Precision | Milliseconds
Example | 2025-01-01T00:00:00.000Z
All timestamps in the response (createdAt, updatedAt) are returned in UTC with millisecond precision.
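The timestamp contract above can be sketched in Python as a pair of helpers, one to format a filter value and one to parse a response value. The function names are hypothetical and not part of the API; the sketch assumes the API accepts exactly the format shown in the example (three fractional digits and a trailing Z).

```python
from datetime import datetime, timezone

# Format used by the API examples: ISO 8601, UTC, millisecond precision
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def to_api_timestamp(dt: datetime) -> str:
    """Format a timezone-aware datetime for updatedAfter / createdAfter."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = dt.astimezone(timezone.utc)
    # timespec="milliseconds" truncates microseconds to three digits
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def parse_api_timestamp(value: str) -> datetime:
    """Parse a createdAt / updatedAt string from the response into UTC."""
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)
```

Because every timestamp has the same fixed-width UTC layout, the raw strings also sort chronologically, so comparing them as plain strings is usually sufficient for watermark tracking.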

Watermark Semantics

Inclusive Comparison (>=)

The updatedAfter and createdAfter filters use inclusive comparison. This means:
  • A file with updatedAt = 2025-01-01T12:00:00.000Z will be returned when querying with updatedAfter=2025-01-01T12:00:00.000Z

Deduplication Requirement

Because of inclusive comparison, clients may receive duplicate records across paginated requests or subsequent syncs. You must deduplicate using fileId + updatedAt.
Example deduplication in Snowflake:
MERGE INTO target_table t
USING staging_table s
ON t.file_id = s.file_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...
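If you prefer to deduplicate before loading into staging, the same rule can be applied client-side. The following is a minimal sketch, assuming records are dictionaries shaped like the entries in the response's files array; the function name dedupe_latest is hypothetical.

```python
def dedupe_latest(records):
    """Keep one record per fileId, preferring the greatest updatedAt."""
    latest = {}
    for rec in records:
        current = latest.get(rec["fileId"])
        # Fixed-width UTC ISO 8601 strings compare chronologically as strings
        if current is None or rec["updatedAt"] > current["updatedAt"]:
            latest[rec["fileId"]] = rec
    return list(latest.values())
```

Exact repeats (same fileId and same updatedAt) are dropped, and an older version of a file is replaced by a newer one, so staging ends up with at most one row per file.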

Ordering & Pagination

Default Sort Order

Parameter | Default Value
sortBy | createdAt
sortOrder | ASC
When using updatedAfter for incremental loading, always sort by updatedAt in ascending order for reliable watermark tracking.
GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC

Pagination Parameters

The API uses offset-based pagination with page and limit parameters.
Parameter | Default | Maximum
page | 1 | -
limit | 100 | 100
Example request:
GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC&page=1&limit=100
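As an illustration, the request path above can be assembled with the standard library. This is a sketch only: build_files_query is a hypothetical helper, and the base URL and authentication are not covered in this section, so they are left out.

```python
from urllib.parse import urlencode

MAX_LIMIT = 100  # maximum page size documented for the limit parameter


def build_files_query(updated_after: str, page: int = 1, limit: int = MAX_LIMIT) -> str:
    """Build the path and query string for an incremental GET /files request."""
    if page < 1:
        raise ValueError("page starts at 1")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    params = {
        "updatedAfter": updated_after,
        "sortBy": "updatedAt",   # recommended sort for watermark tracking
        "sortOrder": "ASC",
        "page": page,
        "limit": limit,
    }
    # safe=":" keeps the colons in the timestamp unescaped, as in the examples
    return "/files?" + urlencode(params, safe=":")
```

Calling build_files_query("2025-01-01T00:00:00.000Z") produces the same path and query string as the example request.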

Pagination Stability Warning

Offset-based pagination may produce inconsistent results if records are updated during pagination. You may miss records or see duplicates.
Mitigation strategies:
  1. Use smaller time windows for updatedAfter
  2. Always deduplicate by fileId + updatedAt
  3. Re-sync periodically with a larger time window to catch missed records
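One way to implement the third strategy is to move the stored watermark back by a fixed lookback before a periodic re-sync. The sketch below assumes the timestamp format from the Timestamp Contract; the function name and the choice of lookback duration are illustrative assumptions, not API requirements.

```python
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def widen_watermark(watermark: str, lookback: timedelta) -> str:
    """Return the watermark moved back by `lookback`, in the API's format."""
    parsed = datetime.strptime(watermark, TS_FORMAT).replace(tzinfo=timezone.utc)
    widened = parsed - lookback
    return widened.isoformat(timespec="milliseconds").replace("+00:00", "Z")
```

A re-sync that starts from the widened watermark will return records you have already loaded, which is why deduplication remains necessary.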

What Triggers updatedAt?

The updatedAt timestamp changes when any of these events occur:
Event | Updates updatedAt
File metadata changes | ✅ Yes
Schema/form field updates | ✅ Yes
Document status changes | ✅ Yes
Form filling/reprocessing | ✅ Yes
Tag modifications | ✅ Yes
Contact/vendor updates | ✅ Yes
Workflow step changes | ✅ Yes
Classification changes | ✅ Yes
OCR reprocessing | ✅ Yes

Delete Handling

Important: Deleted files are excluded from the GET /files response and do not appear in updatedAfter queries. This endpoint provides insert/update only — not full CDC.

Tracking Deletions

If you need to detect deleted files:
  1. Option A: Periodic Full Sync. Do a full sync periodically and compare with your existing data to detect missing records.
  2. Option B: Deletion Events Endpoint. Contact the fileAI team about a dedicated deletion events endpoint (roadmap item).
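For Option A, the comparison reduces to a set difference between the file IDs already in your warehouse and the IDs returned by a complete full sync. A minimal sketch, with a hypothetical function name:

```python
def find_deleted_ids(warehouse_ids, api_ids):
    """Return fileIds present in the warehouse but absent from a full sync."""
    return set(warehouse_ids) - set(api_ids)
```

This is only meaningful when api_ids comes from a full sync that paginated through every page; a partial fetch would make files that still exist look deleted.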

Best Practices

After each successful sync, store the maximum updatedAt value from the batch:
# Pseudocode
last_sync_watermark = max(record['updatedAt'] for record in batch)
save_watermark(last_sync_watermark)
Paginate through all pages before updating your watermark:
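# Pseudocode: get_files and save_watermark stand in for your own client code
LIMIT = 100
page = 1
all_records = []

while True:
    # Sort by updatedAt ascending, as recommended for watermark tracking
    response = get_files(updatedAfter=watermark, sortBy='updatedAt',
                         sortOrder='ASC', page=page, limit=LIMIT)
    all_records.extend(response['files'])

    # A short page means there are no more results
    if len(response['files']) < LIMIT:
        break
    page += 1

# Only update watermark after all pages are processed
if all_records:
    new_watermark = max(r['updatedAt'] for r in all_records)
    save_watermark(new_watermark)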
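The loop above can be made runnable by passing the page fetcher in as a function. The sketch below is illustrative: sync_all_pages and fake_fetch are hypothetical names, and fake_fetch is an in-memory stand-in for a real HTTP call that mimics the inclusive filter and updatedAt ASC ordering.

```python
from typing import Callable, Dict, List, Tuple


def sync_all_pages(fetch_page: Callable[[str, int, int], Dict],
                   watermark: str, limit: int = 100) -> Tuple[List[Dict], str]:
    """Fetch every page for a watermark, then compute the next watermark."""
    records: List[Dict] = []
    page = 1
    while True:
        files = fetch_page(watermark, page, limit)["files"]
        records.extend(files)
        if len(files) < limit:  # a short page means no more results
            break
        page += 1
    # Keep the old watermark when nothing new was returned
    new_watermark = max((r["updatedAt"] for r in records), default=watermark)
    return records, new_watermark


# In-memory stand-in for the API, for demonstration only
DATA = [{"fileId": f"f{i}", "updatedAt": f"2025-01-01T00:00:{i:02d}.000Z"}
        for i in range(5)]


def fake_fetch(updated_after: str, page: int, limit: int) -> Dict:
    matching = sorted((r for r in DATA if r["updatedAt"] >= updated_after),
                      key=lambda r: r["updatedAt"])
    start = (page - 1) * limit
    return {"files": matching[start:start + limit]}


records, next_watermark = sync_all_pages(fake_fetch, "2025-01-01T00:00:02.000Z", limit=2)
```

Returning the new watermark instead of saving it inside the loop lets the caller persist it only after the batch has been loaded successfully.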
Always deduplicate records by fileId + updatedAt before inserting into your data warehouse.
Run full syncs (e.g., weekly) to:
  • Catch any records missed due to pagination issues
  • Detect deleted records by comparing with existing data

Snowflake Integration Example

-- First sync: load all files
COPY INTO raw_files
FROM (
  SELECT $1:fileId, $1:fileName, $1:updatedAt, ...
  FROM @fileai_stage/files.json
);

Response Schema

{
  "files": [
    {
      "fileId": "53d6a0b1-2a8d-4ed9-9e6a-ceaef7ca3908",
      "fileName": "invoice.pdf",
      "fileType": "application/pdf",
      "fileSize": 261928,
      "fileStoragePath": "path/to/file.pdf",
      "fileHash": "be3ef5fbb21e31c1281300b23b1c918a8ba54427c799aea21865f68d5efd01b7",
      "uploadId": "f2538513-f0b9-4aa8-9c57-bc0a85c77de6",
      "status": "processed",
      "currency": "USD",
      "summary": "Invoice from Example Corp",
      "duplicateToFileId": null,
      "referenceId": "1e70a5e860",
      "isDuplicate": false,
      "isEmbedded": false,
      "schemaId": "6835aca6030a79ffaabca742",
      "fileClass": "Invoice",
      "fileContactId": "6835aca2281d9ed1bab90b11",
      "fileContactName": "Example Company",
      "createdAt": "2025-05-27T12:14:24.258Z",
      "updatedAt": "2025-05-27T14:30:00.123Z"
    }
  ],
  "count": 1,
  "currentPage": 1
}

Quick Reference

Feature | Behavior
Filter semantics | >= (inclusive)
Timestamp format | ISO 8601, UTC, milliseconds
Pagination | Offset-based (page, limit)
Default sort | createdAt ASC
Recommended sort for CDC | updatedAt ASC
Delete visibility | ❌ Not visible (insert/update only)
Deduplication required | ✅ Yes, by fileId + updatedAt

Need Help?

Contact Support

Reach out to the fileAI engineering team for questions about CDC implementation or to request new features.