> ## Documentation Index > Fetch the complete documentation index at: https://docs.file.ai/llms.txt > Use this file to discover all available pages before exploring further. # Incremental Data Loading > Sync files incrementally to data warehouses using Change Data Capture patterns with the GET /files endpoint. ## Overview The `GET /files` endpoint supports **incremental data loading** through timestamp-based filters. This enables efficient Change Data Capture (CDC) workflows for syncing to Snowflake, BigQuery, or other data warehouses. Instead of fetching all files on every sync, use `updatedAfter` or `createdAfter` to retrieve only what's changed since your last sync. *** ## Timestamp Filters Filters files updated **on or after** the specified timestamp. `GET /files?updatedAfter=2025-01-01T00:00:00.000Z` Filters files created **on or after** the specified timestamp. `GET /files?createdAfter=2025-01-01T00:00:00.000Z` ### Timestamp Contract | Property | Value | | ------------- | -------------------------- | | **Format** | ISO 8601 | | **Timezone** | UTC | | **Precision** | Milliseconds | | **Example** | `2025-01-01T00:00:00.000Z` | All timestamps in the response (`createdAt`, `updatedAt`) are returned in UTC with millisecond precision. *** ## Watermark Semantics ### Inclusive Comparison (`>=`) The `updatedAfter` and `createdAfter` filters use **inclusive** comparison. This means: * A file with `updatedAt = 2025-01-01T12:00:00.000Z` **will be returned** when querying with `updatedAfter=2025-01-01T12:00:00.000Z` ### Deduplication Requirement Because of inclusive comparison, **clients may receive duplicate records** across paginated requests or subsequent syncs. You must deduplicate using `fileId + updatedAt`. **Example deduplication in Snowflake:** ```sql theme={null} MERGE INTO target_table t USING staging_table s ON t.file_id = s.file_id WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ... WHEN NOT MATCHED THEN INSERT ... ``` *** ## Ordering & Pagination ### Default Sort Order | Parameter | Default Value | | ----------- | ------------- | | `sortBy` | `createdAt` | | `sortOrder` | `ASC` | ### Recommended Sort for CDC When using `updatedAfter` for incremental loading, always sort by `updatedAt` in ascending order for reliable watermark tracking. ```bash theme={null} GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC ``` ### Pagination Parameters The API uses **offset-based pagination** with `page` and `limit` parameters. | Parameter | Default | Maximum | | --------- | ------- | ------- | | `page` | 1 | — | | `limit` | 100 | 100 | **Example request:** ```bash theme={null} GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC&page=1&limit=100 ``` ### Pagination Stability Warning **Offset-based pagination may produce inconsistent results** if records are updated during pagination. You may miss records or see duplicates. **Mitigation strategies:** 1. Use smaller time windows for `updatedAfter` 2. Always deduplicate by `fileId + updatedAt` 3. Re-sync periodically with a larger time window to catch missed records *** ## What Triggers `updatedAt`? The `updatedAt` timestamp changes when any of these events occur: | Event | Updates `updatedAt` | | ------------------------- | ------------------- | | File metadata changes | ✅ Yes | | Schema/form field updates | ✅ Yes | | Document status changes | ✅ Yes | | Form filling/reprocessing | ✅ Yes | | Tag modifications | ✅ Yes | | Contact/vendor updates | ✅ Yes | | Workflow step changes | ✅ Yes | | Classification changes | ✅ Yes | | OCR reprocessing | ✅ Yes | *** ## Delete Handling **Important**: Deleted files are **excluded** from the `GET /files` response and do **not** appear in `updatedAfter` queries. This endpoint provides **insert/update only** — not full CDC. ### Tracking Deletions If you need to detect deleted files: Do a full sync periodically and compare with your existing data to detect missing records. Contact the fileAI team about a dedicated deletion events endpoint (roadmap item). *** ## Best Practices After each successful sync, store the **maximum `updatedAt`** value from the batch: ```python theme={null} # Pseudocode last_sync_watermark = max(record['updatedAt'] for record in batch) save_watermark(last_sync_watermark) ``` Always include sorting parameters for reliable CDC: ```bash theme={null} GET /files?updatedAfter={watermark}&sortBy=updatedAt&sortOrder=ASC&limit=100 ``` Paginate through **all pages** before updating your watermark: ```python theme={null} # Pseudocode page = 1 all_records = [] while True: response = get_files(updatedAfter=watermark, page=page, limit=100) all_records.extend(response['files']) if len(response['files']) < 100: break page += 1 # Only update watermark after all pages are processed if all_records: new_watermark = max(r['updatedAt'] for r in all_records) save_watermark(new_watermark) ``` {' '} Always deduplicate records by `fileId + updatedAt` before inserting into your data warehouse. Run full syncs (e.g., weekly) to: * Catch any records missed due to pagination issues * Detect deleted records by comparing with existing data *** ## Snowflake Integration Example ```sql theme={null} -- First sync: load all files COPY INTO raw_files FROM ( SELECT $1:fileId, $1:fileName, $1:updatedAt, ... FROM @fileai_stage/files.json ); ``` ```sql theme={null} -- Create staging table for incremental data CREATE TEMPORARY TABLE staging_files AS SELECT * FROM raw_files WHERE 1=0; -- Load incremental data COPY INTO staging_files FROM ( SELECT $1:fileId, $1:fileName, $1:updatedAt, ... FROM @fileai_stage/incremental_files.json ); -- Merge with deduplication MERGE INTO raw_files t USING ( SELECT * FROM staging_files QUALIFY ROW_NUMBER() OVER ( PARTITION BY file_id ORDER BY updated_at DESC ) = 1 ) s ON t.file_id = s.file_id WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET file_name = s.file_name, updated_at = s.updated_at, ... WHEN NOT MATCHED THEN INSERT (file_id, file_name, updated_at, ...) VALUES (s.file_id, s.file_name, s.updated_at, ...); ``` *** ## Response Schema ```json theme={null} { "files": [ { "fileId": "53d6a0b1-2a8d-4ed9-9e6a-ceaef7ca3908", "fileName": "invoice.pdf", "fileType": "application/pdf", "fileSize": 261928, "fileStoragePath": "path/to/file.pdf", "fileHash": "be3ef5fbb21e31c1281300b23b1c918a8ba54427c799aea21865f68d5efd01b7", "uploadId": "f2538513-f0b9-4aa8-9c57-bc0a85c77de6", "status": "processed", "currency": "USD", "summary": "Invoice from Example Corp", "duplicateToFileId": null, "referenceId": "1e70a5e860", "isDuplicate": false, "isEmbedded": false, "schemaId": "6835aca6030a79ffaabca742", "fileClass": "Invoice", "fileContactId": "6835aca2281d9ed1bab90b11", "fileContactName": "Example Company", "createdAt": "2025-05-27T12:14:24.258Z", "updatedAt": "2025-05-27T14:30:00.123Z" } ], "count": 1, "currentPage": 1 } ``` *** ## Quick Reference | Feature | Behavior | | ------------------------ | ---------------------------------- | | Filter semantics | `>=` (inclusive) | | Timestamp format | ISO 8601, UTC, milliseconds | | Pagination | Offset-based (`page`, `limit`) | | Default sort | `createdAt ASC` | | Recommended sort for CDC | `updatedAt ASC` | | Delete visibility | ❌ Not visible (insert/update only) | | Deduplication required | ✅ Yes, by `fileId + updatedAt` | *** ## Need Help? Reach out to the fileAI engineering team for questions about CDC implementation or to request new features.