> ## Documentation Index
> Fetch the complete documentation index at: https://docs.file.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Incremental Data Loading

> Sync files incrementally to data warehouses using Change Data Capture patterns with the GET /files endpoint.

## Overview

The `GET /files` endpoint supports **incremental data loading** through timestamp-based filters. This enables efficient Change Data Capture (CDC) workflows for syncing to Snowflake, BigQuery, or other data warehouses.

<Info>
  Instead of fetching all files on every sync, use `updatedAfter` or
  `createdAfter` to retrieve only what's changed since your last sync.
</Info>

***

## Timestamp Filters

<CardGroup cols={2}>
  <Card title="updatedAfter" icon="clock-rotate-left">
    Filters files updated **on or after** the specified timestamp. `GET
            /files?updatedAfter=2025-01-01T00:00:00.000Z`
  </Card>

  <Card title="createdAfter" icon="calendar-plus">
    Filters files created **on or after** the specified timestamp. `GET
            /files?createdAfter=2025-01-01T00:00:00.000Z`
  </Card>
</CardGroup>

### Timestamp Contract

| Property      | Value                      |
| ------------- | -------------------------- |
| **Format**    | ISO 8601                   |
| **Timezone**  | UTC                        |
| **Precision** | Milliseconds               |
| **Example**   | `2025-01-01T00:00:00.000Z` |

<Note>
  All timestamps in the response (`createdAt`, `updatedAt`) are returned in UTC
  with millisecond precision.
</Note>

***

## Watermark Semantics

### Inclusive Comparison (`>=`)

The `updatedAfter` and `createdAfter` filters use **inclusive** comparison. This means:

* A file with `updatedAt = 2025-01-01T12:00:00.000Z` **will be returned** when querying with `updatedAfter=2025-01-01T12:00:00.000Z`

### Deduplication Requirement

<Warning>
  Because of inclusive comparison, **clients may receive duplicate records**
  across paginated requests or subsequent syncs. You must deduplicate using
  `fileId + updatedAt`.
</Warning>

**Example deduplication in Snowflake:**

```sql theme={null}
MERGE INTO target_table t
USING staging_table s
ON t.file_id = s.file_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...
```

***

## Ordering & Pagination

### Default Sort Order

| Parameter   | Default Value |
| ----------- | ------------- |
| `sortBy`    | `createdAt`   |
| `sortOrder` | `ASC`         |

### Recommended Sort for CDC

<Tip>
  When using `updatedAfter` for incremental loading, always sort by `updatedAt`
  in ascending order for reliable watermark tracking.
</Tip>

```bash theme={null}
GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC
```

### Pagination Parameters

The API uses **offset-based pagination** with `page` and `limit` parameters.

| Parameter | Default | Maximum |
| --------- | ------- | ------- |
| `page`    | 1       | —       |
| `limit`   | 100     | 100     |

**Example request:**

```bash theme={null}
GET /files?updatedAfter=2025-01-01T00:00:00.000Z&sortBy=updatedAt&sortOrder=ASC&page=1&limit=100
```

### Pagination Stability Warning

<Warning>
  **Offset-based pagination may produce inconsistent results** if records are
  updated during pagination. You may miss records or see duplicates.
</Warning>

**Mitigation strategies:**

1. Use smaller time windows for `updatedAfter`
2. Always deduplicate by `fileId + updatedAt`
3. Re-sync periodically with a larger time window to catch missed records

***

## What Triggers `updatedAt`?

The `updatedAt` timestamp changes when any of these events occur:

| Event                     | Updates `updatedAt` |
| ------------------------- | ------------------- |
| File metadata changes     | ✅ Yes               |
| Schema/form field updates | ✅ Yes               |
| Document status changes   | ✅ Yes               |
| Form filling/reprocessing | ✅ Yes               |
| Tag modifications         | ✅ Yes               |
| Contact/vendor updates    | ✅ Yes               |
| Workflow step changes     | ✅ Yes               |
| Classification changes    | ✅ Yes               |
| OCR reprocessing          | ✅ Yes               |

***

## Delete Handling

<Warning>
  **Important**: Deleted files are **excluded** from the `GET /files` response
  and do **not** appear in `updatedAfter` queries. This endpoint provides
  **insert/update only** — not full CDC.
</Warning>

### Tracking Deletions

If you need to detect deleted files:

<Steps>
  <Step title="Option A: Periodic Full Sync">
    Do a full sync periodically and compare with your existing data to detect
    missing records.
  </Step>

  <Step title="Option B: Deletion Events Endpoint">
    Contact the fileAI team about a dedicated deletion events endpoint (roadmap
    item).
  </Step>
</Steps>

***

## Best Practices

<AccordionGroup>
  <Accordion title="1. Store Your Watermark" icon="bookmark">
    After each successful sync, store the **maximum `updatedAt`** value from the batch:

    ```python theme={null}
    # Pseudocode
    last_sync_watermark = max(record['updatedAt'] for record in batch)
    save_watermark(last_sync_watermark)
    ```
  </Accordion>

  <Accordion title="2. Use Recommended Query Parameters" icon="sliders">
    Always include sorting parameters for reliable CDC:

    ```bash theme={null}
    GET /files?updatedAfter={watermark}&sortBy=updatedAt&sortOrder=ASC&limit=100
    ```
  </Accordion>

  <Accordion title="3. Handle Pagination Completely" icon="layer-group">
    Paginate through **all pages** before updating your watermark:

    ```python theme={null}
    # Pseudocode
    page = 1
    all_records = []

    while True:
        response = get_files(updatedAfter=watermark, page=page, limit=100)
        all_records.extend(response['files'])

        if len(response['files']) < 100:
            break
        page += 1

    # Only update watermark after all pages are processed
    if all_records:
        new_watermark = max(r['updatedAt'] for r in all_records)
        save_watermark(new_watermark)
    ```
  </Accordion>

  {' '}

  <Accordion title="4. Deduplicate Before Loading" icon="filter">
    Always deduplicate records by `fileId + updatedAt` before inserting into your
    data warehouse.
  </Accordion>

  <Accordion title="5. Schedule Periodic Full Syncs" icon="calendar">
    Run full syncs (e.g., weekly) to:

    * Catch any records missed due to pagination issues
    * Detect deleted records by comparing with existing data
  </Accordion>
</AccordionGroup>

***

## Snowflake Integration Example

<Tabs>
  <Tab title="Initial Load">
    ```sql theme={null}
    -- First sync: load all files
    COPY INTO raw_files
    FROM (
      SELECT $1:fileId, $1:fileName, $1:updatedAt, ...
      FROM @fileai_stage/files.json
    );
    ```
  </Tab>

  <Tab title="Incremental Load">
    ```sql theme={null}
    -- Create staging table for incremental data
    CREATE TEMPORARY TABLE staging_files AS
    SELECT * FROM raw_files WHERE 1=0;

    -- Load incremental data
    COPY INTO staging_files
    FROM (
      SELECT $1:fileId, $1:fileName, $1:updatedAt, ...
      FROM @fileai_stage/incremental_files.json
    );

    -- Merge with deduplication
    MERGE INTO raw_files t
    USING (
      SELECT * FROM staging_files
      QUALIFY ROW_NUMBER() OVER (
        PARTITION BY file_id
        ORDER BY updated_at DESC
      ) = 1
    ) s
    ON t.file_id = s.file_id
    WHEN MATCHED AND s.updated_at > t.updated_at THEN
      UPDATE SET
        file_name = s.file_name,
        updated_at = s.updated_at,
        ...
    WHEN NOT MATCHED THEN
      INSERT (file_id, file_name, updated_at, ...)
      VALUES (s.file_id, s.file_name, s.updated_at, ...);
    ```
  </Tab>
</Tabs>

***

## Response Schema

```json theme={null}
{
  "files": [
    {
      "fileId": "53d6a0b1-2a8d-4ed9-9e6a-ceaef7ca3908",
      "fileName": "invoice.pdf",
      "fileType": "application/pdf",
      "fileSize": 261928,
      "fileStoragePath": "path/to/file.pdf",
      "fileHash": "be3ef5fbb21e31c1281300b23b1c918a8ba54427c799aea21865f68d5efd01b7",
      "uploadId": "f2538513-f0b9-4aa8-9c57-bc0a85c77de6",
      "status": "processed",
      "currency": "USD",
      "summary": "Invoice from Example Corp",
      "duplicateToFileId": null,
      "referenceId": "1e70a5e860",
      "isDuplicate": false,
      "isEmbedded": false,
      "schemaId": "6835aca6030a79ffaabca742",
      "fileClass": "Invoice",
      "fileContactId": "6835aca2281d9ed1bab90b11",
      "fileContactName": "Example Company",
      "createdAt": "2025-05-27T12:14:24.258Z",
      "updatedAt": "2025-05-27T14:30:00.123Z"
    }
  ],
  "count": 1,
  "currentPage": 1
}
```

***

## Quick Reference

| Feature                  | Behavior                           |
| ------------------------ | ---------------------------------- |
| Filter semantics         | `>=` (inclusive)                   |
| Timestamp format         | ISO 8601, UTC, milliseconds        |
| Pagination               | Offset-based (`page`, `limit`)     |
| Default sort             | `createdAt ASC`                    |
| Recommended sort for CDC | `updatedAt ASC`                    |
| Delete visibility        | ❌ Not visible (insert/update only) |
| Deduplication required   | ✅ Yes, by `fileId + updatedAt`     |

***

## Need Help?

<Card title="Contact Support" icon="envelope" href="mailto:support@file.ai">
  Reach out to the fileAI engineering team for questions about CDC
  implementation or to request new features.
</Card>
