Crawling

We support crawling and indexing media alongside text to enable multimodal search. We currently support:

  • Images

When building an Index, you may specify fields as "crawlable" (see Index Configuration). If a crawlable field contains a URL to a resource that matches one of the supported MIME types, that field will be crawled.

Supported fields

A crawlable field may be either a single string or an array of strings, where each string contains only a URL. For example:

  • "field_name": "https://abc.com/xyz.png"
  • "images": ["https://abc.com/xyz.png", "https://abc.com/123.jpg"]

Supported media types

We support the following MIME types:

  • Images
    • image/jpeg
    • image/png
    • image/apng
    • image/x-png
    • image/gif
    • image/webp
    • image/avif
    • image/tiff
    • image/svg
    • image/svg+xml

Media type limits

Images are limited to 34 megapixels (MP), enough for an 8K UHD (7680 × 4320) resolution image.
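If you want to verify that an image falls within this limit before indexing it, a quick local check with Pillow might look like the following sketch (the threshold mirrors the 34 MP limit above; the file name is illustrative):

from PIL import Image

MAX_MEGAPIXELS = 34  # matches the limit described above

def within_pixel_limit(path: str) -> bool:
    # Pillow reads the image header lazily, so this does not load the full image into memory.
    with Image.open(path) as img:
        width, height = img.size
    return width * height <= MAX_MEGAPIXELS * 1_000_000

print(within_pixel_limit("product_photo.png"))  # illustrative file name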

Accessing secure content

We support multiple mechanisms for accessing secure content. If you don't see a mechanism that meets your needs, please get in touch; we'd love to work with you to support it!

Signed URLs / Tokens

S3 signed URL

Generate temporary signed URLs for S3 resources that are included in objects pushed to our API.
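For example, a signed URL for an object in S3 can be generated with boto3 before you include it in the object you push (the bucket name, key, and expiry below are illustrative):

import boto3

s3 = boto3.client("s3")

# Generate a temporary signed URL for the resource (illustrative bucket and key).
signed_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-media-bucket", "Key": "images/xyz.png"},
    ExpiresIn=3600,  # the URL expires after one hour
)

# Include the signed URL in the object you push, e.g. {"image": signed_url}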

Custom token auth

Add an access token to the URL that you push to the Objective Search API. For example, if you push the following object with a secure resource:

{
  "secure_page": "https://mysecuresite.com"
}

Update the server that serves this page to support a token query parameter for authentication. Then, update the object you send to Objective Search to contain a temporary access token:

{
  "secure_page": "https://mysecuresite.com?token=abcdef123"
}
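A minimal sketch of the pushing side, assuming your server accepts a token query parameter (how the temporary token is minted is up to your own auth system; the token value and URL below are illustrative):

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def with_access_token(url: str, token: str) -> str:
    # Append a `token` query parameter, preserving any existing parameters.
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["token"] = token
    return urlunparse(parts._replace(query=urlencode(query)))

# Replace with a temporary token minted by your own auth system.
token = "abcdef123"
obj = {"secure_page": with_access_token("https://mysecuresite.com", token)}
# Push `obj` to the Objective Search API as usual.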

IAM role

For resources hosted on AWS, we can create an IAM role with access to the resource so that our crawler can retrieve it. Please get in touch for more info about this method.

HTTP Header

If the resource can be accessed using an HTTP auth method such as a Bearer token, we can issue a token and use it during the crawling process. Please get in touch for more info about this method.

How are resources updated?

Our crawler caches results using the URL as the key. Once a resource has been crawled, its representation in the index will not be updated unless the URL changes. This avoids putting unnecessary load on the servers the resource is retrieved from.

Forcing a resource to be updated

To force a resource to be updated, change the URL of the resource. One common strategy is to insert a timestamp into the URL. For example, if you have https://example.com and you want to update its representation in the Catalog, change the URL to https://example.com?updated_at=YYYYmmddTHHMMSS. This will cause our crawler to fetch the resource again when the object is updated.
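A small sketch of this cache-busting approach, reusing the timestamp format above (the URL is illustrative):

from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def with_updated_at(url: str) -> str:
    # Append an `updated_at` timestamp so the crawler treats this as a new URL and re-fetches it.
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["updated_at"] = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return urlunparse(parts._replace(query=urlencode(query)))

print(with_updated_at("https://example.com"))
# e.g. https://example.com?updated_at=20240101T120000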

Not currently supported

  • Crawling non-public content
  • Crawling URLs from within unstructured content (e.g. text)
  • Non-HTTP resources (e.g. S3, etc.)
