1. Globus Search API Overview

This document is intended to provide a high-level overview of the Globus Search (hereafter just "Search") API, introducing basic concepts and terminology.

This section should be sufficient to grasp the basic functioning of Search and how it can be used and applied to various use cases. It also discusses some of the more subtle design choices and their impact.

1.1. What is it for?

The Search API allows you to store metadata, set policy on its visibility and on who is allowed to modify it, and then retrieve that metadata through search queries.

[We intentionally keep the term "metadata" vague and generic. Search can store whatever data you want to make searchable, and it's up to your application to decide what is and is not useful to store.]

A few examples of the service’s capabilities include:

  • On documents describing tweets, sorting by number of retweets

  • On documents describing novels, filtering to those published in 1870 or 1890, but not in the years in between

  • On documents describing clothing, counting the number of matching items which contain cotton, polyester, or both

  • On documents describing academic papers, counting those with Paul Erdos as an author, but not as the first author

Search is not "batteries included", however: in order for it to be able to do anything, you first need to provide it with your metadata.

1.2. Search Indices

A Search index is a place for you to store and search over your metadata. Every operation done against Search (storing metadata, setting permissions, and performing searches) is done with respect to a particular index.

Indices can be used to create logical groupings of metadata for your applications to use. They also provide a level of control and separation between sets of metadata, and can be used to enforce policies about who can write different datasets.

You might only have one index, or you might have one for testing and one for production data. You might want an index for each of your sources of metadata, or separate indices to describe data at separate stages in a workflow. The production vs. pre-production/testing split is particularly powerful, and is in fact a pattern which we recommend. Using separate indices for production and development allows developers to have a high level of access when working on new features, while protecting production from potentially harmful updates and deletions.

You would not, however, typically use a separate index for public vs. private data: for that, you can employ Search’s visibility features.

1.3. Metadata Format

To use the Search API, it’s necessary to understand the two basic concepts which are used to break down metadata:

  • Subjects

  • Entries

1.3.1. The Subject

Metadata stored in an index is meant to describe or annotate a particular entity a user may wish to locate or discover. We term that entity the subject of the metadata.

The principal representation of a search result is the subject, and typically, as in Google-style web searches, the subject represents a network-accessible object. Each subject is therefore expected to be a URL, and subjects are required to be unique per index.

We consider each subject to be an individual search result, and any metadata about the subject to be "reasons" why the subject was returned in a search or "descriptions" of that subject. As such, the subject is fundamental, and the data surrounding it is keyed off of it.

In structuring metadata for Search indexing, you should align the subject with the "most important" parts of your data — e.g. handles linking to papers are good subjects for an index of paper abstracts.

Note:

There is a possibility that the metadata surrounding subjects may be more important to your application than the subject itself. These are rare and unusual cases, but in these scenarios your subjects merely need to be identifiers for chunks of data.

1.3.2. The Entry

Even though a subject may only appear once per index, Search permits multiple pieces of metadata to be applied to that subject. We refer to each of these metadata blocks as an entry. A term you will often see in our documentation, GMetaEntry, is the API document type which represents an entry; the two are more or less synonymous.

The reasons for allowing multiple entries to reference a single subject are numerous:

  • Different users may provide distinct bits of metadata for the subject and we wish to permit them to do that independently of one another

  • Different sections of metadata may have different access control restrictions. Different entries may have different visibility rules even though they describe the same subject. Entries are therefore the smallest unit of visibility control

  • Entries may represent blocks of metadata which are tightly bound, so a search across multiple fields should rank a result more highly when all of the matching field values are within the same entry

To expand upon the example of an index of papers, you may have a link to each paper as a subject, then store the abstract, full text, and references as separate entries.

  • Perhaps an automated process provides the Abstracts, but Full Text and References go through a manual approval process. This requires distinct users to be able to "say things" about the same paper

  • Abstracts and References may be public, while Full Text is restricted to specific users. This requires differential visibility control

  • Searches for "dialuminum trioxide" should more highly rank results with both "dialuminum" and "trioxide" in the Abstract, and give much lower ranking to a result with "dialuminum" in the Full Text and "trioxide" in the References.
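The papers example can be pictured as three entries that share one subject. The following is a minimal sketch using plain Python dictionaries, not the actual ingest format: the field names `subject`, `id`, `visible_to`, and `content`, the `public` value, and the example URL are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical subject: a URL identifying one paper.
PAPER = "https://example.org/papers/1234"

# Identity URN (format described in the Principal URNs section) that is
# allowed to read the restricted full text in this illustration.
REVIEWER = "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"

# Three entries describing the same subject. Field names are assumptions.
entries = [
    {"subject": PAPER, "id": "abstract", "visible_to": ["public"],
     "content": {"text": "placeholder abstract"}},
    {"subject": PAPER, "id": "full_text", "visible_to": [REVIEWER],
     "content": {"text": "placeholder full text"}},
    {"subject": PAPER, "id": "references", "visible_to": ["public"],
     "content": {"citations": []}},
]


def group_by_subject(entries):
    """Collect entries under their subject, keyed by entry ID."""
    grouped = defaultdict(dict)
    for entry in entries:
        grouped[entry["subject"]][entry["id"]] = entry
    return dict(grouped)


by_subject = group_by_subject(entries)
print(sorted(by_subject[PAPER]))  # ['abstract', 'full_text', 'references']
```

Each entry carries its own visibility list, which is what makes the entry the smallest unit of visibility control.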

Entry IDs

Entries need to be distinguished from one another in the context of a given subject. This is done via a user-supplied value, the entry ID.

Some notes about entry IDs:

  • Entry IDs are just strings. They don’t need to conform to any specific format, but they may not contain the character !

  • Entry IDs must be unique per subject. Valid values for the Papers Index example include the simple strings "abstract", "full_text", "references" because there will only be one entry for each of these in a given subject

  • Entry IDs may be null or omitted, and these are equivalent. The uniqueness rule still applies, so only one entry per subject may have a null ID. This is convenient for cases in which there is only one entry per subject: the subject is then sufficient to uniquely specify its only entry
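The rules above can be checked on the client side before sending data. The following is a minimal sketch of such a check, written for illustration; `validate_entry_ids` is a hypothetical helper, not part of any Globus library, and it models a null or omitted ID as Python's `None`.

```python
def validate_entry_ids(entry_ids):
    """Check the entry IDs of ONE subject against the rules above.

    Returns a list of human-readable problems; an empty list means valid.
    """
    problems = []
    seen = set()
    for entry_id in entry_ids:
        # IDs are strings, or null (None) when omitted.
        if entry_id is not None and not isinstance(entry_id, str):
            problems.append(f"{entry_id!r} is not a string or null")
            continue
        # The only forbidden character is '!'.
        if entry_id is not None and "!" in entry_id:
            problems.append(f"{entry_id!r} contains '!'")
        # Uniqueness applies to null as well: at most one null ID.
        if entry_id in seen:
            problems.append(f"{entry_id!r} is used more than once")
        seen.add(entry_id)
    return problems


print(validate_entry_ids(["abstract", "full_text", "references"]))  # []
print(validate_entry_ids([None, None]))  # ['None is used more than once']
```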

2. Globus Search API Usage

How to use the Globus Search API.

Globus Search is an API which communicates purely over the HTTP protocol, and primarily uses a RESTful data model.

2.1.1. URLs

The Search API is reached at https://search.api.globus.org/

URLs and URIs in this documentation will usually omit this piece, so the text

GET /foo/bar

should be interpreted as referring to an HTTP GET request against

https://search.api.globus.org/foo/bar

Globus Preview URL

In the Globus Preview environment, the Search API can be found at https://search.api.preview.globus.org
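As an illustration of this convention, the sketch below expands a documentation-style path into a full URL for either environment. The helper name `full_url` is hypothetical, and `/foo/bar` is the same placeholder path used above rather than a real endpoint.

```python
PRODUCTION_BASE = "https://search.api.globus.org/"
PREVIEW_BASE = "https://search.api.preview.globus.org/"


def full_url(path, base=PRODUCTION_BASE):
    """Expand a documentation-style path such as /foo/bar into a full URL."""
    # Normalize slashes so the result has exactly one between base and path.
    return base.rstrip("/") + "/" + path.lstrip("/")


print(full_url("/foo/bar"))
# https://search.api.globus.org/foo/bar
print(full_url("/foo/bar", base=PREVIEW_BASE))
# https://search.api.preview.globus.org/foo/bar
```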

2.2. Authentication & Authorization

Some features of Search don’t have any authorization or authentication requirements. These can be accessed over HTTPS with no credentials.

2.2.1. Authorization Header

Globus Search authorizes access with Globus Auth. It therefore requires that any authenticated calls be made with a bearer Authorization header:

Authorization: Bearer <token_string>

The token string is an Access Token provided via Globus Auth.
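For example, a client might attach the header as follows. This is a minimal sketch using only the Python standard library; it builds the request without sending it, `authorized_request` is a hypothetical helper, and the URL uses the placeholder path `/foo/bar` rather than a real endpoint.

```python
import urllib.request


def authorized_request(url, token, method="GET"):
    """Build (but do not send) a request carrying a bearer token."""
    request = urllib.request.Request(url, method=method)
    request.add_header("Authorization", f"Bearer {token}")
    return request


# "<token_string>" stands in for an Access Token obtained from Globus Auth.
req = authorized_request("https://search.api.globus.org/foo/bar", "<token_string>")
print(req.get_header("Authorization"))  # Bearer <token_string>
```

Passing the resulting object to `urllib.request.urlopen` would perform the call.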

2.2.2. Scopes

Access tokens for Globus Search have one of the following scopes:

  • urn:globus:auth:scope:search.api.globus.org:ingest — The token authorizes the call to write data into Search

  • urn:globus:auth:scope:search.api.globus.org:search — The token authorizes the call to query data from Search

  • urn:globus:auth:scope:search.api.globus.org:all — The token authorizes the activities of both ingest and search
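The relationship between the three scopes can be summarized in a small lookup. This is an illustrative sketch only: the scope strings come from the list above, while the helper `scope_allows` and the activity labels `"ingest"` and `"search"` are names chosen for this example.

```python
SCOPE_PREFIX = "urn:globus:auth:scope:search.api.globus.org:"
INGEST = SCOPE_PREFIX + "ingest"
SEARCH = SCOPE_PREFIX + "search"
ALL = SCOPE_PREFIX + "all"

# Scopes that are sufficient for each kind of activity.
SUFFICIENT = {
    "ingest": {INGEST, ALL},
    "search": {SEARCH, ALL},
}


def scope_allows(token_scope, activity):
    """True if a token with this scope authorizes the given activity."""
    return token_scope in SUFFICIENT[activity]


print(scope_allows(SEARCH, "search"))  # True
print(scope_allows(SEARCH, "ingest"))  # False
print(scope_allows(ALL, "ingest"))     # True
```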

2.2.3. Auth Errors

If an invalid token is provided, or a call which requires authorization is missing any token, Globus Search will return an HTTP 401 Unauthorized.

If the token is valid, but the call requires permissions which the user does not have within Search, the API will return an HTTP 403 Forbidden.

If a resource is considered highly sensitive, it is possible that improperly authorized calls will return an HTTP 404 Not Found.
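A client can use these three status codes to decide how to react. The sketch below is one hypothetical way to map them to messages; the wording of the messages and the helper name `explain_auth_status` are assumptions, not API output.

```python
def explain_auth_status(status_code):
    """Map an HTTP status code to the auth situation described above."""
    explanations = {
        401: "token is missing or invalid",
        403: "token is valid, but the user lacks the required permissions in Search",
        404: "resource not found, or hidden because the call was not properly authorized",
    }
    return explanations.get(status_code, "not an auth error covered here")


print(explain_auth_status(401))  # token is missing or invalid
```

Note that a 404 is ambiguous by design: a client cannot tell a hidden sensitive resource apart from one that does not exist.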

Globus Search has its own notions of permissions which are enforced after successful authorization. In particular, a user may be

  • An admin of an index

  • A privileged_user (writer, but not admin) of an index

  • A member of the visible_to list of a document

These permissions are all evaluated against the linked identities and Globus Group memberships of a user.
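Conceptually, this evaluation is a set-membership test. The sketch below illustrates the idea for a document's visible_to list, assuming that the list holds Principal URNs and that a user is represented by the URNs of their linked identities and groups. It is a simplified model for illustration, not the service's actual implementation.

```python
def can_see(visible_to, linked_identities, groups):
    """True if any of the user's principals appears in visible_to.

    All arguments are iterables of Principal URN strings.
    """
    # A user acts as every linked identity and every group they belong to.
    principals = {urn.lower() for urn in linked_identities}
    principals |= {urn.lower() for urn in groups}
    return any(urn.lower() in principals for urn in visible_to)


visible_to = ["urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035"]
identities = ["urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"]
groups = ["urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035"]

print(can_see(visible_to, identities, groups))  # True
print(can_see(visible_to, identities, []))      # False
```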

3. Principal URNs

Principal URNs are a way of phrasing Globus Identities and Globus Groups as URNs.

The major improvement offered by the URN syntax is that it unambiguously associates a value with the correct type. That means that Group IDs are labeled as Groups, and Identity IDs are labeled as Identities.

3.1. What’s it look like?

Let’s jump in with some examples:

  • urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c

  • urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035

That’s it!

Prefix Identity IDs with urn:globus:auth:identity: and Group IDs with urn:globus:groups:id:
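In code, this amounts to string concatenation. The sketch below also validates the ID with the standard `uuid` module, on the assumption that Identity and Group IDs are always UUIDs as in the examples above; the helper names are hypothetical.

```python
import uuid

IDENTITY_PREFIX = "urn:globus:auth:identity:"
GROUP_PREFIX = "urn:globus:groups:id:"


def identity_urn(identity_id):
    """Build a Principal URN for an Identity ID."""
    # uuid.UUID raises ValueError on malformed input; str() yields lowercase.
    return IDENTITY_PREFIX + str(uuid.UUID(identity_id))


def group_urn(group_id):
    """Build a Principal URN for a Group ID."""
    return GROUP_PREFIX + str(uuid.UUID(group_id))


print(identity_urn("46bd0f56-e24f-11e5-a510-131bef46955c"))
# urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c
print(group_urn("fdb38a24-03c1-11e3-86f7-12313809f035"))
# urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035
```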

3.2. Why URNs?

A big question is why it is not enough for us to use the IDs of Identities and Groups without this qualification. A few reasons which justify this choice:

  • Better for Humans. What is 46bd0f56-e24f-11e5-a510-131bef46955c? A Group, Identity, misplaced Endpoint ID? URNs help contextualize for humans. Knowing the type tells you which API to use to dereference an ID to an entity.

  • Unambiguous. Without type qualification, any given ID could refer to a number of different entity types. Without knowing the type, it is not always possible to deduce what an ID refers to. For example, given the ID of a deleted group, it may not be possible to know its type, as it can no longer be resolved by the Groups service.

  • Portable syntax. The syntax is recognizable and parseable across a broad range of services. We can now treat this as a global and uniform way of stringifying these identifiers.

  • Better for Audits. For logging, this gives a canonical string representation of these identifiers with their associated entity types. Logs can then be processed from a range of component services based on these URNs.

  • Flat and Simple Strings. Strings are the simplest, lowest-common-denominator serialization technique. We could use objects, like type: identity, value: 46bd0f56-e24f-11e5-a510-131bef46955c, but that then needs to be represented in different ways in different places (logs, memory, and databases, to start with). URNs are just strings, and look the same everywhere.

3.3. Case Sensitivity

Principal URNs are always returned as all-lowercase strings. They are considered case-insensitive on input, but we recommend sending them in lowercase to simplify any comparisons you might perform.
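A client that compares or stores Principal URNs may want to normalize them first. The following is a minimal sketch of a parser that lowercases its input and splits it into a type and an ID; `parse_principal_urn` and the type labels `"identity"` and `"group"` are names chosen for this example.

```python
PREFIXES = {
    "urn:globus:auth:identity:": "identity",
    "urn:globus:groups:id:": "group",
}


def parse_principal_urn(urn):
    """Return (type, id) for a Principal URN, ignoring case on input."""
    lowered = urn.lower()
    for prefix, kind in PREFIXES.items():
        # Require a non-empty ID after the prefix.
        if lowered.startswith(prefix) and len(lowered) > len(prefix):
            return kind, lowered[len(prefix):]
    raise ValueError(f"not a Principal URN: {urn!r}")


print(parse_principal_urn("URN:GLOBUS:GROUPS:ID:FDB38A24-03C1-11E3-86F7-12313809F035"))
# ('group', 'fdb38a24-03c1-11e3-86f7-12313809f035')
```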

3.4. Look Them Up

How can you look up these values?

3.4.1. Identities

To look up identities, you need to use the Globus Auth Identities API: https://docs.globus.org/api/auth/reference/#v2_api_identities_resources

A nice and easy way of doing interactive lookups is the Globus CLI:

$ # given urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c
$ globus get-identities 46bd0f56-e24f-11e5-a510-131bef46955c
globus@globus.org

$ # given urn:globus:auth:identity:c0a6b8ac-d274-11e5-bf7e-f33abd9d8cc8
$ globus get-identities c0a6b8ac-d274-11e5-bf7e-f33abd9d8cc8
demo@globus.org

3.4.2. Groups

Given a couple of Group URNs, the same principle applies

  • urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035

  • urn:globus:groups:id:fe234176-abe4-11e4-90a3-22000aa401f6

can be seen at

respectively.

There is no public API for fetching Group information.


© 2010- The University of Chicago