1. Globus Search API Overview
This document is intended to provide a high-level overview of the Globus Search (hereafter just "Search") API, introducing basic concepts and terminology.
This section should be sufficient to grasp the basic functioning of Search and how it can be used and applied to various use cases. It also discusses some of the more subtle design choices and their impact.
1.1. What is it for?
The Search API allows you to store metadata, set policy on its visibility and on who is allowed to modify it, and then retrieve that metadata through search queries.
[We intentionally keep the term "metadata" vague and generic. Search can store whatever data you want to make searchable, and it's up to your application to decide what is and is not useful to store.]
A few examples of the service’s capabilities include:
On documents describing tweets, sorting by number of retweets
On documents describing novels, filtering to those published in 1870 or 1890, but not the years in between
On documents describing clothing, counting the number of matching items which contain cotton, polyester, or both
On documents describing academic papers, counting those with Paul Erdos as an author, but not as the first author
Search is not "batteries included", however: in order for it to be able to do anything, you first need to provide it with your metadata.
1.2. Search Indices
A Search index is a place for you to store and search over your metadata. Every operation done against Search (storing metadata, setting permissions, and performing searches) is done with respect to a particular index.
Indices can be used to create logical groupings of metadata for your applications to use. They also provide a level of control and separation between sets of metadata, and can be used to enforce policies about who can write different datasets.
You might only have one index, or you might have one for testing and one for production data. You could want an index for each of your sources of metadata, or separate indices to describe data at separate stages in a workflow. The production vs. testing split is particularly powerful, and is in fact a pattern which we recommend. Using separate indices for production and development gives developers a high level of access when working on new features, while protecting production from potentially harmful updates and deletions.
You would not, however, typically use a separate index for public vs. private data: for that, you can employ Search’s visibility features.
1.3. Metadata Format
To use the Search API, it’s necessary to understand the two basic concepts which are used to break down metadata:
1.3.1. The Subject
Metadata stored in an index is meant to describe or annotate a particular entity a user may wish to locate or discover. We term that entity the subject of the metadata.
The principal representation of a search result is the subject, and typically, as in Google-style web searches, the subject represents a network accessible object. It therefore follows that each subject is expected to be a URL, and subjects are required to be unique per index.
We consider each subject to be an individual search result, and any metadata about the subject to be "reasons" why the subject was returned in a search or "descriptions" of that subject. As such, the subject is fundamental, and the data surrounding it is keyed off of it.
In structuring metadata for Search indexing, you should align the subject with the "most important" parts of your data — e.g. handles linking to papers are good subjects for an index of paper abstracts.
There is a possibility that the metadata surrounding subjects may be more important to your application than the subject itself. These are rare and unusual cases, but in these scenarios your subjects merely need to be identifiers for chunks of data.
1.3.2. The Entry
Even though a subject may only appear once per index, Search permits multiple pieces of metadata to be applied to that subject.
We refer to each of these metadata blocks as an entry. GMetaEntry, a term you will see often, is the API document type which represents an entry; the two are more or less synonymous.
The reasons for allowing multiple entries to reference a single subject are numerous:
Different users may provide distinct bits of metadata for the subject and we wish to permit them to do that independently of one another
Different sections of metadata may have different access control restrictions. Different entries may have different visibility rules even though they describe the same subject. Entries are therefore the smallest unit of visibility control
Entries may represent blocks of metadata which are tightly bound, so a search across fields should rank a result more highly if all of the matching field values are within the same entry
To expand upon the example of an index of papers, you may have a link to each paper as a subject, then store the abstract, full text, and references as separate entries.
Perhaps an automated process provides the Abstracts, but Full Text and References go through a manual approval process. This requires distinct users to be able to "say things" about the same paper
Abstracts and References may be public, while Full Text is restricted to specific users. This requires differential visibility control
Searches for "dialuminum trioxide" should more highly rank results with both "dialuminum" and "trioxide" in the Abstract, and give much lower ranking to a result with "dialuminum" in the Full Text and "trioxide" in the References.
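The ranking preference in the last point can be illustrated with a toy scoring function. This is only a sketch of the idea that matches within a single entry outweigh matches spread across entries; it is not how Search actually computes relevance, and the function name, the 0.1 weight, and the sample papers are all made up for illustration.

```python
from typing import Dict, List


def toy_score(entries: Dict[str, str], terms: List[str]) -> float:
    """Toy relevance score for one subject.

    Terms found together in a single entry count fully; terms that only
    match when several entries are combined count for much less.
    """
    terms = [t.lower() for t in terms]
    texts = [text.lower() for text in entries.values()]
    # Best number of query terms matched by any single entry.
    best_single_entry = max(
        (sum(1 for t in terms if t in text) for text in texts), default=0
    )
    # Number of query terms matched anywhere across the subject's entries.
    matched_anywhere = sum(1 for t in terms if any(t in text for text in texts))
    # 0.1 is an arbitrary illustrative weight for cross-entry matches.
    return best_single_entry + 0.1 * (matched_anywhere - best_single_entry)


query = ["dialuminum", "trioxide"]
paper_a = {"abstract": "Growth of dialuminum trioxide thin films"}
paper_b = {
    "full_text": "Several dialuminum precursors were compared",
    "references": "Smith, Trioxide chemistry, 1999",
}
print(toy_score(paper_a, query) > toy_score(paper_b, query))
```

Here paper_a matches both terms in its Abstract, while paper_b only matches them across two different entries, so paper_a scores higher.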
Entries need to be distinguished from one another in the context of a given subject. This is done via a user-supplied value, the entry ID.
Some notes about entry IDs:
Entry IDs are just strings. They don’t need to conform to any specific format, but they may not contain the character
Entry IDs must be unique per subject. Valid values for the Papers Index example include the simple strings "abstract", "full_text", and "references", because there will only be one entry for each of these in a given subject
Entry IDs may be null or omitted, and these are equivalent. The uniqueness rule still applies, so only one entry per subject may have a null ID, but this is great for cases in which there is only one entry per subject — the subject is then sufficient to uniquely specify its only entry
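The uniqueness rules for subjects and entry IDs can be sketched as a small in-memory model. This is purely an illustration, not a description of how Search stores data: the ToyIndex class, its methods, and the example subject URL are hypothetical, and the choice to overwrite an entry when the same subject and entry ID are written twice is an assumption of the sketch.

```python
from typing import Dict, Optional, Tuple


class ToyIndex:
    """Hypothetical in-memory stand-in for a Search index (illustration only)."""

    def __init__(self) -> None:
        # Each entry is keyed by (subject, entry_id); entry_id may be None.
        self._entries: Dict[Tuple[str, Optional[str]], dict] = {}

    def put_entry(
        self, subject: str, content: dict, entry_id: Optional[str] = None
    ) -> None:
        # ASSUMPTION: writing to an existing (subject, entry_id) pair replaces
        # that entry, which keeps the pair unique within the index.
        self._entries[(subject, entry_id)] = content

    def entries_for(self, subject: str) -> Dict[Optional[str], dict]:
        # All entries describing one subject, keyed by their entry ID.
        return {
            eid: content
            for (subj, eid), content in self._entries.items()
            if subj == subject
        }


index = ToyIndex()
paper = "https://example.org/papers/1234"  # hypothetical subject URL
index.put_entry(paper, {"text": "We study..."}, entry_id="abstract")
index.put_entry(paper, {"text": "[1] Smith 1999"}, entry_id="references")
index.put_entry(paper, {"note": "first null-ID entry"})  # entry_id omitted
index.put_entry(paper, {"note": "second write, same null ID"}, entry_id=None)
print(sorted(index.entries_for(paper), key=str))
```

Omitting the entry ID and passing None land on the same key, so the subject ends up with three entries: "abstract", "references", and a single null-ID entry.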
2. Globus Search API Usage
How to use the Globus Search API.
2.1. Communicating with Globus Search
Globus Search is an API which communicates purely over HTTP, and primarily uses a RESTful data model.
The Search API is reached at
URLs and URIs in this documentation will usually omit this piece, so the text
should be interpreted as referring to an HTTP GET request against
Globus Preview URL
In the Globus Preview environment, the Search API can be found at
2.2. Authentication & Authorization
Some features of Search don’t have any authorization or authentication requirements. These can be accessed over HTTPS with no credentials.
2.2.1. Authorization Header
Globus Search authorizes access with Globus Auth. It therefore requires that any authenticated calls be made with a bearer token in the Authorization header:
Authorization: Bearer <token_string>
The token string is an Access Token provided via Globus Auth.
Access tokens for Globus Search have one of the following scopes:
urn:globus:auth:scope:search.api.globus.org:ingest - The token authorizes the call to write data into Search
urn:globus:auth:scope:search.api.globus.org:search - The token authorizes the call to query data from Search
urn:globus:auth:scope:search.api.globus.org:all - The token authorizes both of the above activities
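As a minimal sketch, the header and scope rules above might be handled in client code like this. The helper names (bearer_headers, scope_allows) are hypothetical and not part of any Globus library; obtaining the token itself from Globus Auth is out of scope here.

```python
SEARCH_SCOPES = {
    "ingest": "urn:globus:auth:scope:search.api.globus.org:ingest",
    "search": "urn:globus:auth:scope:search.api.globus.org:search",
    "all": "urn:globus:auth:scope:search.api.globus.org:all",
}


def bearer_headers(token_string: str) -> dict:
    """Build the Authorization header for an authenticated call."""
    if not token_string:
        raise ValueError("an access token is required for authenticated calls")
    return {"Authorization": f"Bearer {token_string}"}


def scope_allows(token_scope: str, activity: str) -> bool:
    """Return True if a token with this scope covers 'ingest' or 'search'."""
    # The 'all' scope authorizes both activities.
    return token_scope in (SEARCH_SCOPES[activity], SEARCH_SCOPES["all"])


print(bearer_headers("<token_string>"))
print(scope_allows(SEARCH_SCOPES["search"], "ingest"))
```

The resulting dictionary can be passed as the headers of a request in whichever HTTP client you use.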
2.2.3. Auth Errors
If an invalid token is provided, or a call which requires authorization is
missing any token, Globus Search will return an HTTP
If the token is valid, but the call requires permissions which the user does
not have within Search, the API will return an HTTP
If a resource is considered highly sensitive, it is possible that improperly
authorized calls will return an HTTP
404 Not Found.
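A client will typically branch on the status code of a failed call. The sketch below assumes the conventional HTTP codes 401 Unauthorized (missing or invalid credentials) and 403 Forbidden (valid credentials, insufficient permission) for the first two cases; that is an assumption based on general HTTP semantics, not something stated in the text above, which names only 404 Not Found. The function name is hypothetical.

```python
def describe_auth_failure(status_code: int) -> str:
    """Map the HTTP status code of a failed call to a likely cause."""
    # ASSUMPTION: 401 and 403 follow standard HTTP semantics; they are not
    # taken from the Search documentation.
    if status_code == 401:
        return "token missing or invalid: obtain a new access token"
    if status_code == 403:
        return "token valid, but the user lacks permission within Search"
    if status_code == 404:
        # Either the resource does not exist, or it is sensitive and hidden.
        return "not found, or hidden because the call was not properly authorized"
    return "not an authorization failure covered by this sketch"


for code in (401, 403, 404):
    print(code, describe_auth_failure(code))
```

Note that a 404 is ambiguous by design: a caller cannot tell a hidden resource from a missing one.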
2.2.4. Permissions in Search
Globus Search has its own notions of permissions which are enforced after successful authorization. In particular, a user may be:
An admin of an index
A privileged_user (writer, but not admin) of an index
A member of the visible_to list of a document
These permissions are all evaluated against the linked identities and Globus Group memberships of a user.
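Conceptually, a visibility check compares the set of principals a user holds against a document's visible_to list. The following sketch illustrates that comparison only; Search performs the real evaluation on the server side. It assumes that visible_to holds Principal URNs (see Section 3), and the function name is hypothetical.

```python
from typing import Iterable


def can_view(visible_to: Iterable[str], user_principals: Iterable[str]) -> bool:
    """Return True if any of the user's principals is in the visible_to list."""
    # Compare in lowercase so differently-cased URNs still match.
    allowed = {p.lower() for p in visible_to}
    return any(p.lower() in allowed for p in user_principals)


# A user's principals: one URN per linked identity. URNs for the user's
# Group memberships would be added to this set in the same way.
user = {
    "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c",
    "urn:globus:auth:identity:c0a6b8ac-d274-11e5-bf7e-f33abd9d8cc8",
}
entry_visible_to = [
    "urn:globus:auth:identity:c0a6b8ac-d274-11e5-bf7e-f33abd9d8cc8",
]
print(can_view(entry_visible_to, user))
```

Because linked identities are all considered, the user above can see the entry even though only their second identity is listed.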
3. Principal URNs
Principal URNs are a way of phrasing Globus Identities and Globus Groups as URNs.
The major improvement offered by the URN syntax is that it unambiguously associates a value with the correct type. That means that Group IDs are labeled as Groups, and Identity IDs are labeled as Identities.
3.1. What’s it look like?
Let’s jump in with some examples:
Prefix Identity IDs with
urn:globus:auth:identity: and Group IDs with
3.2. Why URNs?
A big question is why it is not enough for us to use the IDs of Identities and Groups without this qualification. A few reasons which justify this choice:
Better for Humans. What is 46bd0f56-e24f-11e5-a510-131bef46955c? A Group, an Identity, a misplaced Endpoint ID? URNs help contextualize for humans. Knowing the type tells you which API to use to dereference an ID to an entity.
Unambiguous. Without type qualification, any given ID could refer to a number of different entity types. Without knowing the type, it is not always possible to deduce what an ID refers to. For example, given the ID of a deleted group, it may not be possible to know its type, as it can no longer be resolved by the Groups service.
Portable syntax. The syntax is recognizable and parseable across a broad range of services. We can now treat this as a global and uniform way of stringifying these identifiers.
Better for Audits. For logging, this gives a canonical string representation of these identifiers with their associated entity types. Logs can then be processed from a range of component services based on these URNs.
Flat and Simple Strings. Strings are the simplest, lowest-common-denominator serialization technique. We could use objects, like {"type": "identity", "value": "46bd0f56-e24f-11e5-a510-131bef46955c"}, but that then needs to be represented in different ways in different places (logs, memory, and databases, to start with). URNs are just strings, and look the same everywhere.
3.3. Case Sensitivity
Principal URNs are always returned as all-lowercase strings. They are considered case-insensitive on input, but we recommend sending them in lowercase to simplify any comparisons you might perform.
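A small sketch of building and comparing Identity URNs in a case-insensitive way. The helper names are hypothetical, and the code assumes Identity IDs are UUIDs, as in the examples in this section.

```python
import uuid

IDENTITY_PREFIX = "urn:globus:auth:identity:"


def identity_urn(identity_id: str) -> str:
    """Build a lowercase Identity URN from an Identity ID."""
    # ASSUMPTION: Identity IDs are UUIDs. uuid.UUID() rejects malformed
    # values, and str() renders the UUID in lowercase.
    return IDENTITY_PREFIX + str(uuid.UUID(identity_id))


def same_principal(urn_a: str, urn_b: str) -> bool:
    """Compare two Principal URNs without regard to case."""
    return urn_a.lower() == urn_b.lower()


urn = identity_urn("46BD0F56-E24F-11E5-A510-131BEF46955C")
print(urn)
print(same_principal(urn, urn.upper()))
```

Normalizing to lowercase before storing or comparing means your own values will match what the API returns.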
3.4. Look Them Up
How can you look up these values?
To look up identities, you need to use the Globus Auth Identities API: https://docs.globus.org/api/auth/reference/#v2_api_identities_resources
A nice and easy way of doing interactive lookups is the Globus CLI:
$ # given urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c
$ globus get-identities 46bd0f56-e24f-11e5-a510-131bef46955c
email@example.com
$ # given urn:globus:auth:identity:c0a6b8ac-d274-11e5-bf7e-f33abd9d8cc8
$ globus get-identities c0a6b8ac-d274-11e5-bf7e-f33abd9d8cc8
firstname.lastname@example.org
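If you are scripting these lookups, the ID has to be pulled out of the URN before it is handed to the CLI. A minimal sketch in Python follows; the helper name is hypothetical, and actually running the resulting command assumes the Globus CLI is installed and logged in.

```python
from typing import List

IDENTITY_URN_PREFIX = "urn:globus:auth:identity:"


def get_identities_command(urn: str) -> List[str]:
    """Turn an Identity URN into the matching 'globus get-identities' command."""
    lowered = urn.strip().lower()
    if not lowered.startswith(IDENTITY_URN_PREFIX):
        raise ValueError(f"not an Identity URN: {urn!r}")
    identity_id = lowered[len(IDENTITY_URN_PREFIX):]
    return ["globus", "get-identities", identity_id]


cmd = get_identities_command(
    "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"
)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to perform the lookup.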
Given a couple of Group URNs, the same principle applies
can be seen at
There is no public API for fetching Group information.