apis

The apis package contains a collection of tools for working with non-SHEPHARD format files. In particular, SHEPHARD uses FASTA files for sequence storage, which is not a format SHEPHARD has control over, so interaction with FASTA files occurs via APIs.

We anticipate the number of api modules to remain small, but for some tools that the lab has control over, direct interaction via an API makes sense. In general, we use interfaces for interacting with data to ensure weak coupling between SHEPHARD and other tools.

fasta

A non-UniProt FASTA file can be read in using the fasta module.

fasta_to_proteome(filename, proteome=None, build_unique_ID=None, build_attributes=None, use_header_as_unique_ID=False, force_overwrite=False, invalid_sequence_action='fail')

Stand alone function that allows the user to build a Proteome from a standard FASTA file, or add sequences in a FASTA file to an existing Proteome.

This function can be used to read additional sequences into an existing Proteome object, or create a new Proteome object from a FASTA file. In addition, some control over how invalid sequences are dealt with is provided by the invalid_sequence_action flag.

The input filename must be a FASTA file without duplicate headers. If the file has duplicate headers and these need further processing, we suggest first parsing the FASTA file with the protfasta (https://protfasta.readthedocs.io/) package to create a sanitized input FASTA.
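Before parsing, it can help to confirm that a file really has no duplicate headers. The following is a minimal plain-Python sketch that is independent of both SHEPHARD and protfasta; the function name find_duplicate_headers is hypothetical, and it assumes a header is the full text after `>` on a record line.

```python
from collections import Counter


def find_duplicate_headers(lines):
    """Return a sorted list of FASTA headers that occur more than once.

    `lines` is any iterable of strings, for example an open file handle.
    """
    # Header lines start with '>'; everything after it is the header text.
    headers = [line[1:].strip() for line in lines if line.startswith(">")]
    counts = Counter(headers)
    return sorted(header for header, n in counts.items() if n > 1)


# Small in-memory example standing in for a FASTA file.
example = [
    ">protein_A\n", "MKVLAA\n",
    ">protein_B\n", "MSTNPK\n",
    ">protein_A\n", "MKVLAG\n",
]
duplicates = find_duplicate_headers(example)
print(duplicates)
```

An empty result suggests the file can be passed on directly; otherwise the listed headers are the ones to resolve first.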

The FASTA file is parsed into a set of proteins, each of which has (1) a unique ID, (2) a name, (3) a sequence, and (4) optionally, a dictionary of attributes.

The protein name is defined as the full FASTA header, and the sequence is taken from the FASTA record sequence. Sequence validation is also provided at the point of file-parsing. The unique ID and attributes are discussed below.

Each protein in a Proteome must have a unique_ID associated with it. There are three ways a FASTA file can be used to generate a unique ID:

By using the FASTA header as a unique ID, although this fails if there are non-unique FASTA headers.

By parsing the FASTA header, to extract out a unique ID. For example, FASTA files generated by other databases often include unique identifiers in a structured way, which could be extracted in a consistent manner for every FASTA record.

By using an automatically incremented numerical ID, removing any dependence on the FASTA header itself.

These options can be selected using the flags provided in the function signature. Note that if both using the FASTA header directly and parsing the FASTA header are selected, an exception will be raised, as only one of these two can be requested at a time.

By default, a numerically unique value is used. Note that if you are reading FASTA files generated by UniProt, we recommend the apis.uniprot functions instead of the more generic apis.fasta functions.
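The header-parsing option can be sketched with a small callback. The example below assumes a hypothetical header layout of the form `ID|description`; the names id_from_header and load_with_parsed_ids are illustrative, and the `shephard.apis` import path is an assumption based on the package name used on this page.

```python
def id_from_header(header: str) -> str:
    """Return the text before the first '|' as the unique ID.

    The 'ID|description' layout is a hypothetical example, not a standard.
    """
    unique_id = header.split("|", 1)[0].strip()
    if not unique_id:
        raise ValueError(f"No ID found in header: {header!r}")
    return unique_id


def load_with_parsed_ids(filename):
    """Build a Proteome using id_from_header to derive each unique_ID."""
    # Assumed import path; imported lazily so the callback can be tested alone.
    from shephard.apis import fasta

    return fasta.fasta_to_proteome(filename, build_unique_ID=id_from_header)


print(id_from_header("ENSP0001|kinase domain protein"))
```

Raising an error on a header with no usable ID keeps a malformed record from silently receiving an empty identifier.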

Protein attributes can in principle be parsed out of the FASTA header by providing a build_attributes() function. However, we would in general suggest that the better approach is to annotate a Proteome with attributes and then save those using the associated interfaces.si_protein_attributes functionality.
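As an illustration of what a build_attributes callback might look like, the sketch below assumes a hypothetical header in which metadata appears as whitespace-separated `key=value` tokens. The function name attributes_from_header and the header layout are assumptions made for the example.

```python
def attributes_from_header(header: str) -> dict:
    """Collect key=value tokens from a FASTA header into a dictionary.

    Tokens without '=' (such as the leading identifier) are skipped.
    All values are kept as strings.
    """
    attributes = {}
    for token in header.split():
        key, separator, value = token.partition("=")
        if separator and key and value:
            attributes[key] = value
    return attributes


print(attributes_from_header("protein_7 organism=yeast length=12"))
```

A function of this shape takes the header as a string and returns a dictionary, which is the contract described for build_attributes.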

Parameters:

filename (string) – Name of the FASTA file we’re going to parse in.
proteome (Proteome (default = None)) – If a Proteome object is provided the FASTA file will be read and added to the existing proteome, whereas if set to None a new Proteome will be generated.
build_unique_ID (funct (default = None)) –
This parameter allows a user-defined function that is used to convert the FASTA header to a (hopefully) unique string. This can be useful if the FASTA header is well structured and includes a specific, useful unique ID that can be used as the unique_ID.

Specifically, the build_unique_ID function should take in a str (a FASTA header) and return a string which will be used as a unique ID.
build_attributes (funct (default = None)) – This parameter allows a user-defined function that allows meta-information from the FASTA header to be converted into protein attributes. Specifically, build_attributes should be a function which takes in the FASTA header as a string and returns a dictionary where key:value pairs are assigned as protein attributes. This can be useful if the FASTA header is well-structured.
use_header_as_unique_ID (bool (default = False)) – If this flag is set to true, it means the unique_ID is set to the FASTA file header. If non-unique headers are found this will trigger an exception.
force_overwrite (bool (default = False)) – If this flag is set to true and we encounter a unique_ID that is already in the proteome, the newer value overwrites the older one. This is mostly useful if you are adding in a file with known duplicate entries OR combining multiple FASTA files where you know there are some duplicates. Important - if we’re building unique IDs based on numerical record indices then EVERY FASTA entry will be given a unique_ID (meaning force_overwrite is irrelevant in this case).
invalid_sequence_action (str (default = 'fail')) –
Selector which defines the behaviour if a sequence with a non-standard amino acid is encountered. Valid options and their meaning are listed below:
- ignore - invalid sequences are completely ignored.
- fail - invalid sequences cause parsing to fail and throw an exception.
- remove - invalid sequences are removed.
- convert - invalid residues are converted to valid residues.
- convert-ignore - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored.

Returns:

Returns an initialized Proteome object

Return type:

Proteome

proteome_to_fasta(filename, proteome, include_attributes_in_header=False)

Stand alone function that allows the user to write a SHEPHARD-specific FASTA file from a Proteome object.

WARNING: The support for protein attributes in FASTA files is included mainly for easy sharing of FASTA files that are usable outside of SHEPHARD. We recommend using interfaces.si_protein_attributes functions for dealing with Protein attributes.

Parameters:

filename (str) – Name of the FASTA file we’re going to write to. We will automatically overwrite a file if it’s there, so be careful! Note that no extension is added, in part because FASTA files can be .f/.fa/.fasta. We recommend a .fasta file extension.
proteome (Proteome) – The proteome object that will be written to disk
include_attributes_in_header (bool (default = False)) – Flag which, if set to true, means each Protein’s attributes will be included in the FASTA header. We generally do not recommend this, other than when sharing annotated FASTA files outside of a SHEPHARD ecosystem would be useful.

Returns:

No return object but a new file will be written

Return type:

None
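Because proteome_to_fasta overwrites an existing file and does not add an extension, a small guard around the output path can be useful. The sketch below is one possible approach; safe_fasta_path and write_proteome are hypothetical helper names, and the `shephard.apis` import path is an assumption.

```python
import os


def safe_fasta_path(filename: str, default_extension: str = ".fasta") -> str:
    """Return a path to write to, refusing to clobber an existing file.

    If `filename` has no extension, `default_extension` is appended.
    """
    root, extension = os.path.splitext(filename)
    if not extension:
        filename = root + default_extension
    if os.path.exists(filename):
        raise FileExistsError(f"{filename} already exists; refusing to overwrite")
    return filename


def write_proteome(proteome, filename):
    """Write a Proteome to disk only if the target file does not yet exist."""
    # Assumed import path; imported lazily so the guard can be used on its own.
    from shephard.apis import fasta

    fasta.proteome_to_fasta(safe_fasta_path(filename), proteome)
```

Existing extensions such as .fa are left untouched, so the guard only fills in .fasta when none was given.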

uniprot

The uniprot module provides tools for working with uniprot data. Right now, only an automatic uniprot FASTA file parser is in place, but over time we plan to add more generic file I/O for uniprot derived files, given the robustness and broad usership.

uniprot_fasta_to_proteome(filename, proteome=None, force_overwrite=False, invalid_sequence_action='fail')

Stand alone function that allows the user to build a proteome from a standard FASTA file downloaded from UniProt.

This function assumes the UniProt-standard format for the FASTA header has been maintained, i.e.

>>> >xx|ACCESSION|xxxx

Where ACCESSION is the UniProt accession and will be used as the unique_ID.
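To make the header convention concrete, the sketch below pulls the accession out of a header of this form. It mirrors the format described here rather than SHEPHARD's internal parser; the function name accession_from_uniprot_header is hypothetical.

```python
def accession_from_uniprot_header(header: str) -> str:
    """Return the second '|'-delimited field of a UniProt-style header.

    Expects the layout xx|ACCESSION|xxxx, with or without a leading '>'.
    """
    fields = header.lstrip(">").split("|")
    if len(fields) < 3 or not fields[1]:
        raise ValueError(f"Header is not in xx|ACCESSION|xxxx form: {header!r}")
    return fields[1]


print(accession_from_uniprot_header(">sp|P04637|P53_HUMAN Cellular tumor antigen p53"))
```

A header without two pipe characters raises an error, which is a reasonable signal that the generic fasta module may be the better fit for that file.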

Parameters:

filename (string) – Name of the FASTA file we’re going to parse in. Note the protein name will be defined as the full FASTA header for each entry.
proteome (Proteome) – If a Proteome object is provided the FASTA file will be read and added to the existing proteome, whereas if set to None a new Proteome will be generated.
force_overwrite (bool (default = False)) – If this flag is set to true and we encounter a unique_ID that is already in the proteome, the newer value overwrites the older one. This is mostly useful if you are adding in a file with known duplicate entries OR combining multiple FASTA files where you know there are some duplicates. Important - if we’re building unique IDs based on numerical record indices then EVERY FASTA entry will be given a unique_ID (meaning force_overwrite is irrelevant in this case).
invalid_sequence_action (str (default = 'fail')) –
Selector which defines the behaviour if a sequence with a non-standard amino acid is encountered. Valid options and their meaning are listed below:
- ignore - invalid sequences are completely ignored
- fail - invalid sequences cause parsing to fail and throw an exception
- remove - invalid sequences are removed
- convert - invalid residues are converted to valid residues
- convert-ignore - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored.

Returns:

Returns an initialized Proteome object

Return type:

Proteome

uniprot_proteome_to_fasta(filename, proteome)

Stand alone function that allows the user to write a FASTA file from a Proteome under the assumption that the Proteome was built from a uniprot FASTA.

Practically, this just means that the Protein.name variable is used for the FASTA header, although the function will fail if duplicate headers are found.

Parameters:

filename (string) – Name of the FASTA file we’re going to write sequences to
proteome (Proteome) – The Proteome object from which the FASTA file will be generated

Returns:

No return variable but will write to file

Return type:

None

metapredict_api

The metapredict module provides tools for annotating proteome-scale information using metapredict. This depends on having metapredict V2-FF available, which is not defined as a hard requirement, but, if used, enables the annotation of entire proteomes in terms of IDRs and disorder scores with a single function.

annotate_proteome_with_disorder_track(proteome, name='disorder', device=None, version=3, show_progress_bar=True, safe=True)

Function that annotates a proteome with disorder Tracks for every protein.

By default, disorder Tracks are named ‘disorder’, although this can be changed by setting the name parameter.

Disorder prediction uses the batch mode in metapredict, which leverages parallel predictions automatically on GPUs or CPUs. However, if a specific device is requested, this can be passed via the device parameter.

Parameters:

proteome (shephard.proteome.Proteome) – Proteome object to be annotated.
name (str) – Name of the Track added to each Protein. Default = ‘disorder’
device (int or str) –
Identifier for the device to be used for predictions. Possible inputs: ‘cpu’, ‘mps’, ‘cuda’, or an int that corresponds to the index of a specific cuda-enabled GPU. If ‘cuda’ is specified and cuda.is_available() returns False, instead of falling back to CPU, metapredict will raise an Exception so you know that you are not using CUDA as you were expecting. Default: None

When set to None, we will check if there is a cuda-enabled GPU. If there is, we will try to use that GPU. If you set the value to be an int, we will use cuda:int as the device where int is the int you specify. The GPU numbering is 0 indexed, so 0 corresponds to the first GPU, and so on. Only specify this if you know which GPU you want to use. * Note: MPS is only supported in Pytorch 2.1 or later. MPS is still fairly new, so use it at your own risk.
version (int) – Defines the metapredict version to use (must be one of 1, 2 or 3).
show_progress_bar (bool) – Flag which, if set to True, means a progress bar is printed as predictions are made, while if False no progress bar is printed. Default = True
safe (bool) – Flag which, if set to False, means the function overwrites existing tracks and domains if present. If True, overwriting will trigger an exception. Default = True.

Returns:

No return type, but the Protein objects in the Proteome will be annotated with per-residue disorder Tracks.

Return type:

None
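The rules described for the device parameter can be summarised as a small decision function. This is only an illustrative sketch of the documented behaviour, not metapredict's implementation: resolve_device and its cuda_available argument are hypothetical, and the fallback to 'cpu' when no GPU is found is an assumption that the text above does not state explicitly.

```python
def resolve_device(device=None, cuda_available=False):
    """Map a requested device to a device string, following the rules above.

    `cuda_available` stands in for a real CUDA availability check.
    """
    if device is None:
        # Prefer a CUDA GPU when one exists; the CPU fallback is an assumption.
        return "cuda" if cuda_available else "cpu"
    if isinstance(device, int):
        # Integer indices are 0-based and select a specific CUDA GPU.
        return f"cuda:{device}"
    if device == "cuda" and not cuda_available:
        # Explicit CUDA requests fail loudly rather than falling back to CPU.
        raise RuntimeError("CUDA was requested but is not available")
    if device in ("cpu", "mps", "cuda"):
        return device
    raise ValueError(f"Unrecognised device: {device!r}")


print(resolve_device(None, cuda_available=False))
print(resolve_device(1, cuda_available=True))
```

The same rules apply to the device parameter of the other metapredict_api functions, since their parameter descriptions are identical.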

annotate_proteome_with_disordered_domains(proteome, name='IDR', disorder_threshold=0.5, annotate_folded_domains=False, folded_domain_name='FD', device=None, version=3, show_progress_bar=True, safe=True)

Function that annotates a proteome with disordered Domains (IDRs) for every protein.

By default, disordered Domains are named as ‘IDR’s, although this can be changed by setting the name parameter.

In addition, if requested, folded domains can also be annotated as those domains which are not IDRs. These folded domains are named ‘FD’s by default, although this can be changed by setting the folded_domain_name parameter.

Disorder prediction uses the batch mode in metapredict, which leverages parallel predictions automatically on GPUs or CPUs. However, if a specific device is requested, this can be passed via the device parameter.

Parameters:

proteome (shephard.proteome.Proteome) – Proteome object to be annotated.
name (str) – Name to give IDR domains.
disorder_threshold (float) – Threshold to be used to define IDRs by the metapredict domain decomposition algorithm. The default is 0.5, and we strongly recommend sticking with this value.
annotate_folded_domains (bool) – Flag which, if included, means we ALSO annotate the regions that are not IDRs as ‘FD’ (folded domains), where the name can be changed using the folded_domain_name variable. Default = False
folded_domain_name (str) – String used to name Folded Domains. Only relevant if annotate_folded_domains is set to True. Default = ‘FD’
device (int or str) –
Identifier for the device to be used for predictions. Possible inputs: ‘cpu’, ‘mps’, ‘cuda’, or an int that corresponds to the index of a specific cuda-enabled GPU. If ‘cuda’ is specified and cuda.is_available() returns False, instead of falling back to CPU, metapredict will raise an Exception so you know that you are not using CUDA as you were expecting. Default: None

When set to None, we will check if there is a cuda-enabled GPU. If there is, we will try to use that GPU. If you set the value to be an int, we will use cuda:int as the device where int is the int you specify. The GPU numbering is 0 indexed, so 0 corresponds to the first GPU and so on. Only specify this if you know which GPU you want to use. * Note: MPS is only supported in Pytorch 2.1 or later. MPS is still fairly new, so use it at your own risk.
version (int) – Defines the metapredict version to use (must be one of 1, 2 or 3).
show_progress_bar (bool) – Flag which, if set to True, means a progress bar is printed as predictions are made, while if False no progress bar is printed. Default = True
safe (bool) – Flag which, if set to False, means the function overwrites existing tracks and domains if present. If True, overwriting will trigger an exception. Default = True.

Returns:

No return type, but the Protein objects in the Proteome will be annotated with disordered Domain annotations.

Return type:

None
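To give a feel for how a threshold separates IDRs from folded domains, the sketch below splits a list of per-residue scores into contiguous runs above and below a cutoff. This is a deliberately naive illustration and not metapredict's domain decomposition algorithm, which is more involved; the function name threshold_regions, the use of `>=` at the cutoff, and the 0-based half-open coordinates are all assumptions made for the example.

```python
def threshold_regions(scores, threshold=0.5):
    """Split scores into contiguous disordered and folded runs.

    Returns (idrs, folded), each a list of (start, end) index pairs that are
    0-based and half-open, i.e. residues start .. end - 1.
    """
    idrs, folded = [], []
    start = 0
    for i in range(1, len(scores) + 1):
        # Close the current run at the end of the sequence, or wherever the
        # disordered/ordered state differs from the state at the run's start.
        at_end = i == len(scores)
        if at_end or (scores[i] >= threshold) != (scores[start] >= threshold):
            target = idrs if scores[start] >= threshold else folded
            target.append((start, i))
            start = i
    return idrs, folded


idrs, folded = threshold_regions([0.9, 0.8, 0.2, 0.1, 0.7])
print(idrs, folded)
```

Every residue falls into exactly one region, which corresponds to the idea that folded domains are the regions that are not IDRs.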

annotate_proteome_with_disorder_tracks_and_disordered_domains(proteome, track_name='disorder', domain_name='IDR', disorder_threshold=0.5, annotate_folded_domains=False, folded_domain_name='FD', device=None, version=3, show_progress_bar=True, safe=True)

Function that annotates a proteome with disorder Tracks and disorder Domains for every protein.

By default, disorder Tracks are named ‘disorder’, although this can be changed by setting the track_name parameter.

By default, disordered Domains are named as ‘IDR’s, although this can be changed by setting the domain_name parameter.

In addition, if requested, folded domains can also be annotated as those domains which are not IDRs. These folded domains are named ‘FD’s by default, although this can be changed by setting the folded_domain_name parameter.

Disorder prediction uses the batch mode in metapredict, which leverages parallel predictions automatically on GPUs or CPUs. However, if a specific device is requested, this can be passed via the device parameter.

Parameters:

proteome (shephard.proteome.Proteome) – Proteome object to be annotated.
track_name (str) – Name of the Track added to each Protein. Default = ‘disorder’
domain_name (str) – Name of the Domain added to each Protein. Default = ‘IDR’
disorder_threshold (float) – Threshold to be used to define IDRs by the metapredict domain decomposition algorithm. The default is 0.5, and we strongly recommend sticking with this value.
annotate_folded_domains (bool) – Flag which, if included, means we ALSO annotate the regions that are not IDRs as ‘FD’ (folded domains), where the name can be changed using the folded_domain_name variable. Default = False
folded_domain_name (str) – String used to name Folded Domains. Only relevant if annotate_folded_domains is set to True. Default = ‘FD’
device (int or str) –
Identifier for the device to be used for predictions. Possible inputs: ‘cpu’, ‘mps’, ‘cuda’, or an int that corresponds to the index of a specific cuda-enabled GPU. If ‘cuda’ is specified and cuda.is_available() returns False, instead of falling back to CPU, metapredict will raise an Exception so you know that you are not using CUDA as you were expecting. Default: None

When set to None, we will check if there is a cuda-enabled GPU. If there is, we will try to use that GPU. If you set the value to be an int, we will use cuda:int as the device where int is the int you specify. The GPU numbering is 0 indexed, so 0 corresponds to the first GPU and so on. Only specify this if you know which GPU you want to use. * Note: MPS is only supported in Pytorch 2.1 or later. MPS is still fairly new, so use it at your own risk.
version (int) – Defines the metapredict version to use (must be one of 1, 2 or 3).
show_progress_bar (bool) – Flag which, if set to True, means a progress bar is printed as predictions are made, while if False no progress bar is printed. Default = True
safe (bool) – Flag which, if set to False, means the function overwrites existing tracks and domains if present. If True, overwriting will trigger an exception. Default = True.

Returns:

No return type, but the Protein objects in the Proteome will be annotated with per-residue disorder Tracks and disordered Domain annotations.

Return type:

None

albatross_api

The ALBATROSS API interfaces with the radius of gyration (Rg) and end-to-end distance (Re) predictions provided by ALBATROSS (Lotthammer et al. Nat. Meth. 2024). This depends on sparrow being installed, but enables you to annotate sequences with predicted Rg and Re values at either the protein level or the domain level.

annotate_proteome_with_dimensions(proteome, rg_name='rg', re_name='re', gpuid=0, show_progress_bar=True, batch_mode=None, safe=True)

Function that annotates every protein in a proteome with its predicted radius of gyration (rg) and end-to-end distance (re).

By default, rg and re are added as attributes to each Protein, with the names ‘rg’ and ‘re’ respectively. However, this can be changed by setting the rg_name and re_name parameters.

Dimension prediction uses the batch mode in sparrow, which leverages parallel predictions automatically on GPUs or CPUs. However, if a specific device is requested, this can be passed via the gpuid parameter.

Parameters:

proteome (shephard.proteome.Proteome) – Proteome object to be annotated.
rg_name (str) – Name of the rg attribute added to each Protein.
re_name (str) – Name of the re attribute added to each Protein.
gpuid (int) – Identifier for the GPU being requested. Note that if this is left unset the code will use the first GPU available and if none is available will default back to CPU; in general, it is recommended not to try and set this unless there’s a specific reason why a specific GPU should be used. Default = 0.
show_progress_bar (bool) – Flag which, if set to True, means a progress bar is printed as predictions are made, while if False no progress bar is printed. Default = True
safe (bool) – Flag which, if set to False, means the function overwrites existing tracks and domains if present. If True, overwriting will trigger an exception. Default = True.

Returns:

No return type, but the Protein objects in the Proteome will be annotated with predicted rg and re attributes.

Return type:

None

annotate_domains_with_dimensions(proteome, domain_type, rg_name='rg', re_name='re', gpuid=0, show_progress_bar=True, batch_mode=None, safe=True)

Function that annotates every domain matching the domain_type in a proteome with its predicted radius of gyration (rg) and end-to-end distance (re).

By default, rg and re are added as attributes to each Domain, with the names ‘rg’ and ‘re’ respectively. However, this can be changed by setting the rg_name and re_name parameters.

Dimension prediction uses the batch mode in sparrow, which leverages parallel predictions automatically on GPUs or CPUs. However, if a specific device is requested, this can be passed via the gpuid parameter.

Parameters:

proteome (shephard.proteome.Proteome) – Proteome object to be annotated.
domain_type (str) – Type of the domain to be annotated.
rg_name (str) – Name of the rg attribute added to each Domain.
re_name (str) – Name of the re attribute added to each Domain.
gpuid (int) – Identifier for the GPU being requested. Note that if this is left unset the code will use the first GPU available and if none is available will default back to CPU; in general, it is recommended not to try and set this unless there’s a specific reason why a specific GPU should be used. Default = 0.
show_progress_bar (bool) – Flag which, if set to True, means a progress bar is printed as predictions are made, while if False no progress bar is printed. Default = True
safe (bool) – Flag which, if set to False, means the function overwrites existing tracks and domains if present. If True, overwriting will trigger an exception. Default = True.

Returns:

No return type, but the matching Domain objects in the Proteome will be annotated with predicted rg and re attributes.

Return type:

None
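The metapredict_api and albatross_api functions can be chained, for example to predict dimensions for every IDR in a proteome. The sketch below shows one way to arrange this. The helper names annotate_idr_dimensions and run_with_shephard are hypothetical, the `shephard.apis` import path is an assumption, and the sketch assumes that the name given to IDR domains is the value domain_type matches on. The API modules are passed in as arguments so that the call order can be checked with stand-in objects, without metapredict or sparrow installed.

```python
from types import SimpleNamespace


def annotate_idr_dimensions(proteome, metapredict_api, albatross_api, idr_name="IDR"):
    """Annotate IDRs, then annotate those IDRs with predicted rg and re."""
    metapredict_api.annotate_proteome_with_disordered_domains(proteome, name=idr_name)
    albatross_api.annotate_domains_with_dimensions(proteome, idr_name)


def run_with_shephard(proteome):
    """Run the pipeline with the real modules (requires metapredict and sparrow)."""
    # Assumed import path for the two API modules.
    from shephard.apis import albatross_api, metapredict_api

    annotate_idr_dimensions(proteome, metapredict_api, albatross_api)


# Dry run with stand-in modules that only record which calls were made.
calls = []
fake_metapredict = SimpleNamespace(
    annotate_proteome_with_disordered_domains=lambda proteome, name: calls.append(("domains", name))
)
fake_albatross = SimpleNamespace(
    annotate_domains_with_dimensions=lambda proteome, domain_type: calls.append(("dimensions", domain_type))
)
annotate_idr_dimensions(object(), fake_metapredict, fake_albatross)
print(calls)
```

The order matters: the domains must exist before annotate_domains_with_dimensions can find them.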