Tools

SHEPHARD comes with a collection of tools for working with the various

Format for function signatures

Domain tools functions

Domain-associated functions are located in the shephard.tools.domain_tools module. These tools enable analysis, construction, and manipulation of Domain objects.

domain_overlap(domain_1, domain_2, check_origin=True)

Given two domains, asks if their boundaries overlap. By default this expects the two domains to be from the same protein and checks that this is the case. To skip this check, set check_origin to False.

Parameters:
  • domain_1 (shephard.domain.Domain) – The first domain object of interest

  • domain_2 (shephard.domain.Domain) – The second domain object of interest

  • check_origin (bool) – Flag that if set to True will cause an exception if domain_1 and domain_2 are from different proteins. If set to False, no such sanity checks are performed.

Returns:

Returns True if the two domains overlap, else returns False.

Return type:

boolean

domain_overlap_fraction(domain_1, domain_2, check_origin=True)

Given two domains, asks what fraction of the shorter domain overlaps with the longer one.

Parameters:
  • domain_1 (shephard.domain.Domain) – The first domain object of interest

  • domain_2 (shephard.domain.Domain) – The second domain object of interest

  • check_origin (bool) – Flag that if set to True will cause an exception if domain_1 and domain_2 are from different proteins. If set to False, no such sanity checks are performed.

Returns:

Returns a float between 0 and 1 that corresponds to what fraction of the shorter domain overlaps with the longer domain.

Return type:

float
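The calculation can be illustrated on plain integer boundaries. The sketch below is not SHEPHARD's implementation: the helper name overlap_fraction is hypothetical, and it assumes inclusive start and end positions.

```python
def overlap_fraction(start1, end1, start2, end2):
    """Fraction of the shorter region that is covered by the longer one.

    Assumes inclusive boundaries, so a region 5-10 spans 6 residues.
    """
    len1 = end1 - start1 + 1
    len2 = end2 - start2 + 1

    # number of positions shared by both regions (0 if they do not overlap)
    shared = max(0, min(end1, end2) - max(start1, start2) + 1)

    return shared / min(len1, len2)


# 5 shared residues (6-10) over the shorter, 10-residue region
print(overlap_fraction(1, 10, 6, 25))   # 0.5
```

A shorter region that sits entirely inside the longer one gives 1.0, and two regions that do not overlap give 0.0.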

domain_overlap_by_position(boundary_start1, boundary_end1, boundary_start2, boundary_end2)

Given the start and end positions of two domains (four positions in total), this function asks if their boundaries overlap.

Parameters:
  • boundary_start1 (int) – Position of domain 1 start

  • boundary_end1 (int) – Position of domain 1 end

  • boundary_start2 (int) – Position of domain 2 start

  • boundary_end2 (int) – Position of domain 2 end

Returns:

Returns True if the two domains overlap, else returns False.

Return type:

boolean
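A position-based overlap test reduces to a single comparison. The sketch below uses a hypothetical helper name (regions_overlap) and assumes inclusive boundaries, so two regions that share a single position count as overlapping; it is an illustration rather than the library's own code.

```python
def regions_overlap(start1, end1, start2, end2):
    """Return True if the two regions share at least one position.

    Assumes inclusive boundaries and start <= end for each region.
    """
    # two intervals overlap exactly when each one starts
    # no later than the other one ends
    return start1 <= end2 and start2 <= end1


print(regions_overlap(10, 50, 40, 80))   # True
print(regions_overlap(10, 50, 51, 80))   # False
```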

build_missing_domains(protein, new_domain_type='missing')

Function which takes a protein and builds a set of domains that represent the “empty spaces”. Domains are returned as a list of domain dictionaries which can be added to a protein via the add_domains() function.

This tool is stateless, i.e. it does not alter the passed protein but instead only generates a list of domain dictionaries which could then be added to the protein.

One could always combine this directly with the add_domains() function into a single line - e.g.

# this line will automatically add all the missing regions as
# 'missing' domains to the protein
protein.add_domains(build_missing_domains(protein))

Parameters:
  • protein (shephard.protein.Protein object) – Protein object for which the missing domains are built

  • new_domain_type (str (default = 'missing')) – Name to assign to the ‘empty’ domains.

Returns:

Returns a list of domain dictionaries which can be then parsed or added to a protein via the add_domains() function.

Return type:

list of domain dictionaries
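The idea of filling the "empty spaces" can be sketched without any SHEPHARD objects. In the example below, the helper name missing_regions, the (start, end) tuple input, and the 1-indexed inclusive boundaries are all assumptions made for illustration. The dictionary keys (start, end, domain_type) follow the domain dictionary format described elsewhere in this section.

```python
def missing_regions(protein_length, boundaries, new_domain_type='missing'):
    """Build domain dictionaries for stretches not covered by any boundary pair.

    boundaries is a list of (start, end) tuples; positions are assumed
    to be 1-indexed and inclusive.
    """
    gaps = []
    next_free = 1  # first position not yet covered by a domain

    for start, end in sorted(boundaries):
        if start > next_free:
            gaps.append({'start': next_free,
                         'end': start - 1,
                         'domain_type': new_domain_type})
        # overlapping domains must not move the cursor backwards
        next_free = max(next_free, end + 1)

    # trailing gap between the last domain and the C-terminus
    if next_free <= protein_length:
        gaps.append({'start': next_free,
                     'end': protein_length,
                     'domain_type': new_domain_type})

    return gaps


# two overlapping domains (10-30 and 25-60) in a 100-residue protein
for gap in missing_regions(100, [(10, 30), (25, 60)]):
    print(gap['start'], gap['end'], gap['domain_type'])
```

Here the uncovered stretches are residues 1-9 and 61-100.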

build_domains_from_track_values(proteome, track_name, binerize_function, domain_type, gap_closure=3, minimum_region_size=20, extend_ends=None, verbose=True)

Function which takes a Proteome and builds a set of domains based on the values tracks in each Protein in that Proteome. This effectively allows you to discretize some continuous variable into distinct local domains, which can often facilitate specific types of analysis. This conversion is done using a custom-passed binerize function which converts a normal track into a track of 0s and 1s. Residues that are assigned a value of 1 will be included in a domain assuming they fall within a contiguous region of sufficient size, as defined by the parameters gap_closure and minimum_region_size, as discussed below.

This function operates at the level of an entire Proteome, and is stateless (i.e. does not directly alter the passed proteome). Instead, the function returns a dictionary where keys are the unique_IDs of proteins and each value is a list of one or more Domain dictionaries (with start, end, and domain_type key:value pairs).

The domains dictionary can be added to a proteome using si_domains.add_domains_from_dictionary(). As an example, a possible workflow is as follows:

>>> d = build_domains_from_track_values(proteome, 'cool_track', trackfx, 'cool_domain')
>>> si_domains.add_domains_from_dictionary(proteome, d)

Under the hood, the function works by cycling through each protein, extracting the track, and converting into domains.

If a protein is too short or it lacks a given track, the protein is skipped.

Parameters:
  • proteome (shephard.proteome.Proteome) – The Proteome which is going to be scanned for each track. Note that the underlying Proteome is not altered by this function

  • track_name (string) – Name of the track to convert. If the track name does not exist in a given protein that protein is skipped. In this way, a Proteome where only a subset of Proteins have tracks can be parsed without issue. The track must be a values track; symbols tracks should be converted to a values track first to avoid issues.

  • binerize_function (function) – A function which takes a track and converts it to 0 or 1 (binerize, as in, make binary). This enables a complex and continuous track to be converted into a binary classification, which is practically what a domain assignment needs (yes/no inside domain). This function must take in a single variable (the track values) and return a new list or numpy array that is the same length as the track values but possesses only 0 and 1 in each element.

  • domain_type (str) – String that defines the name of the new domains to create. Can in principle be anything.

  • gap_closure (int (default = 3)) – Defines spacing between 1s or 0s that will be filled in to generate contiguous stretches of 0s or 1s. This helps avoid a scenario where breaks in contiguous stretches impede the definition of a domain above a certain size, as defined by minimum_region_size. In general a value of 3 works reasonably well in most scenarios.

  • minimum_region_size (int (default = 20)) – Defines the smallest size for a domain allowed. This can be varied depending on the question or data, and it may make sense to have corresponding changes in gap_closure if this value becomes substantially larger than a gap_closure of 3.

  • extend_ends (int (default = None)) – This is a somewhat niche feature which, if set to a number, means that we check the extend_ends-th value at the N- and C-terminus of the binarized track, and if it is 1, set all values from that position to the N and/or C terminus to 1. This is provided because sometimes binerize functions will inherently struggle with the very ends of sequences, so this provides a way to cast the first and last extend_ends values to be 1. This is fairly specific and probably only worth using in a scenario where there is a clear issue.

  • verbose (bool (default = True)) – This flag enables the function to print a status update every 500 proteins. If the binerize function is expensive this can be useful to confirm that progress is being made.

Returns:

Returns a dictionary of key-value pairs, where each key is a unique ID and each value is a list of one or more domain dictionaries. This return dictionary can be directly added to a Proteome using the si_domains.add_domains_from_dictionary() function.

Return type:

dict
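To make the roles of the binerize function, gap_closure, and minimum_region_size concrete, the sketch below works on a plain list of track values. It is a simplified illustration and not SHEPHARD's implementation: the names threshold_binerize and call_regions are hypothetical, the 0.5 threshold is arbitrary, only gaps of 0s flanked by 1s are filled, and the returned positions are assumed to be 1-indexed and inclusive.

```python
def threshold_binerize(values, threshold=0.5):
    """Example binerize function: 1 where the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def call_regions(binary, gap_closure=3, minimum_region_size=20):
    """Return (start, end) pairs for runs of 1s that are long enough."""
    filled = list(binary)
    n = len(filled)

    # step 1: fill short runs of 0s that sit between two 1s
    i = 0
    while i < n:
        if filled[i] == 0:
            j = i
            while j < n and filled[j] == 0:
                j += 1
            # the gap spans indices i..j-1; skip gaps touching either terminus
            if i > 0 and j < n and (j - i) <= gap_closure:
                filled[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1

    # step 2: collect runs of 1s and drop those that are too short
    regions = []
    start = None
    for idx, v in enumerate(filled + [0]):   # sentinel 0 closes a trailing run
        if v == 1 and start is None:
            start = idx
        elif v == 0 and start is not None:
            if idx - start >= minimum_region_size:
                regions.append((start + 1, idx))   # 1-indexed, inclusive
            start = None
    return regions


# 12 high values, a 2-residue dip, 10 high values, then 10 low values
values = [0.9] * 12 + [0.1] * 2 + [0.8] * 10 + [0.2] * 10
print(call_regions(threshold_binerize(values)))   # [(1, 24)]
```

With the default gap_closure of 3 the two-residue dip is bridged and a single 24-residue region is reported. With gap_closure=1 the two flanking stretches stay separate, and both fall below minimum_region_size.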

Sequence tools functions

Sequence-associated functions are located in the shephard.tools.sequence_tools module. These tools enable manipulation or search of sequence information.

build_mega_string(object_list, return_as_list=False)

This takes a list of Protein or Domain SHEPHARD objects and builds a single long str object that concatenates all object.sequence elements together (i.e. a “megastring”).

Allowed types for the object_list are Protein and Domain objects.

This string can be used for simple statistical analysis of composition.

Parameters:
  • object_list (list) – List of SHEPHARD objects with object.sequence variable, for example, a list of Domains or a list of Proteins

  • return_as_list (bool) – If set to True, rather than a single megastring, the function returns a list of sequences from the objects in question.

Returns:

Returns either a single concatenated str of the amino acid sequences associated with the passed objects or, if return_as_list is True, a list of those sequences.

Return type:

str or list
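As an illustration of the "simple statistical analysis of composition" mentioned above, the sketch below computes amino acid fractions from a megastring. To stay self-contained it uses plain strings in place of the .sequence attribute of Protein or Domain objects, and the example sequences are made up.

```python
from collections import Counter

# plain strings standing in for object.sequence values
sequences = ['MKKLA', 'GGSAK']

# equivalent in spirit to a megastring built from these sequences
megastring = ''.join(sequences)

# fraction of the megastring made up by each residue type
counts = Counter(megastring)
composition = {aa: n / len(megastring) for aa, n in counts.items()}

print(megastring)         # MKKLAGGSAK
print(composition['K'])   # 0.3
```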

find_string_positions(query, target, protein_indexing=True)

Returns a list of start positions where query is found in target, including overlapping matches.

Note that by default the indices use 1-indexing so that this works directly with protein sequence numbering. However, for manipulating Python strings this may be undesirable and 0 indexing may be better, in which case setting protein_indexing = False will address this.

Practically, this uses Python's re module under the hood and searches left-to-right across the target, so if you want to get fancier with your searching you can always pass in a regular expression.

Examples

Convenient regular expression syntax includes:

  1. '.' for wildcards (e.g. 'L.P' would match an L and a P separated by any single character)

  2. [AC] for matching any one of a set of residues (e.g. residue A or C).

The Python re module supports considerably more complex pattern matching than this.

Parameters:
  • query (str) – The search query.

  • target (str) – The string that is searched for one or more occurrences of the query

  • protein_indexing (bool) – Flag which, if set to True, means the first residue in a string indexes at ‘1’ instead of ‘0’ (as would be normal in Python). If set to False, then indexing is done from 0.

Returns:

Returns a list with the start positions

Return type:

list
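Overlapping matches and the two indexing conventions can be demonstrated with the standard re module alone. The sketch below is one way to obtain this behavior and is not necessarily how SHEPHARD does it: the helper name find_positions is hypothetical, and wrapping the query in a zero-width lookahead is an assumed technique chosen so that matches may overlap.

```python
import re


def find_positions(query, target, protein_indexing=True):
    """Start positions of every (possibly overlapping) match of query in target."""
    offset = 1 if protein_indexing else 0

    # a zero-width lookahead consumes no characters, so the search
    # can report a match starting at every position
    pattern = re.compile('(?=' + query + ')')

    return [m.start() + offset for m in pattern.finditer(target)]


# overlapping matches, reported with protein (1-based) numbering
print(find_positions('AA', 'AAAKAA'))                             # [1, 2, 5]

# wildcard query, reported with Python (0-based) numbering
print(find_positions('L.P', 'MLAPLLP', protein_indexing=False))   # [1, 4]
```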

Site tools functions

Site-associated functions are located in the shephard.tools.site_tools module. These tools enable manipulation or search of site information.

build_site_density_vector(protein, site_types=None, window_size=30, append_leading_lagging=True)

Function that constructs a sliding-window density vector of sites along a protein.

site_types is a list of one or more site types.

This tool is stateless - i.e. it does not alter the passed protein but instead only generates a numerical list which could be added as a track.

Parameters:
  • protein (shephard.protein.Protein object) – Protein object over which sites are identified

  • site_types (str or list of strings) – One or more possible site types that may be found in the protein. Either a single string or a list of strings can be passed, allowing for one or more site types to be grouped together

  • window_size (int) – Size of sliding window over which site density is calculated

  • append_leading_lagging (bool) – Flag that if True will mean the function returns a numerical vector equal in length to the protein. If False, will return a shorter vector and not add leading/lagging values.

Returns:

Returns a list of values, where the value at each position reports on the local density of sites averaged over the window_size. The list is equal in length to the protein when append_leading_lagging is True.

Return type:

list
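A sliding-window density can be sketched on a list of site positions. The example below is an illustration under stated assumptions rather than the library's implementation: the helper name site_density is hypothetical, site positions are taken to be 1-indexed, density is defined as the fraction of window positions that carry a site, and the leading/lagging values are produced by repeating the first and last window values.

```python
def site_density(sequence_length, site_positions, window_size=30,
                 append_leading_lagging=True):
    """Sliding-window fraction of positions that carry a site."""
    # binary vector with a 1 at every site (1-indexed positions assumed)
    mask = [0] * sequence_length
    for pos in site_positions:
        mask[pos - 1] = 1

    # mean of the mask over every full-length window
    density = [sum(mask[i:i + window_size]) / window_size
               for i in range(sequence_length - window_size + 1)]

    if append_leading_lagging and density:
        # pad to the full sequence length by repeating the edge values;
        # this padding scheme is an assumption made for this sketch
        lead = (window_size - 1) // 2
        lag = window_size - 1 - lead
        density = [density[0]] * lead + density + [density[-1]] * lag

    return density


padded = site_density(10, [2, 3, 9], window_size=4)
unpadded = site_density(10, [2, 3, 9], window_size=4,
                        append_leading_lagging=False)

print(len(padded), len(unpadded))   # 10 7
```

The unpadded vector has one value per full window (sequence length minus window_size plus one), which is why it is shorter than the protein.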