Protein

Protein overview

Protein objects are the major unit of data storage in SHEPHARD, and each Proteome is made up of zero or more Protein objects. Each Protein object is associated with one amino acid sequence and a collection of possible annotations, outlined below.

Proteins contain four possible types of metadata:

  1. Attributes: Arbitrary key-value pairings, where the key must be a string and the value can be any Python object (str, int, dict, figure handle, simulation object, lambda function, or any other variable). While attributes with complex datatypes cannot be easily saved within SHEPHARD, simple datatypes (i.e. those that can be cast to strings) can be written and read using the si_protein_attributes module.

  2. Tracks: Vectors that are the same length as the protein, are either numerical or symbolic, and contain residue-specific metadata projected over the whole sequence.

  3. Domains: Continuous sub-regions within the protein

  4. Sites: Individual site positions within the protein

Each of these can be accessed using associated functions.

Protein indexing

In the field of biology, the indexing system used to describe regions and sites (1) starts at 1 and (2) is inclusive.

For example, if we had a sequence of MAEPQRDG and wanted the region defined by 2-4 this would reflect AEP. In contrast, the Python programming language (1) indexes from 0 and (2) is exclusive for ranges, so (using slice notation, for example) MAEPQRDG[2:4] would yield EP.
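
The arithmetic behind the two conventions can be sketched in plain Python. The helper name region below is hypothetical and is not part of SHEPHARD; it only illustrates that a 1-based, inclusive region start-end corresponds to the Python slice [start-1:end].

```python
def region(seq, start, end):
    """Return residues start..end using 1-based, inclusive positions.

    Hypothetical helper for illustration only (not a SHEPHARD function).
    """
    if start < 1 or end > len(seq) or start > end:
        raise ValueError("positions must satisfy 1 <= start <= end <= len(seq)")
    # a 1-based inclusive region [start, end] is the Python slice [start-1:end]
    return seq[start - 1:end]


seq = "MAEPQRDG"
print(region(seq, 2, 4))  # biology-style region 2-4 -> AEP
print(seq[2:4])           # naive Python slice with the same numbers -> EP
```

The same numbers give different answers under the two conventions, which is why a dedicated accessor is less error-prone than slicing by hand.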

Within SHEPHARD, all user-facing tools operate using biology-style indexing. This means you can read in data directly from native databases without worrying about conversion, and use the same number scheme always. Because of this, to sub-select regions of the protein sequence we STRONGLY recommend using the functions .get_sequence_region(start,end) or .get_sequence_context(pos), rather than directly slicing the .sequence property.

As an example:

from shephard import Proteome
P = Proteome()

# add a protein with the sequence AACCDDEEFF, the name 'my cool protein'
# and the unique_ID 'test1'
P.add_protein('AACCDDEEFF', 'my cool protein', 'test1')

# slice notation into sequence (BAD)
print(P.protein('test1').sequence[1:3])
>>> AC

# using get_sequence_region (GOOD)
print(P.protein('test1').get_sequence_region(1,3))
>>> AAC

# using residue() to extract a specific residue; the second
# residue should be an A
print(P.protein('test1').residue(2))
>>> A

class Protein(seq, name, proteome, unique_ID, attributes=None)

Protein properties

name()

Returns the protein name.

Returns:

Returns a string that corresponds to the protein name.

Return type:

str

proteome()

Returns the Proteome object this protein is associated with.

Returns:

Returns a Proteome object that contains this Protein.

Return type:

Proteome

sequence()

Returns the protein amino acid sequence as a Python string (str). Recall that in strings indexing occurs from 0 and is non-inclusive. For proteins/biology indexing is from 1 and is inclusive.

i.e. for sequence ‘MAPSTA…’ real/biological indexing of region 1-3 would give you ‘MAP’ while Python’s indexing would give you ‘AP’.

As a result BEWARE if using the raw sequence for analysis! The Protein class provides a get_sequence_region(), get_sequence_context() and analogous functions for tracks that allow you to use normal indexing to select ranges or regions around a specific point. We suggest this is a safer way to extract vectorial information.

Returns:

Amino acid sequence associated with the protein.

Return type:

str

unique_ID()

Returns the protein’s unique_ID

Returns:

Returns the protein’s unique_ID

Return type:

str

Sequence functions

residue(self, position)

Function that returns the natural residue found at a given position.

Parameters:

position (int) – Position of interest.

Returns:

Returns a single character that corresponds to the residue at the position of interest.

Return type:

str

get_sequence_region(self, start, end)

Function that allows a region of the sequence to be extracted out.

Parameters:
  • start (int) – Start position for region

  • end (int) – End position for region (note this is inclusive)

Returns:

Returns a string that corresponds to the region of interest

Return type:

str

get_sequence_context(self, position, offset=5, return_indices=False)

Function that allows a local region of the sequence centered on a specific position to be extracted, including +/- an offset border that intelligently truncates if the offset would extend outside the sequence region.

Parameters:
  • position (int) – Position for which we’ll interrogate the local sequence

  • offset (int (default = 5)) – Plus/Minus offset used to investigate the region around the position. Note that an offset is symmetrical around the position.

  • return_indices (bool (default = False)) – Flag which, if set to True, means this function returns a TUPLE where position 0 is the string corresponding to the region of interest, position 1 is the start index (in normal SHEPHARD indexing, i.e. starting from 1) and position 2 is the end index (in normal SHEPHARD indexing).

Returns:

If return_indices is set to False, this just returns a string that corresponds to the region of interest.

If return_indices is set to True, this returns a tuple containing the string that corresponds to the region of interest, followed by the start and end positions of that region (inclusive, indexing from 1).

Return type:

str, (str, int, int)
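
As a rough illustration of the truncation behaviour described above, the following standalone sketch clamps the window to the ends of the sequence. The function sequence_context is a hypothetical stand-in written for this example and is not SHEPHARD's implementation; it assumes the window is simply cut off at the first and last residue.

```python
def sequence_context(seq, position, offset=5, return_indices=False):
    """Return the residues within +/- offset of a 1-based position.

    Hypothetical sketch; positions are 1-based and inclusive.
    """
    if position < 1 or position > len(seq):
        raise ValueError("position lies outside the sequence")

    # clamp the window so it never extends beyond either end of the sequence
    start = max(1, position - offset)
    end = min(len(seq), position + offset)

    context = seq[start - 1:end]
    if return_indices:
        return (context, start, end)
    return context


# position 2 with offset 3 would start at -1, so the window is truncated to 1
print(sequence_context("MAEPQRDG", 2, offset=3, return_indices=True))
# ('MAEPQ', 1, 5)
```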

check_sequence_is_valid(self)

Function that checks if the current protein sequence is valid (i.e. consists of only the standard 20 amino acids).

Returns:

Returns True if all residues are in the standard 20 amino acids, and False if not.

Return type:

bool
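
One way such a check could be written is shown below. This is a minimal sketch and not SHEPHARD's own code; the names STANDARD_AMINO_ACIDS and is_valid_sequence are assumptions made for the example.

```python
# the 20 standard amino acids, one-letter codes
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(seq):
    """Return True if every residue is one of the 20 standard amino acids."""
    return all(residue in STANDARD_AMINO_ACIDS for residue in seq)


print(is_valid_sequence("MAEPQRDG"))  # True
print(is_valid_sequence("MAXPQ"))     # False, X is not a standard residue
```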

convert_to_valid(self, copy=False, safe=True)

Function that converts non-standard amino acid residues to standard ones and applies this version to the Protein’s sequence.

Specifically:

B -> N

U -> C

X -> G

Z -> Q

* -> <empty string>

- -> <empty string>

By default this alters the underlying sequence. If you wish to return a copy of the altered sequence instead set copy=True. Otherwise the underlying sequence is changed. Note that removing the * and - characters will change the sequence length which could cause major issues as none of the internal position-specific references will automatically update. Note that if safe=True such changes will trigger an exception.

Parameters:
  • copy (bool (default = False)) – Boolean flag - if set to true a copy of the updated sequence is returned, if False then the function returns None. In both cases the associated protein’s sequence is altered.

  • safe (bool (default = True)) – Boolean flag that defines how to respond if an update changes the sequence length. If set to true, a change that alters the sequence length will trigger an exception, if False it will continue unannounced.

Returns:

If copy = False then no return value is provided. If copy = True then the function returns a string.

Return type:

None, str
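
The substitution table above can be expressed as a translation table. The sketch below is a standalone illustration: to_valid and CONVERSION are hypothetical names, and ValueError stands in for whatever exception SHEPHARD raises. It shows why removing * or - interacts with the safe flag.

```python
# mapping taken from the table above; '*' and '-' are deleted (mapped to None)
CONVERSION = str.maketrans(
    {"B": "N", "U": "C", "X": "G", "Z": "Q", "*": None, "-": None}
)


def to_valid(seq, safe=True):
    """Return seq with non-standard residues converted (hypothetical helper)."""
    converted = seq.translate(CONVERSION)
    # deleting '*' or '-' shortens the sequence, which would invalidate any
    # position-specific annotations, so refuse unless safe is switched off
    if safe and len(converted) != len(seq):
        raise ValueError("conversion changed the sequence length")
    return converted


print(to_valid("MBUXZ"))             # MNCGQ
print(to_valid("MA*-", safe=False))  # MA
```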

Attribute functions

attributes()

Provides a list of the keys associated with every attribute associated with this protein.

Returns:

returns a list of the attribute keys associated with the protein.

Return type:

list

attribute(self, name, safe=True)

Function that returns a specific attribute as defined by the name.

Recall that attributes are name : value pairs, where the ‘value’ can be anything and is user defined. This function will return the value associated with a given name.

Parameters:
  • name (str) – The attribute name. A list of valid names can be found by calling <Protein>.attributes().

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if no attribute with this name exists. If False, None is returned instead.

Returns:

Will either return whatever was associated with that attribute (which could be anything) or None if that attribute is missing.

Return type:

Unknown

add_attribute(self, name, val, safe=True)

Function that adds an attribute. Note that if safe is true, this function will raise an exception if the attribute is already present. If safe=False, then an existing value will be overwritten.

Parameters:
  • name (str) – The parameter name that will be used to identify it

  • val (<anything>) – An object or primitive we wish to associate with this attribute.

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if an attribute with the same name already exists; otherwise the newly introduced attribute will overwrite the previous one.

Return type:

None - but adds an attribute to the calling object

remove_attribute(self, name, safe=True)

Function that removes a given attribute from the Protein based on the passed attribute name. If the passed attribute does not exist or is not associated with the Protein then this will trigger an exception unless safe=False.

Parameters:
  • name (str) – The parameter name that will be used to identify it

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if an attribute with this name does not exist. If set to False, a missing attribute is simply ignored.

Returns:

No return type but will remove an attribute from the protein if present.

Return type:

None
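
The safe flag follows the same pattern across the three attribute functions. The sketch below mimics that pattern with a plain dictionary. The class AttributeStore is hypothetical, KeyError stands in for SHEPHARD's own exception type, and the sketch assumes that attribute() with safe=True raises when the name is missing.

```python
class AttributeStore:
    """Hypothetical stand-in for the attribute handling described above."""

    def __init__(self):
        self._attributes = {}

    def add_attribute(self, name, val, safe=True):
        # safe=True refuses to overwrite; safe=False silently overwrites
        if safe and name in self._attributes:
            raise KeyError(f"attribute '{name}' already exists")
        self._attributes[name] = val

    def attribute(self, name, safe=True):
        # safe=True raises on a missing name; safe=False returns None
        if name not in self._attributes:
            if safe:
                raise KeyError(f"no attribute named '{name}'")
            return None
        return self._attributes[name]

    def remove_attribute(self, name, safe=True):
        # safe=True raises on a missing name; safe=False ignores it
        if name not in self._attributes:
            if safe:
                raise KeyError(f"no attribute named '{name}'")
            return
        del self._attributes[name]


store = AttributeStore()
store.add_attribute("localization", "nucleus")
store.add_attribute("localization", "cytoplasm", safe=False)  # overwrite
print(store.attribute("localization"))         # cytoplasm
print(store.attribute("missing", safe=False))  # None
```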

Domain functions

domains()

Returns a list of the Domain objects associated with this protein, sorted by the first residue of each domain.

domain_names()

Returns a list of the domain names associated with this protein

domain(self, name, safe=True)

Function that returns a specific domain as defined by the name. Note it is often more useful to request a domain by type rather than by the name, in which case get_domains_by_type(<domain_type>) is the relevant syntax. Note domains can also be requested based on position (get_domains_by_position).

Parameters:
  • name (string) – The Domain name. A list of valid names can be found by calling <Protein>.domain_names (which returns a list of the valid domain names).

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if no domain exists with this name. If False, the function will return None instead.

Returns:

Will either return the Domain object associated with the name, OR will return None if safe=False and there was no Domain object that matched the name.

Return type:

Unknown

domain_types()

Returns a list of the unique domain types associated with this protein. There will be no duplicates here.

add_domain(self, start, end, domain_type, attributes=None, safe=True, autoname=False)

Function that adds a domain at a specific position, automatically generating a unique name if none is provided. Domain type can be used to assign a specific type if we want to retrieve domains of a specific type at some point. Position indexing is done from 1 - i.e. the first residue in a protein is 1, not 0.

Parameters:
  • start (int) – Position of the start of the domain, inclusive.

  • end (int) – Position of the end of the domain, inclusive. i.e. if we had a domain that ran from start=10 to end=20, it would be 11 residues long and include residues [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20].

  • domain_type (str) – Non-unique string that allows a type identifier to be associated with a domain.

  • attributes (dict (default = None)) – Optional dictionary which allows an arbitrary set of attributes to be associated with a domain, in much the same way that they can be associated with a protein.

  • safe (bool (default = True)) – If set to True, overwriting an existing domain will raise an exception; otherwise the existing domain is silently overwritten.

  • autoname (bool (default = False)) – If autoname is set to True, this function ensures each domain ALWAYS has a unique name - i.e. this allows multiple domains to be perfectly overlapping in position and type. This is generally not going to be required and/or make sense, but having this feature in place is useful. In general we want to avoid this as it makes it easy to include duplicates, which by default are prevented when autoname=False.

add_domains(self, list_of_domains, safe=True, autoname=False, verbose=False)

Function that takes a list of domain dictionaries and adds those domains to the protein.

Each domain dictionary within the list must have a key-value pair that defines the following info:

  • start - domain start position (in real sequence, not i0 indexing)

  • end - domain end position (in real sequence, not i0 indexing)

  • domain_type - type of the domain (string)

  • attributes - a dictionary of attributes to associate with the domain (optional)

Note that start, end, and domain_type are the only required key-value pairs in the dictionary.

If you wish to add many domains to many proteins, see interfaces.si_domains.add_domains_from_dictionary()

Parameters:
  • list_of_domains (list) –

    A list of domain dictionaries. A “domain dictionary” is defined above, but in short is a dictionary with the following key-value pairs:

    • REQUIRED:
      • start - int (domain start position)

      • end - int (domain end position)

      • domain_type - string (domain type)

    • OPTIONAL:
      • attributes - dictionary of arbitrary key-value pairs that will be associated with the domain

  • safe (bool (default = True)) – If set to True over-writing domains will raise an exception. If False, overwriting a domain will silently over-write.

  • autoname (bool (default = False)) – If autoname is set to True, this function ensures each domain ALWAYS has a unique name - i.e. this allows multiple domains to be perfectly overlapping in position and type. This is generally not going to be required and/or make sense, but having this feature in place is useful. In general we want to avoid this as it makes it easy to include duplicates, which by default are prevented when autoname=False.

  • verbose (bool (default = False)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding domains.

Returns:

No return value, but will add the passed domains to the protein or throw an exception if something goes wrong!

Return type:

None
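
A list of domain dictionaries might look like the following. The domain types and attribute values are made up for illustration, and check_domain_dictionaries is a hypothetical helper (not part of SHEPHARD) that only verifies the required keys listed above.

```python
REQUIRED_KEYS = {"start", "end", "domain_type"}


def check_domain_dictionaries(list_of_domains):
    """Raise ValueError if any domain dictionary lacks a required key."""
    for index, domain in enumerate(list_of_domains):
        missing = REQUIRED_KEYS - set(domain)
        if missing:
            raise ValueError(f"domain {index} is missing keys: {sorted(missing)}")
    return True


# positions use 1-based sequence numbering; 'attributes' is optional
list_of_domains = [
    {"start": 1, "end": 4, "domain_type": "example_type_A"},
    {"start": 5, "end": 10, "domain_type": "example_type_B",
     "attributes": {"source": "manual"}},
]

check_domain_dictionaries(list_of_domains)

# with a Protein object (here called protein) the list could then be passed as:
# protein.add_domains(list_of_domains)
```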

get_domains_by_type(self, domain_type, perfect_match=True)

Function that returns a list of domains as matched against a specific domain type name.

Parameters:
  • domain_type (string) – String associated with the domain_type that you want to search for.

  • perfect_match (bool (default = True)) – Flag that identifies if the domain type should be a perfect match (=True) or if the string passed should just appear somewhere in the domain_type.

Returns:

Returns a list of Domain objects that match the requested type. Objects are ordered by starting position in sequence.

Return type:

list

get_domains_by_position(self, position, wiggle=0)

Function that allows all domains found at a position to be returned.

Wiggle defines +/- residues that are allowed (default = 0) in the search operation.

Parameters:
  • position (int) – Residue position of interest (position in sequence).

  • wiggle (int (default = 0)) – Value +/- the position (i.e. lets you look at sites around a specific position).

Returns:

Returns a list of Domain objects in the order they appear in the protein.

Return type:

list
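
The effect of wiggle can be sketched with plain tuples. The function domains_at_position below is hypothetical and not SHEPHARD's implementation; it assumes domain boundaries are 1-based and inclusive, and that wiggle widens the searched position to the window position-wiggle to position+wiggle.

```python
def domains_at_position(domains, position, wiggle=0):
    """Return (start, end, domain_type) tuples overlapping position +/- wiggle.

    Hypothetical sketch; domain boundaries are treated as inclusive.
    """
    hits = [
        d for d in domains
        if d[0] <= position + wiggle and d[1] >= position - wiggle
    ]
    # report domains in the order they appear along the sequence
    return sorted(hits, key=lambda d: d[0])


domains = [(50, 100, "type_A"), (10, 20, "type_B")]
print(domains_at_position(domains, 21))            # []
print(domains_at_position(domains, 21, wiggle=1))  # [(10, 20, 'type_B')]
```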

get_domains_by_position_and_type(self, position, domain_type, wiggle=0)

Function that allows all domains found at a position and of a specific type to be returned.

Wiggle defines +/- residues that are allowed (default = 0) in the search operation.

Parameters:
  • position (int) – Residue position of interest (position in sequence).

  • domain_type (str) – String used to match against the domain types.

  • wiggle (int (default = 0)) – Value +/- the position (i.e. lets you look at sites around a specific position).

Returns:

Returns a list of Domain objects in the order they appear in the protein.

Return type:

list

get_domains_by_range(self, start, end, wiggle=0, mode='overlap-strict')

Function that allows all domains in a protein that are found within a given range to be returned. Three possible modes can be used here; ‘internal’, ‘overlap-strict’ and ‘overlap’ (default = ‘overlap-strict’).

‘internal’ means that the range defined by start and end is 100% within the domains identified. For example, if a domain was between positions 50 and 100 then a range of 60 to 80 would identify that domain but a range of (say) 40 to 120 would not. This is the least permissive mode.

‘overlap-strict’ means that the range defined by start and end overlaps with the entire domain, but extra residues at the start and the end of the domain are not penalized. For example, if a domain was between positions 50 and 100 then a range of 40 to 120 would be identified because the domain fully overlaps. However, a range of 40 to 70 would not. This is the second least permissive mode, and all domains identified by ‘internal’ are also identified by ‘overlap-strict’.

‘overlap’ means that the range can also straddle domain boundaries. For example, if a domain was between positions 50 and 100 and the range was between 40 and 70 this would count - essentially this means any domains that overlap with the passed range in any way are included. This is the most permissive mode, and all domains identified by ‘internal’ and ‘overlap-strict’ are also identified by ‘overlap’.

Parameters:
  • start (int) – Start of region of interest (position in sequence)

  • end (int) – End of region of interest (position in sequence)

  • wiggle (int (default = 0)) – Value +/- at the edges that are included.

  • mode (str (default = 'overlap-strict')) – Selector that allows the mode to be used for domain overlap to be defined. Must be one of ‘internal’, ‘overlap-strict’, or ‘overlap’. Definitions and meaning described above.

Returns:

Returns a list of Domain objects in the order they appear in the protein.

Return type:

list
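
The three modes can be compared with a small predicate. The sketch below is one reading of the descriptions above, chosen so that each mode also accepts everything the stricter mode accepts; range_matches is a hypothetical function, wiggle is left out, and SHEPHARD's actual implementation may differ in detail.

```python
def range_matches(domain_start, domain_end, start, end, mode="overlap-strict"):
    """Decide whether a domain is selected by the range start..end.

    Hypothetical sketch of the three modes; all positions are inclusive.
    """
    # the range lies entirely inside the domain
    internal = domain_start <= start and end <= domain_end
    # the range covers the whole domain
    covers = start <= domain_start and domain_end <= end
    # the range and the domain share at least one residue
    overlaps = start <= domain_end and domain_start <= end

    if mode == "internal":
        return internal
    if mode == "overlap-strict":
        return internal or covers
    if mode == "overlap":
        return overlaps
    raise ValueError("mode must be 'internal', 'overlap-strict' or 'overlap'")


# domain between positions 50 and 100, as in the examples above
print(range_matches(50, 100, 60, 80, mode="internal"))         # True
print(range_matches(50, 100, 40, 120, mode="internal"))        # False
print(range_matches(50, 100, 40, 120, mode="overlap-strict"))  # True
print(range_matches(50, 100, 40, 70, mode="overlap-strict"))   # False
print(range_matches(50, 100, 40, 70, mode="overlap"))          # True
```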

Track functions

tracks()

Provides a list of Track objects associated with this protein

Returns:

returns a list of the Tracks (order will be consistent but is not sorted).

Return type:

list

track(self, name, safe=True)

Function that returns a specific Track as defined by the name.

Recall that Tracks are defined by a name. If a Track by this name exists this function returns the actual Track object, NOT the values or symbols associated with the track. If a Track by this name does not exist then if safe=True an exception will be raised, otherwise the function returns None.

For direct access to values and symbols, use the <Protein>.get_track_values(<track_name>) and <Protein>.get_track_symbols(<track_name>).

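Parameters:
  • name (str) – The track name. A list of valid names can be found by calling <Protein>.track_names() (which returns a list of the valid track names).

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if no track with this name exists. If False, the function will return None instead.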

Returns:

Will either return the Track object associated with the name, OR will return None if safe=False and there was no Track object that matched the name.

Return type:

Unknown

track_names()

Provides a list of the keys associated with each track associated with this protein.

These keys can then be used to extract a specific track, or can be used to check if a Track is present.

Returns:

returns a list of the track keys associated with the protein.

Return type:

list

get_track_values(self, name, start=None, end=None, safe=True)

Function that returns the values associated with a specific track, as defined by the name.

Recall that tracks are defined by a name. If a track by this name exists this function returns the values IF these are associated with the track. If no values are associated then the function will throw an exception unless safe is set to False, in which case it will return None.

Parameters:
  • name (string) – The track name. A list of valid names can be found by calling <Protein>.track_names() (which returns a list of the valid track names).

  • start (int (default = None)) – If provided defines the start position along the track. If not provided defaults to 1 (first residue in the protein).

  • end (int (default = None)) – If provided defines the end position along the track. If not provided defaults to the final residue in the protein.

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if a track that matches the passed name does not exist.

Returns:

Will either return the values associated with the track, OR will return None if safe=False and there was no Track that matched the name.

Return type:

Unknown

get_track_symbols(self, name, start=None, end=None, safe=True)

Function that returns the symbols associated with a specific track, as defined by the name.

Recall that tracks are defined by a name. If a track by this name exists this function returns the symbols IF these are associated with the track. If no symbols are associated then the function will throw an exception unless safe is set to False, in which case it will return None.

Parameters:
  • name (string) – The track name. A list of valid names can be found by calling <Protein>.track_names() (which returns a list of the valid track names).

  • start (int (default = None)) – If provided defines the start position along the track. If not provided defaults to 1 (first residue in the protein).

  • end (int (default = None)) – If provided defines the end position along the track. If not provided defaults to the final residue in the protein.

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if a track that matches the passed name does not exist.

Returns:

Will either return the symbols associated with the track, OR will return None if safe=False and there was no Track that matched the name.

Return type:

Unknown

add_track(self, name, values=None, symbols=None, safe=True)

Function that adds a track to this protein. For more information on Tracks see the relevant documentation. However, some general guidelines are provided below for convenience.

  • A values track should be a list/array of numerical values

  • A symbols track should be a list or string of symbolic characters

In either case, the iterable should have a 1:1 mapping with the sequence. Finally, Tracks can have both values and symbols, although in general it probably makes sense to use multiple tracks.

Parameters:
  • name (string) – Name for track. NOTE that this is a unique identifier, and each track within a given protein must have a unique name.

  • values (list or np.array (default None)) – A numerical iterable collection of values, where each value maps to a specific residue in the sequence.

  • symbols (list or string (default None)) – A symbolic collection of characters, where each symbol maps to a specific residue in the sequence.

  • safe (bool (default = True)) – If set to True over-writing tracks will raise an exception, otherwise overwriting a track will simply over-write it.

Returns:

Nothing, but adds a track to the calling object.

Return type:

None
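
The 1:1 mapping requirement can be checked before a track is added. The helper check_track below is hypothetical and exists only for this illustration; whether SHEPHARD performs an equivalent check internally is not stated here.

```python
def check_track(sequence, values=None, symbols=None):
    """Raise ValueError if values or symbols do not map 1:1 onto sequence."""
    for label, data in (("values", values), ("symbols", symbols)):
        if data is not None and len(data) != len(sequence):
            raise ValueError(
                f"{label} has length {len(data)}, expected {len(sequence)}"
            )
    return True


sequence = "AACCDDEEFF"
check_track(sequence, values=[0, 0, 1, 1, 0, 0, 1, 1, 0, 0])  # one value per residue
check_track(sequence, symbols="HHHHH-----")                   # one symbol per residue
```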

remove_track(self, track_object, safe=True)

Function that removes a given Track from the Protein based on the passed Track object. If the passed Track does not exist or is not associated with the protein then this will trigger an exception unless safe=False.

Parameters:
  • track_object (shephard.track.Track object or None) – Track object that identifies the track to be removed. Note that remove_track() can tolerate None as the object if safe=False, which enables a single for-loop to iterate over a proteome and remove all tracks of a specific type without worrying about whether the track is present or not.

  • safe (bool (default = True)) – Flag that, if set to True, means an exception (ProteinException) will be raised if the passed track is missing from the underlying protein object. If False, a missing track is ignored.

Returns:

No return type but will remove track from the protein

Return type:

None

build_track(self, name, input_data, track_definition_function, safe=True)

Function that constructs a track using a given track_definition_function and a user-provided input_data object. Very little constraint is set here, other than the fact the name should be a string and track_definition_function should return a dictionary with (at least) two key:value pairings: symbols and values, where the corresponding value for each is bona-fide track input data.

Parameters:
  • name (string) – Name of the track to be used. Should be unique; whether an existing track with the same name is overwritten is controlled by safe.

  • input_data – Some kind of data that will be passed to the track_definition_function

  • track_definition_function (function) – Function that takes in input_data and returns a dictionary with a ‘values’ and a ‘symbols’ key and value pairing. The values that map to ‘values’ and ‘symbols’ will be added as a single new track defined by name.

  • safe (bool (default = True)) – If set to True over-writing tracks will raise an exception, otherwise overwriting a track will simply over-write it.

Returns:

No return type, but a new track is added to the Protein.

Return type:

None
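
A track_definition_function of the kind described above could look like this. The function name helix_track, the meaning of input_data (a list of per-residue probabilities) and the 0.5 threshold are all assumptions made for the example; the only constraint taken from the description is that the function returns a dictionary with ‘values’ and ‘symbols’ keys.

```python
def helix_track(input_data):
    """Convert per-residue probabilities into values and symbols.

    Hypothetical example function; input_data is assumed to hold one
    probability per residue.
    """
    values = [float(p) for p in input_data]
    # mark residues at or above the (arbitrary) 0.5 threshold with 'H'
    symbols = "".join("H" if p >= 0.5 else "-" for p in values)
    return {"values": values, "symbols": symbols}


track_data = helix_track([0.1, 0.7, 0.9, 0.4])
print(track_data["symbols"])  # -HH-

# with a four-residue Protein object (here called protein) this could be used as:
# protein.build_track('helicity', [0.1, 0.7, 0.9, 0.4], helix_track)
```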

build_track_values_from_sequence(self, name, trackfunction, input_dictionary=None, safe=True)

Tracks can be added as pre-loaded values. However, sometimes you want to build a track based on some analysis of the sequence on the fly. This function allows you to pass in your own function (with optional arguments supplied in the input_dictionary) that will take in the protein sequence, generate a new track, and add that track to the protein.

build_track_values_from_sequence allows you to define a function that converts an amino acid sequence into a numerical list or np.array, which gets written as a values track. If you want a symbols track, use build_track_symbols_from_sequence().

Specifically, the argument trackfunction must be a user-defined function. This function can be defined anywhere, but should take either one or two arguments:

  1. The first/only argument should be an amino acid sequence.

  2. The second argument a dictionary of key-value pairs.

When build_track_values_from_sequence is called, the sequence of the protein is passed as the first argument into the trackfunction, and - if present - the input_dictionary is passed as the second argument.

In this way a new track is defined internally, with the trackfunction using the protein's sequence and any passed input_dictionary to convert the sequence into some numerical representation.

Parameters:
  • name (string) – Name of the track to be used. Should be unique; whether an existing track with the same name is overwritten is controlled by safe.

  • trackfunction (function) –

    A user-defined function that has the following properties:

    1. First argument is expected to be the amino acid sequence

    2. Second argument (if provided) should be a dictionary, which is passed (untouched) THROUGH build_track_values_from_sequence to the trackfunction at runtime

  • input_dictionary (dict (default = None)) –

    This is a dictionary that will be passed to the trackfunction as the second argument IF it is provided. In this way, the user can pass an arbitrarily complex set of arguments to the trackfunction each time build_track_values_from_sequence is called.

  • safe (bool (default = True)) – If set to True over-writing tracks will raise an exception, otherwise overwriting a track will simply over-write it.

Example

Below we offer an example of how one might define a custom track-building function:

# define a function that takes in a sequence and converts it
# into some other numerical list. Note this is INLINE with the
# code, or could be elsewhere. This function MUST take either
# ONE argument (sequence) or TWO arguments (sequence and
# input_dictionary). Also the names of these arguments do
# not matter, but the order does (i.e. first argument will
# always get the sequence).

def trackbuilder(seq, input_dictionary):
    '''
        This function takes in a sequence (seq) as first argument,
        and an input_dictionary holding v1 and v2 as second
        argument. See below for what it's doing (pretty simple).

    '''
    newseq=[]

    # we are extracting out the 'values' from the input dictionary
    # for the sake of code clarity
    v1 = input_dictionary['v1']
    v2 = input_dictionary['v2']

    # for each residue in the sequence
    for i in seq:

        # is that residue in v1 (append 1) or v2 (append -1)? If
        # neither append 0
        if i in v1:
            newseq.append(1)
        elif i in v2:
            newseq.append(-1)
        else:
            newseq.append(0)

    return newseq

# define the input_dictionary (note again that the variable names
# here do not matter)
input_dictionary = {'v1':['K','R'], 'v2':['E','D']}

# now assuming ProtOb is a Protein object, this will add a new
# track
ProtOb.build_track_values_from_sequence('charge_vector', trackbuilder,
                                         input_dictionary=input_dictionary)

In this example we defined a function that converts an amino acid string into a numerical list where positively charged residues = +1 and negatively charged residues = -1. We applied this function to generate a ‘charge_vector’ track.

Note this is analogous to defining our function and then running:

s = ProtOb.sequence
newtrack = trackbuilder(s, {'v1':['K','R'], 'v2':['E','D']})
ProtOb.add_track('charge_vector', values=newtrack)

Some FAQs:

  • Do I need to pass an input_dictionary to the custom function? No!

  • Does the name of the custom function matter? No!

  • Does the custom function have to accept the amino acid sequence as the first argument? Yes!
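
The one-argument and two-argument call patterns can be sketched without a Protein object. The helper build_values below is hypothetical and only mirrors the dispatch described above (sequence alone, or sequence plus dictionary); aromatic_mask and charge_vector are example functions written for this illustration.

```python
def build_values(sequence, trackfunction, input_dictionary=None):
    """Call trackfunction with one or two arguments (hypothetical helper)."""
    if input_dictionary is None:
        return trackfunction(sequence)
    return trackfunction(sequence, input_dictionary)


def aromatic_mask(seq):
    # one-argument form: no input_dictionary needed
    return [1 if residue in "FWY" else 0 for residue in seq]


def charge_vector(seq, input_dictionary):
    # two-argument form: residue groups are supplied by the caller
    positive = input_dictionary["v1"]
    negative = input_dictionary["v2"]
    return [1 if r in positive else -1 if r in negative else 0 for r in seq]


print(build_values("AFKWE", aromatic_mask))  # [0, 1, 0, 1, 0]
print(build_values("AKED", charge_vector,
                   {"v1": ["K", "R"], "v2": ["E", "D"]}))  # [0, 1, -1, -1]
```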

build_track_symbols_from_sequence(self, name, trackfunction, input_dictionary=None, safe=True)

Tracks can be added as pre-loaded values. However, sometimes you want to build a track based on some analysis of the sequence on the fly. This function allows you to pass in your own function (with optional arguments supplied in the input_dictionary) that will take in the protein sequence, generate a new track, and add that track to the Protein.

build_track_symbols_from_sequence allows you to define a function that converts an amino acid sequence into a symbolic list or string, which gets written as a symbols track. If you want a values track, use build_track_values_from_sequence().

Specifically, the argument trackfunction must be a user-defined function. This function can be defined anywhere, but should take either one or two arguments:

  1. The first/only argument should be an amino acid sequence.

  2. The second argument a dictionary of key-value pairs.

When build_track_symbols_from_sequence is called, the sequence of the protein is passed as the first argument into the trackfunction, and - if present - the input_dictionary is passed as the second argument.

In this way a new track is defined internally, with the trackfunction using the protein's sequence and any passed input_dictionary to convert the sequence into some other symbolic representation.

Parameters:
  • name (string) – Name of the track to be used. Should be unique; whether an existing track with the same name is overwritten is controlled by safe.

  • trackfunction (function) –

    A user-defined function that has the following properties:

    1. First argument is expected to be the amino acid sequence

    2. Second argument (if provided) should be a dictionary, which is passed (untouched) THROUGH build_track_symbols_from_sequence to the trackfunction at runtime

  • input_dictionary (dict (default = None)) – This is a dictionary that will be passed to the trackfunction as the second argument IF it is provided. In this way, the user can pass an arbitrarily complex set of arguments to the trackfunction each time build_track_symbols_from_sequence is called.

  • safe (bool (default = True)) – If set to True over-writing tracks will raise an exception, otherwise overwriting a track will simply over-write it.

Example

Below we offer an example of how one might define a custom track-building function:

# define a function that takes in a sequence and converts it into some
# other symbolic representation as a string. Note this is INLINE with
# the code, or could be elsewhere. This function MUST take either ONE
# argument (sequence) or TWO arguments (sequence and input_dictionary).
#
# Also the names of these arguments do not matter, but the order does
# (i.e. first argument will always get the sequence).

def trackbuilder(seq, input_dictionary):
    '''
        This function takes in a sequence (seq) as first argument,
        and a dictionary with the keys 'v1' and 'v2' as second argument.
        See below for what it's doing (pretty simple).
    '''
    new_string_list=[]

    # we are extracting out the 'values' from the input dictionary
    # for the sake of code clarity
    v1 = input_dictionary['v1']
    v2 = input_dictionary['v2']

    # for each residue in the sequence
    for i in seq:

        # is that residue in v1 (append '+') or v2 (append '-')? If neither
        # append '0'
        if i in v1:
            new_string_list.append('+')
        elif i in v2:
            new_string_list.append('-')
        else:
            new_string_list.append('0')

    # convert the list into a string
    newstring = "".join(new_string_list)
    return newstring

# define the input_dictionary (note again that the variable names
# here do not matter)
input_dictionary = {'v1':['K','R'], 'v2':['E','D']}

# now assuming ProtOb is a Protein object, this will add a new track
ProtOb.build_track_symbols_from_sequence('charge_string', trackbuilder,
                                         input_dictionary=input_dictionary)

In this example we defined a function that converts an amino acid string into a coarse-grained string representation where positive residues are “+”, negative are “-” and neutral are “0”.

Note this is analogous to defining our function and then running:

s = ProtOb.sequence
newtrack = trackbuilder(s, {'v1':['K','R'], 'v2':['E','D']})
ProtOb.add_track('charge_string', symbols=newtrack)
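To see what such a track looks like, the same coarse-graining can be run on a short test sequence without a Protein object. The block below is a compact restatement of the trackbuilder logic; the test sequence MAEPQRDG is arbitrary and chosen only for illustration.

```python
# Compact restatement of the trackbuilder logic, applied to a plain string so
# that it runs without SHEPHARD. The test sequence is an arbitrary example.
def trackbuilder(seq, input_dictionary):
    positive = input_dictionary['v1']
    negative = input_dictionary['v2']
    symbols = []
    for residue in seq:
        if residue in positive:
            symbols.append('+')
        elif residue in negative:
            symbols.append('-')
        else:
            symbols.append('0')
    return "".join(symbols)


input_dictionary = {'v1': ['K', 'R'], 'v2': ['E', 'D']}
charge_string = trackbuilder("MAEPQRDG", input_dictionary)
print(charge_string)  # 00-00+-0
```

The output has exactly one symbol per residue, so it could be used directly as a symbols track for a protein with that sequence.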

FAQs:

  • Do I need to pass an input_dictionary to the custom function? No

  • Does the name of the custom function matter? No!

  • Does the custom function have to accept the amino acid sequence as the first argument? Yes!

Site functions

sites()

Provides a list of all Site objects associated with the protein. Sorted N to C terminal.

site(self, position, safe=True)

Returns the list of sites found at a given position. In general, site() should be used to retrieve sites you know exist, while get_sites_by_position() offers a safer way to get sites at a position: site() will throw an exception if no site exists at the passed position, while get_sites_by_position() will not.

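Parameters:
  • position (int) – Defines the position in the sequence we want to interrogate

  • safe (bool (default = True)) – If True, an exception is raised when no site is found at the passed position. If False, an empty list is returned instead.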

Returns:

Returns a list of between 0 and n sites. Raises an exception if no site is found at the passed position, unless safe=False, in which case an empty list is returned.

Return type:

list
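The effect of the safe flag can be sketched with a plain dictionary standing in for the protein's sites. This is a simplified model under stated assumptions, not SHEPHARD's code: sites_by_position is a hypothetical store, plain strings replace Site objects, and KeyError is used only as a placeholder for whatever exception SHEPHARD actually raises.

```python
# Hypothetical stand-in for a protein's site store: position -> list of sites.
# Plain strings are used in place of Site objects.
sites_by_position = {3: ['phosphosite'], 7: ['phosphosite', 'acetylation']}


def site(position, safe=True):
    position = int(position)
    if position in sites_by_position:
        return sites_by_position[position]
    if safe:
        # placeholder exception type; the real exception class is not modeled
        raise KeyError(f'No site found at position {position}')
    # safe=False: a missing position yields an empty list
    return []


found = site(7)
missing = site(5, safe=False)
```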

site_types()

Returns a list of the unique site types associated with this protein. There will be no duplicates here.

site_positions()

Provides a list of the positions where a site is found in the protein. Sorted N to C terminal.

add_site(self, position, site_type, symbol=None, value=None, attributes=None)

Function that adds a site to a specific position in the sequence. Sites are indexed by residue position, and multiple sites can co-exist at the same position, so no name is required (unlike Proteins, Tracks or Domains).

site_type is a non-unique identifier that allows sites to be specifically identified/selected.

Sites can be associated with a numerical value, a symbol, or both. Sites can also have attributes associated with them.

If you wish to add many sites to many proteins, see:

interfaces.si_sites.add_sites_from_dictionary()

Parameters:
  • position (int) – Position of site (recall we index from 1, i.e. the first residue in a protein = 1, not 0). Note that this value is cast to int.

  • site_type (string) – Non-unique string that allows a type identifier to be associated with a site.

  • symbol (string (default = None)) – Symbol associated with a site. Symbols are string-based - will often be a single character but could be multiple characters.

  • value (float64 (default = None)) – Numerical value associated with a site. Note that the value is cast to a float64.

  • attributes (dict (default = None)) – Optional dictionary which allows an arbitrary set of attributes to be associated with a site, in much the same way that they can be associated with a protein.

remove_site(self, site_object, safe=True)

Function that removes a given site from the protein based on the passed site object. If the passed site does not exist or is not associated with the protein then this will trigger an exception unless safe=False.

Parameters:
  • site_object (Site object) – The Site object to be removed from the protein. Note that remove_site() can tolerate None as the site_object if safe=False, which enables a single for-loop to iterate over a proteome and remove all sites of a specific type without worrying about whether the site is present or not.

  • safe (bool) – If True, an exception will be raised if the passed site is not found in the protein. If False, a missing site is ignored.

Returns:

No return type but will remove site from the protein

Return type:

None

get_sites_by_position(self, position, wiggle=0, return_list=False)

Get all sites at a specific position

Parameters:
  • position (int) – Residue position of interest (position in sequence)

  • wiggle (int (default = 0)) – Value +/- the position (i.e. lets you look at sites around a specific position)

  • return_list (bool (default = False)) – By default, the function returns a dictionary, which is convenient as it makes it easy to index into one or more sites at a specific position in the sequence. However, you may instead want a list of sites, in which case setting return_list to True will have the function simply return a list of sites. The order of the returned sites is currently not guaranteed.

Returns:

  • dict – Returns a dictionary where the key is a position (location) and the value is a list of one or more sites at that position.

  • list – If return_list is set to True, then a list of Site objects is returned instead.
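The two return formats and the effect of wiggle can be sketched as follows. This is a simplified model, not SHEPHARD's implementation: sites_by_position is a hypothetical store with strings standing in for Site objects, the window is assumed to be inclusive on both sides, and any clamping to the ends of the sequence is not modeled.

```python
# Hypothetical site store: position -> list of site labels (stand-ins for
# Site objects).
sites_by_position = {3: ['phospho'], 4: ['acetyl'], 7: ['phospho', 'methyl']}


def get_sites_by_position(position, wiggle=0, return_list=False):
    # collect sites from position - wiggle to position + wiggle (inclusive)
    found = {}
    for pos in range(position - wiggle, position + wiggle + 1):
        if pos in sites_by_position:
            found[pos] = sites_by_position[pos]
    if return_list:
        # flatten the dictionary into a single list of sites
        return [s for pos in found for s in found[pos]]
    return found


exact = get_sites_by_position(4)
nearby = get_sites_by_position(4, wiggle=1)
flat = get_sites_by_position(4, wiggle=1, return_list=True)
empty = get_sites_by_position(10)
```

Here exact holds only position 4, nearby also picks up position 3, and flat contains the same sites as nearby without the position keys. A position with no sites gives an empty dictionary rather than an exception.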

get_sites_by_range(self, start, end, wiggle=0, return_list=False)

Get all sites within a certain range.

Parameters:
  • start (int) – Start of region of interest (position in sequence)

  • end (int) – End of region of interest (position in sequence)

  • wiggle (int (default = 0)) – Value +/- at the edges that are included.

  • return_list (bool (default = False)) – By default, the function returns a dictionary, which is convenient as it makes it easy to index into one or more sites at a specific position in the sequence. However, you may instead want a list of sites, in which case setting return_list to True will have the function simply return a list of sites. The order of the returned sites is currently not guaranteed.

Returns:

  • dict – Returns a dictionary where the key is a position (location) and the value is a list of one or more sites at that position.

  • list – If return_list is set to True, then a list of Site objects is returned instead.

get_sites_by_type(self, site_types, return_list=False)

Get a set of sites that match a specified site-type.

Parameters:
  • site_types (string or list of strings) – One or more possible site_types that may be found in the protein. Either a single string or a list of strings can be passed, allowing for one or more sites to be grouped together.

  • return_list (bool (default = False)) – By default, the function returns a dictionary, which is convenient as it makes it easy to index into one or more sites at a specific position in the sequence. However, you may instead want a list of sites, in which case setting return_list to True will have the function simply return a list of sites. The order of the returned sites is currently not guaranteed.

Returns:

  • dict – Returns a dictionary where the key is a position (location) and the value is a list of one or more sites at that position that match the site type of interest.

  • list – If return_list is set to True, then a list of Site objects is returned instead.

get_sites_by_type_and_range(self, site_types, start, end, wiggle=0, return_list=False)

Returns the sites that both match a type of interest and are found in the range provided.

Parameters:
  • site_types (string or list of strings) – One or more possible site_types that may be found in the protein. Either a single string or a list of strings can be passed, allowing for one or more sites to be grouped together.

  • start (int) – Start residue that defines start of region to be examined

  • end (int) – End residue that defines end of region to be examined

  • wiggle (int (default = 0)) – Value that adds slack symmetrically around the start and end positions.

  • return_list (bool (default = False)) – By default, the function returns a dictionary, which is convenient as it makes it easy to index into one or more sites at a specific position in the sequence. However, you may instead want a list of sites, in which case setting return_list to True will have the function simply return a list of sites. The order of the returned sites is currently not guaranteed.

Returns:

  • dict – Returns a dictionary where the key is a position (location) and the value is a list of one or more sites at that position that match the site type of interest.

  • list – If return_list is set to True, then a list of Site objects is returned instead.
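Combining the type filter with a range and wiggle can be sketched in plain Python. As before, this is an illustrative model rather than SHEPHARD's code: site_records is a hypothetical list of (position, site_type) tuples standing in for Site objects, and the range is assumed to be inclusive at both ends, in line with the biology-style indexing used throughout.

```python
# Hypothetical site records as (position, site_type) tuples; stand-ins for
# Site objects.
site_records = [(2, 'phospho'), (5, 'acetyl'), (6, 'phospho'), (11, 'phospho')]


def get_sites_by_type_and_range(site_types, start, end, wiggle=0,
                                return_list=False):
    # accept either a single string or a list of strings
    if isinstance(site_types, str):
        site_types = [site_types]
    # wiggle widens the window symmetrically at both edges
    lower, upper = start - wiggle, end + wiggle
    found = {}
    for position, site_type in site_records:
        if site_type in site_types and lower <= position <= upper:
            found.setdefault(position, []).append((position, site_type))
    if return_list:
        return [s for hits in found.values() for s in hits]
    return found


in_range = get_sites_by_type_and_range('phospho', 5, 10)
with_wiggle = get_sites_by_type_and_range('phospho', 5, 10, wiggle=1)
both_types = get_sites_by_type_and_range(['phospho', 'acetyl'], 5, 10,
                                         return_list=True)
```

With wiggle=1 the window grows from 5-10 to 4-11, so the site at position 11 is included, while the site at position 2 remains outside.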