Proteome

class Proteome(input_list=None, attributes=None, force_overwrite=False)[source]

The Proteome object is the main unit for information storage in SHEPHARD.

There are a few ways that new Proteomes can be generated:

  • By reading in a FASTA file (using shephard.interfaces.apis.fasta)

The Proteome constructor takes as its main argument (input_list) a list of protein dictionaries or a list of Protein objects. This means Proteome objects can be generated directly (see below for a definition of protein dictionaries). However, it is often more convenient to build Proteomes from FASTA files. For more information on this, see the API documentation.

Protein dictionaries are dictionaries that must contain four elements (others are ignored).

  • sequence : str - Amino acid sequence of the protein.

  • name : str - Name of the protein (this can be anything; it is not used internally, so there are no constraints on what this is).

  • unique_ID : str - This must be unique with respect to all other unique_IDs in the set of proteins in the input list.

  • attributes : dict - Dictionary of one or more attributes to apply to this protein. Key/value pairs in this dictionary can be arbitrary and are user defined.

As an example:

>>> protein_dictionary_example = {'sequence':'ALAPSLLPAMPALSPALSP',
                                  'name': 'my protein fragment',
                                  'unique_ID':'UXX01', 'attributes':{}}
>>> dictionary_list = []
>>> dictionary_list.append(protein_dictionary_example)
>>> P = Proteome(dictionary_list)

Note that sequence, name and unique_ID are cast to str by the function, so if numerical values are passed for any of these they will be converted to strings.

Notes

  • NOTE that ALL FOUR of these are required for EACH protein, even if the attributes dictionary is empty.

  • The unique_ID is checked for uniqueness against all others in the Proteome, and an exception is thrown if it is not unique.

  • Additional proteins can be added using the .add_protein() or .add_proteins() functions.
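
The requirements above (four mandatory keys, unique IDs compared as strings) can be checked before a list is handed to the constructor. The following is a minimal sketch in plain Python; the helper name check_protein_dictionaries and its use of ValueError are assumptions made for illustration and are not part of SHEPHARD.

```python
REQUIRED_KEYS = ("sequence", "name", "unique_ID", "attributes")


def check_protein_dictionaries(dictionary_list):
    """Raise ValueError if an entry lacks a required key or repeats a unique_ID."""
    seen = set()
    for idx, entry in enumerate(dictionary_list):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"Entry {idx} is missing keys: {missing}")
        # unique_IDs are cast to str, so 1 and '1' would collide
        uid = str(entry["unique_ID"])
        if uid in seen:
            raise ValueError(f"Duplicate unique_ID: {uid}")
        seen.add(uid)
    return True


dictionary_list = [
    {"sequence": "ALAPSLLPAMPALSPALSP", "name": "my protein fragment",
     "unique_ID": "UXX01", "attributes": {}},
]
check_protein_dictionaries(dictionary_list)
```

A list that passes this check satisfies the structural requirements listed above; the sequence itself is not inspected.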

Proteome functions

check_unique_ID(self, unique_id)

Function that checks if a given unique ID is found in the Proteome. Note that this check is not needed before requesting a Protein object. Instead, one can call .protein(<unique_ID>, safe=False); with safe=False, if the unique_ID is not found the call simply returns None.

Parameters:

unique_id (string) – String corresponding to a unique_ID associated with some protein

Returns:

Returns True if the passed ID is present, or False if not.

Return type:

bool

attributes()

Provides a list of the keys associated with every attribute associated with this Proteome.

Returns:

returns a list of the attribute keys associated with the Proteome.

Return type:

list

attribute(self, name, safe=True)

Function that returns a specific attribute as defined by the name.

Recall that attributes are name : value pairs, where the ‘value’ can be anything and is user defined. This function will return the value associated with a given name.

Parameters:
  • name (str) – The attribute name. A list of valid names can be found by calling the <Proteome>.attributes() (which returns a list of the valid names).

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if no attribute with the passed name exists. If False, a missing attribute returns None.

Returns:

Will either return whatever was associated with that attribute (which could be anything) or None if that attribute is missing.

Return type:

Unknown

add_attribute(self, name, val, safe=True)

Function that adds an attribute. Note that if safe is True, this function will raise an exception if the attribute is already present. If safe=False, then an existing value will be overwritten.

Parameters:
  • name (str) – The parameter name that will be used to identify it

  • val (<anything>) – An object or primitive we wish to associate with this attribute

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if an attribute with the same name already exists; otherwise the newly introduced attribute will overwrite the previous one.

Return type:

None - but adds an attribute to the calling object

remove_attribute(self, name, safe=True)

Function that removes a given attribute from the Proteome based on the passed attribute name. If the passed attribute does not exist or is not associated with the Proteome then this will trigger an exception unless safe=False.

Parameters:
  • name (str) – The parameter name that will be used to identify it

  • safe (bool (default = True)) – Flag which, if True, will throw an exception if an attribute with this name does not exist. If set to False, an attribute that is not found is simply ignored.

Returns:

No return type but will remove an attribute from the Proteome if present.

Return type:

None
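
The interplay of the safe flag across attribute(), add_attribute() and remove_attribute() can be sketched with a small stand-in class. This is not SHEPHARD's implementation: the class AttributeStore and the exception StoreError are hypothetical, and the sketch assumes that attribute() with safe=True raises when the name is missing.

```python
class StoreError(Exception):
    """Hypothetical exception standing in for the library's own exception type."""


class AttributeStore:
    """Hypothetical stand-in mimicking the documented safe-flag behavior."""

    def __init__(self):
        self._attributes = {}

    def attributes(self):
        # list of attribute names currently stored
        return list(self._attributes)

    def attribute(self, name, safe=True):
        if name not in self._attributes:
            if safe:
                raise StoreError(f"No attribute named {name}")
            return None
        return self._attributes[name]

    def add_attribute(self, name, val, safe=True):
        # safe=True refuses to overwrite; safe=False overwrites silently
        if safe and name in self._attributes:
            raise StoreError(f"Attribute {name} already exists")
        self._attributes[name] = val

    def remove_attribute(self, name, safe=True):
        if name not in self._attributes:
            if safe:
                raise StoreError(f"No attribute named {name}")
            return  # missing attribute is ignored when safe=False
        del self._attributes[name]


store = AttributeStore()
store.add_attribute("organism", "yeast")
store.add_attribute("organism", "human", safe=False)  # overwrite allowed
store.remove_attribute("not_present", safe=False)     # silently ignored
```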

__iter__(self)

Allows a Proteome object to act as a generator that yields actual proteins, so the syntax

for protein in ProteomeObject:
    print(protein.sequence)

is valid and would iterate through the proteins in the Proteome.

This makes performing some analysis over all proteins quite easy.

__contains__(self, m)

Enables the syntax X in Proteome to be used, where X can be either a unique ID or a Protein object.

if protein.unique_ID in ProteomeObject:
    print(f'The protein {protein} is in the Proteome!')

__getitem__(self, key)

Allows indexing and slicing into the Proteome to retrieve individual proteins or subsets of proteins.

first_protein = ProteomeObject[0]
print(f'The first protein is {first_protein}')

__len__(self)

The length of the Proteome is defined as the number of proteins in it.

Returns:

Returns an integer that reflects the number of proteins

Return type:

int
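
The four special methods above are the standard Python container hooks behind for ... in, in, [] and len(). As a rough illustration of how they fit together, the sketch below uses a hypothetical MiniProteome that stores plain sequence strings keyed by unique ID; the real Proteome yields Protein objects, and its internals may differ.

```python
class MiniProteome:
    """Hypothetical stand-in; stores sequences keyed by unique ID."""

    def __init__(self, records):
        # dicts preserve insertion order, which gives a stable index order
        self._records = dict(records)

    def __iter__(self):
        # yields the stored items, not the IDs
        for unique_id in self._records:
            yield self._records[unique_id]

    def __contains__(self, unique_id):
        return unique_id in self._records

    def __getitem__(self, key):
        # key may be an int or a slice
        return list(self._records.values())[key]

    def __len__(self):
        return len(self._records)


mini = MiniProteome({"UXX01": "ALAPSLLPAMPALSPALSP", "UXX02": "MKKLSPAL"})
total_residues = sum(len(sequence) for sequence in mini)
has_first = "UXX01" in mini
first_item = mini[0]
```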

Protein functions

proteins()

Returns a list of unique_IDs that correspond to the proteins in this Proteome. NOTE this returns a list of the IDs, not the actual Protein objects. To get the corresponding protein object one must use the .protein(<unique_ID>) notation.

Returns:

Returns a list of unique IDs

Return type:

list of str

protein(self, unique_ID, safe=True)

Returns the Protein object associated with the passed unique_ID. If there is no Protein associated with the provided unique_ID then if safe=True (default) an exception is raised, while if safe=False then None is returned.

Parameters:
  • unique_ID (string) – String corresponding to a unique_ID associated with some protein

  • safe (bool (default = True)) – If set to True then a missing unique_ID will raise an exception. If False then a missing unique_ID will simply return None

Returns:

Depending on if the passed unique_ID is found in the Proteome, a Protein object or None will be returned.

Return type:

Protein Object, None
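
Because safe=False turns a missing ID into a None return, a batch of IDs can be looked up without exception handling. The helper below is a sketch written against any object that exposes protein(unique_ID, safe=...) as documented here; the function name split_found_missing and the _StandInProteome class (included only so the sketch runs without SHEPHARD installed) are hypothetical.

```python
def split_found_missing(proteome, unique_ids):
    """Return (found_proteins, missing_ids) using protein(..., safe=False)."""
    found, missing = [], []
    for unique_id in unique_ids:
        protein = proteome.protein(unique_id, safe=False)
        if protein is None:
            missing.append(unique_id)
        else:
            found.append(protein)
    return found, missing


class _StandInProteome:
    """Hypothetical stand-in that returns sequences instead of Protein objects."""

    def __init__(self, records):
        self._records = dict(records)

    def protein(self, unique_ID, safe=True):
        if unique_ID not in self._records:
            if safe:
                raise KeyError(unique_ID)
            return None
        return self._records[unique_ID]


demo = _StandInProteome({"UXX01": "ALAPSLLPAMPALSPALSP"})
found, missing = split_found_missing(demo, ["UXX01", "UXX02"])
```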

add_protein(self, sequence, name, unique_ID, attributes=None, force_overwrite=False)

Function that allows the user to add a new protein to a Proteome in an ad-hoc fashion. In general it will make sense to add proteins all at once from some input source, but the ability to add proteins one at a time is also useful.

If a duplicate unique_ID is passed an exception (ProteomeException) is raised.

Parameters:
  • sequence (string) – Amino acid sequence of the protein. Note - no sanity check of the sequence is performed.

  • name (string) – String reflecting the protein name. Again this can be anything.

  • unique_ID (string) – String corresponding to a unique_ID associated with some protein.

  • attributes (dict (default = None)) – The attributes dictionary provides a key-value pairing for arbitrary information. This could include gene names, different types of identifiers, protein copy number, a set of protein partners, or anything else one might wish to associate with the protein as a whole.

  • force_overwrite (bool (default = False)) – If set to False and a unique_ID is passed that is already present, this function will raise an exception. However, if set to True it will automatically overwrite the pre-existing entry.

Returns:

No return status, but the passed protein will be added to the underlying Proteome.

Return type:

None

add_proteins(self, input_list, force_overwrite=False)

Function that allows the user to add multiple new proteins using either a list of protein dictionaries (described below) or a list of Protein objects.

Protein dictionaries

One mode of adding multiple proteins is by passing a list of Protein dictionaries.

Protein dictionaries are dictionaries that possess the following key-value pairs:

'sequence'   : amino acid sequence (str)
'name'       : protein name (str)
'unique_ID'  : The unique identification number used for the
               protein (str)
'attributes' : A dictionary of arbitrary key-value pairs to
               associate with the protein (dict or None)

Additional key/value pairs are ignored and ALL four of these must be included. If any are missing for any protein entry this function raises a ProteomeException.

Protein objects

A second mode of adding multiple proteins is by passing a list of Protein objects.

In both cases, the function automatically determines the type of the passed list and adds the proteins accordingly. Note that in both cases proteins are added by value, i.e. a new Protein object is generated.

Parameters:
  • input_list (list) – List of Protein dictionaries or list of Protein objects

  • force_overwrite (bool (default = False)) – If set to False and a unique_ID is included that already is found then this function will raise an exception. However, if set to True it will automatically overwrite the pre-existing entry.

Returns:

No return status, but valid proteins included in the input_list will be added to the underlying Proteome.

Return type:

None
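
When the source data is tabular rather than FASTA, the list of protein dictionaries has to be assembled by hand. One possible way to do this is sketched below; the helper to_protein_dictionaries and the (unique_ID, name, sequence) tuple layout are assumptions for illustration only.

```python
def to_protein_dictionaries(records):
    """Convert (unique_ID, name, sequence) tuples into protein dictionaries."""
    return [
        {
            "sequence": str(sequence),
            "name": str(name),
            "unique_ID": str(unique_id),
            "attributes": {},  # a fresh, empty dict per protein
        }
        for unique_id, name, sequence in records
    ]


records = [
    ("UXX01", "my protein fragment", "ALAPSLLPAMPALSPALSP"),
    ("UXX02", "another fragment", "MKKLSPAL"),
]
input_list = to_protein_dictionaries(records)
# input_list now has the shape expected by add_proteins(input_list)
```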

remove_protein(self, unique_ID, safe=True)

Function that removes a given protein from the Proteome based on the passed unique_ID. If the passed unique_ID does not exist then this will trigger an exception unless safe=False.

Parameters:
  • unique_ID (str) – Unique ID that will be used to retrieve a given protein

  • safe (bool (default = True)) – Flag that, if set to True, means that if a passed unique_ID is missing from the underlying Proteome object an exception will be raised (ProteomeException). If set to False, a missing unique_ID is ignored.

Returns:

No return type but will remove an entry from the Proteome.

Return type:

None

remove_proteins(self, input_list, safe=True)

Function that removes a set of proteins from the Proteome based on the passed list of unique_IDs. If a passed unique_ID does not exist then this will trigger an exception unless safe=False.

Parameters:
  • input_list (list of str) – List that contains the unique IDs that will be used to select proteins for deletion.

  • safe (bool (default = True)) – Flag that, if set to True, means that if a passed unique_ID is missing from the underlying Proteome object an exception will be raised (ProteomeException). If False, a missing unique_ID is ignored.

Returns:

No return type but will remove the corresponding entries from the Proteome.

Return type:

None

Domain properties

domains()

Function that returns a list of all domain objects associated with the Proteome.

This function is useful if you wish to indiscriminately ask questions of domains without considering the proteins they come from. However, each Domain has a Protein object associated with it (via the .protein operator), so one can always map a Domain back to a Protein.

Returns:

A list of all the Domains from every protein in the Proteome

Return type:

list of Domains

unique_domain_types()

Returns the list of unique Domain types associated with this Proteome.

Returns:

Each element in the list is a string that corresponds to a Domain type.

Return type:

list of str

get_domains_by_type(self, domain_type, perfect_match=True)

Function that returns a list of domains from all proteins that matched against a specific domain type name.

Parameters:
  • domain_type (string) – String corresponding to the domain_type that you want to search for.

  • perfect_match (bool (default = True)) – Flag that identifies if the domain names should be a perfect match (=True) or if the string passed should just appear somewhere in the domain_type string

Returns:

Returns a list of Domain objects that match the requested type. Objects are ordered by starting position in sequence.

Return type:

list
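
The difference between the two perfect_match modes comes down to equality versus substring containment. The function below is a minimal sketch of that distinction, assuming case-sensitive comparison; the name domain_type_matches and the example domain type strings are hypothetical, and the library's actual matching may differ in detail.

```python
def domain_type_matches(query, domain_type, perfect_match=True):
    """Return True if query matches domain_type under the chosen mode."""
    if perfect_match:
        # the whole string must be identical
        return query == domain_type
    # otherwise the query only has to appear somewhere in the domain type
    return query in domain_type


domain_types = ["IDR", "IDR_predicted", "kinase"]
exact = [d for d in domain_types if domain_type_matches("IDR", d)]
partial = [d for d in domain_types if domain_type_matches("IDR", d, perfect_match=False)]
```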

Site properties

sites()

Function that returns a list of all Site objects associated with the Proteome.

This function is useful if you wish to indiscriminately ask questions of sites without considering the proteins they come from. However, each Site has a Protein object associated with it (via the .protein operator), so one can always map a Site back to a Protein.

Returns:

A list of all the Sites from every protein in the Proteome

Return type:

list of Sites

unique_site_types()

Returns the list of unique Site types associated with this Proteome.

Returns:

Each element in the list is a string that corresponds to a Site type

Return type:

list of str

get_sites_by_type(self, site_types)

Function that returns a list of sites from all proteins that matched against a specific site type name or set of site type names.

Parameters:

site_types (string or list of strings) – One or more possible site_types that may be found in the protein. Either a single string or a list of strings can be passed, allowing for one or more sites to be grouped together

Returns:

Returns a list of Site objects that match the requested type(s). Objects are ordered by position in the sequence.

Return type:

list
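
An argument that accepts either a single string or a list of strings is usually normalized to a list before any matching is done. The sketch below shows one common way to do that in Python; normalize_site_types is a hypothetical helper and the site type names are made up for the example.

```python
def normalize_site_types(site_types):
    """Return a list of site type strings from a str or an iterable of str."""
    if isinstance(site_types, str):
        # a bare string is one site type, not a sequence of characters
        return [site_types]
    return list(site_types)


single = normalize_site_types("phosphosite")
several = normalize_site_types(["phosphosite", "acetylation"])
```

The isinstance check matters because a string is itself iterable; without it, "phosphosite" would be split into individual characters.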

Track properties

unique_track_names()

Returns the list of unique Track names associated with this Proteome.

Returns:

Each element in the list is a string that corresponds to a Track name found in one (or more) proteins

Return type:

list of strings

track_names_to_track_type()

Returns a (copy of a) dictionary that maps track name to track type. We return a copy so there’s no way we can accidentally break the internal book-keeping of the Proteome object.

Returns:

A dictionary that contains the unique track names and maps each name to its track type (either values or symbols).

Return type:

dict
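
The reason for returning a copy can be seen in a small stand-in. In the sketch below, TrackRegistry and the track names are hypothetical; the point is only that mutating the returned dictionary leaves the internal mapping untouched.

```python
class TrackRegistry:
    """Hypothetical stand-in holding a name -> type mapping."""

    def __init__(self):
        self._name_to_type = {"disorder": "values", "structure": "symbols"}

    def track_names_to_track_type(self):
        # dict(...) makes a shallow copy, so callers cannot alter internals
        return dict(self._name_to_type)


registry = TrackRegistry()
mapping = registry.track_names_to_track_type()
mapping["injected"] = "values"  # only the copy changes
internal_names = sorted(registry.track_names_to_track_type())
```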