Interfaces

Overview

Interfaces define functions that enable the reading or writing of data into or out of SHEPHARD proteomes. In particular, these functions operate by taking a Proteome object and then either annotating the proteins in that Proteome object with Tracks, Domains, Sites, or Attributes, OR writing Tracks, Domains, Sites or Attributes to file.

Example

By way of a simple example, here we use the Domains interface package.

from shephard.apis import fasta
from shephard import interfaces

# create a new Proteome from a FASTA file
small_proteome = fasta.fasta_to_proteome('sequences.fasta')

# use the interfaces package to annotate the Proteome object with the domains
# in the file `DNA_binding_domains.tsv`
interfaces.si_domains.add_domains_from_file(small_proteome, 'DNA_binding_domains.tsv')

By using interfaces, we ensure that, as long as you can write your data in a format that complies with a Track, Site, Domain, or Protein_attribute file, you can be sure it will be correctly read into SHEPHARD and is then accessible within the larger framework.

si_sites

Functions associated with the si_sites module enable the reading and writing of SHEPHARD Sites files.

add_sites_from_file(proteome, filename, delimiter='\t', return_dictionary=False, safe=True, skip_bad=True, verbose=True)[source]

Function that provides the user-facing interface for reading correctly configured SHEPHARD sites files and adding those sites to the proteins of interest.

A SHEPHARD sites file is a tab (or other) delimited file where each line has the following convention:

   1        2          3       4      5   [      6            7        ...     n         ]
Unique_ID position site_type symbol value [key_1:value_1 key_2:value_2 ... key_n:value_n ]

Each line has five required values and then can have as many key:value pairs as may be desired.
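
To make the column convention concrete, the following is a minimal sketch of how one line of a sites file could be split into its required leading columns and optional key:value attributes. The helper parse_site_line is hypothetical and written only for illustration; it is not part of SHEPHARD, and the casts to int and float as well as the choice to split on the first ':' only are assumptions of this sketch.

```python
def parse_site_line(line, delimiter='\t'):
    """Illustrative parser for one sites-file line (not SHEPHARD's own code)."""
    fields = line.rstrip('\n').split(delimiter)
    if len(fields) < 5:
        raise ValueError('a sites line needs at least five columns')

    unique_id, position, site_type, symbol, value = fields[:5]

    # Columns after the fifth are optional key:value attributes.
    # Assumption: split on the first ':' so values may contain ':' themselves.
    attributes = {}
    for pair in fields[5:]:
        key, separator, attribute_value = pair.partition(':')
        if not separator:
            raise ValueError(f'attribute {pair!r} is not a key:value pair')
        attributes[key] = attribute_value

    site = {
        'position': int(position),   # assumed cast, for illustration
        'site_type': site_type,
        'symbol': symbol,
        'value': float(value),       # assumed cast, for illustration
        'attributes': attributes,
    }
    return unique_id, site


# made-up example line
unique_id, site = parse_site_line('P12345\t42\tphosphosite\tS\t1.0\tsource:example')
print(unique_id, site)
```

The returned dictionary uses the same keys as the sites dictionaries accepted by add_sites_from_dictionary().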

Parameters:
  • proteome (Proteome) – Proteome object to which we’re adding sites. Note that ONLY sites for which a protein is found will be used. Protein-Site cross-referencing is done using the protein’s unique_ID which should be the key used in the sites_dictionary

  • filename (str) – Name of the shephard site file to be read

  • delimiter (str (default = '\t')) – String used as a delimiter on the input file.

  • return_dictionary (bool, default=False) – If set to true, this function will return the sites dictionary and will NOT add that dictionary to the proteome - i.e. the function basically becomes a parser for SHEPHARD-compliant sites files.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the site-adding process (i.e. after file parsing) are acted on. If set to False, exceptions simply mean the site in question is skipped. There are various reasons site addition could fail (e.g. the site falls outside of the protein sequence), so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • skip_bad (bool (default = True)) – Flag that means if bad lines (lines that trigger an exception) are encountered the code will just skip them. By default this is true, which adds a certain robustness to file parsing, but could also hide errors. Note that if lines are skipped a warning will be printed (regardless of verbose flag).

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding sites.

Returns:

If return_dictionary is set to False (default) then this function has no return value, but the sites are added to the Proteome object passed as the first argument. If return_dictionary is set to True the function returns the parsed sites dictionary without adding the newly-read sites to the proteome.

Return type:

None or dict
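
The skip_bad behaviour described above can be pictured with a short sketch. Both parse_lines and parse_position are hypothetical helpers invented for this illustration and are not SHEPHARD functions; the sketch only shows the general pattern of skipping a bad line with a warning versus re-raising the exception.

```python
def parse_lines(lines, parse_line, skip_bad=True):
    """Apply parse_line to every line; skip or re-raise on failure.

    Hypothetical helper for illustration only (not SHEPHARD's own code).
    """
    parsed = []
    for line_number, line in enumerate(lines, start=1):
        try:
            parsed.append(parse_line(line))
        except Exception as error:
            if not skip_bad:
                raise
            # A skipped line is always reported, mirroring the documented
            # behaviour that this warning does not depend on verbose.
            print(f'Warning: skipping line {line_number}: {error}')
    return parsed


def parse_position(line):
    """Toy parser: the second tab-separated column must be an integer."""
    return int(line.split('\t')[1])


lines = ['P1\t10', 'P1\tnot_a_number', 'P2\t7']
good_positions = parse_lines(lines, parse_position)
print(good_positions)
```

With skip_bad=False the same input would instead stop at the second line with a ValueError, which makes malformed files easier to spot.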

add_sites_from_dictionary(proteome, sites_dictionary, safe=True, verbose=False)[source]

Function that takes a correctly formatted Sites dictionary and will add those Sites to the proteins in the Proteome.

Sites dictionaries are key-value pairs, where the key is a unique_ID associated with a given Protein, and the value is a list of dictionaries. Each sub-dictionary has the following elements:

'position'   = site position
'site_type'  = site type
'symbol'     = site symbol
'value'      = site value
'attributes' = site attribute dictionary

In this way, each site that maps to a given unique_ID will be added to the associated protein. The use of a list of dictionaries (as opposed to a simple unique_ID:site_dictionary pairing) means multiple sites for a single protein can be added at once.
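
As an illustration of this layout, the sketch below builds a small sites dictionary by hand and checks that every sub-dictionary carries the listed keys. The identifiers, site types and values are made up, and missing_site_keys is a hypothetical helper, not a SHEPHARD function.

```python
# unique_ID -> list of site dictionaries (all entries are made-up examples)
sites_dictionary = {
    'P12345': [
        {'position': 42, 'site_type': 'phosphosite', 'symbol': 'S',
         'value': 1.0, 'attributes': {'source': 'example'}},
        {'position': 97, 'site_type': 'phosphosite', 'symbol': 'T',
         'value': 0.5, 'attributes': {}},
    ],
    'Q67890': [
        {'position': 5, 'site_type': 'mutation', 'symbol': 'A',
         'value': 0.0, 'attributes': {}},
    ],
}

REQUIRED_KEYS = {'position', 'site_type', 'symbol', 'value', 'attributes'}


def missing_site_keys(sites_dictionary):
    """Return a list of (unique_ID, site_index, missing_keys) problems."""
    problems = []
    for unique_id, sites in sites_dictionary.items():
        for index, site in enumerate(sites):
            missing = sorted(REQUIRED_KEYS - site.keys())
            if missing:
                problems.append((unique_id, index, missing))
    return problems


print(missing_site_keys(sites_dictionary))
```

A dictionary that passes this kind of check has the shape described above and could then be handed to add_sites_from_dictionary() together with a Proteome.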

Parameters:
  • proteome (Proteome) – Proteome object to which we’re adding sites. Note that ONLY sites for which a protein is found will be used. Protein:Site cross-referencing is done using the protein’s unique_ID which should be the key used in the sites_dictionary

  • sites_dictionary (dict) –

    A sites dictionary (defined above) is a dictionary that maps a unique_ID back to a list of dictionaries, where each sub-dictionary has five elements, described above.

    Recall the only type-specific values (position and value) are cast automatically when a site is added by the Protein object, so there is no need to do that in this function too.

    Extra key-value pairs in each sub-dictionary are ignored.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the site-adding process are acted on. If set to False, exceptions simply mean the site in question is skipped. There are various reasons site addition could fail (notably the position of the site is outside of the protein limits) and so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • verbose (bool (default = False)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding sites.

Returns:

No return value, but adds all of the passed sites to the protein

Return type:

None

write_sites(proteome, filename, delimiter='\t', site_types=None)[source]

Function that writes out sites to file in a standardized format. Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

If a site_types list is provided, only site_types that match to strings in this list are written out.
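
The two points above (filtering by site type and converting attributes to strings) can be sketched with plain dictionaries standing in for Site objects. The helpers select_sites and format_site_line are hypothetical and only mimic the documented behaviour; they are not the code write_sites() uses.

```python
def select_sites(sites, site_types=None):
    """Keep every site, or only those whose type is listed in site_types."""
    if site_types is None:
        return list(sites)
    return [site for site in sites if site['site_type'] in site_types]


def format_site_line(unique_id, site, delimiter='\t'):
    """Build one sites-file line; attribute values are passed through str()."""
    fields = [unique_id, str(site['position']), site['site_type'],
              site['symbol'], str(site['value'])]
    for key, value in site['attributes'].items():
        fields.append(f'{key}:{value}')
    return delimiter.join(fields)


# made-up sites; the second attribute is a list to show the string conversion
sites = [
    {'position': 42, 'site_type': 'phosphosite', 'symbol': 'S', 'value': 1.0,
     'attributes': {'source': 'example', 'peptides': [1, 2]}},
    {'position': 7, 'site_type': 'mutation', 'symbol': 'A', 'value': 0.0,
     'attributes': {}},
]

selected = select_sites(sites, site_types=['phosphosite'])
line = format_site_line('P12345', selected[0])
print(line)
```

The list stored under 'peptides' is written as the text "[1, 2]", so reading the file back gives a string rather than a list; this is the limitation for complex objects mentioned above.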

Parameters:
  • proteome (Proteome) – Proteome object from which the sites will be extracted

  • filename (str) – Filename that will be used to write the new sites file

  • site_types (list of str (default = None)) – If provided, only sites whose site type matches one of the strings in this list are written out.

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is the tab character (’\t’), which is recommended to maintain compliance with default SHEPHARD file-reading functions.

Returns:

No return value, but generates a new file containing the sites from this proteome (all sites, or only those matching site_types if it is provided).

Return type:

None

write_sites_from_list(site_list, filename, delimiter='\t')[source]

Function that writes out sites to a SHEPHARD sites file from a list of Site objects. Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

Parameters:
  • site_list (List of Site objects) – List of site objects which will be written

  • filename (str) – Filename that will be used to write the new sites file

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’ which is recommended to maintain compliance with default add_sites_from_file() function

Returns:

No return value, but generates a new file with the sites in site_list written to disk.

Return type:

None

si_domains

Functions associated with the si_domains module enable the reading and writing of SHEPHARD Domains files.

add_domains_from_file(proteome, filename, delimiter='\t', autoname=False, return_dictionary=False, safe=True, skip_bad=True, verbose=True)[source]

Function that takes a correctly formatted shephard ‘domains’ file and reads all domains into the passed Proteome.

Expect Domain files to have the following format:

One domain per line, with the format:

    1       2    3       4            5            6                   n
Unique_ID start stop domain_type key_1:value_1 key_2:value_2 ... key_n:value_n

A couple of key points here:

  • The default delimiter is tabs (’\t’) but this can be changed with the delimiter argument.

  • The first four elements in each line are required, while all of the key:value pairs are optional

  • Attribute key-value pairs must be separated by a : character. As a result any column delimiter (other than :) can be used, but : is reserved for this role
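
A minimal sketch of these rules, using a comma as an alternative column delimiter, is shown below. The helper parse_domain_line is hypothetical and not part of SHEPHARD; the int casts and the explicit rejection of ':' as a column delimiter are assumptions made for this illustration.

```python
def parse_domain_line(line, delimiter='\t'):
    """Illustrative parser for one domains-file line (not SHEPHARD's own code)."""
    if ':' in delimiter:
        raise ValueError("':' is reserved for key:value attribute pairs")

    fields = line.rstrip('\n').split(delimiter)
    if len(fields) < 4:
        raise ValueError('a domains line needs at least four columns')

    unique_id, start, end, domain_type = fields[:4]

    # Anything after the fourth column is an optional key:value attribute.
    attributes = {}
    for pair in fields[4:]:
        key, separator, value = pair.partition(':')
        if not separator:
            raise ValueError(f'attribute {pair!r} is not a key:value pair')
        attributes[key] = value

    domain = {
        'start': int(start),
        'end': int(end),
        'domain_type': domain_type,
        'attributes': attributes,
    }
    return unique_id, domain


# made-up, comma-delimited example line
domain_id, domain = parse_domain_line('P12345,10,85,DNA_binding,source:example',
                                      delimiter=',')
print(domain_id, domain)
```

The resulting dictionary has the same keys as the sub-dictionaries expected by add_domains_from_dictionary().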

Parameters:
  • proteome (shephard.proteome.Proteome) – Proteome object to which domains will be added

  • filename (str) – Name of the shephard domains file to read

  • delimiter (str (default = '\t')) – String used as a delimiter on the input file.

  • autoname (bool (default = False)) – If autoname is set to True, this function ensures each domain ALWAYS has a unique name - i.e. this allows multiple domains to overlap perfectly in position and type. This is generally not going to be required and/or make sense, but having this feature in place is useful. In general we want to avoid this as it makes it easy to include duplicates, which by default are prevented when autoname = False.

  • return_dictionary (bool, default=False) – If set to true, this function will return the domains dictionary and will NOT add that dictionary to the proteome - i.e. the function basically becomes a parser for SHEPHARD-compliant domains files.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the domain-adding process (i.e. after file parsing) are acted on. If set to False, exceptions simply mean the domain in question is skipped. Note if set to False, pre-existing domains with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True. There are various reasons domain addition could fail (start/end position outside of the protein limits etc.) and so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • skip_bad (bool (default = True)) – Flag that means if bad lines (lines that trigger an exception) are encountered the code will just skip them. By default this is true, which adds a certain robustness to file parsing, but could also hide errors. Note that if lines are skipped a warning will be printed (regardless of verbose flag).

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding domains.

Returns:

If return_dictionary is set to False (default) then this function has no return value, but the domains are added to the Proteome object passed as the first argument. If return_dictionary is set to True the function returns the parsed domains dictionary without adding the newly-read domains to the proteome.

Return type:

None or dict

add_domains_from_dictionary(proteome, domain_dictionary, autoname=False, safe=True, verbose=True)[source]

Function that takes a correctly formatted Domains dictionary and will add those domains to the proteins in the Proteome.

Domains dictionaries are key-value pairs, where the key is a unique_ID associated with a given protein, and the value is a list of dictionaries. Each subdictionary has four key-value pairs:

* 'start' = start position (int showing start of the domain, starting at 1)

* 'end' = end position (int showing end of the domain, inclusive)

* 'domain_type' = domain type (string that names the domain)

* 'attributes' = dictionary of arbitrary key:value pairings (optional)

The start and end positions should be locations within the sequence defined by the unique_ID, and if they are out of the sequence bounds this will throw an exception. Domain type is a string that names the type of domain. The attributes dictionary is an arbitrary key-value pair dictionary where key-values map an arbitrary key to an arbitrary value (read in as strings).

In this way, each domain that maps to a given unique_ID will be added. Note the attributes dictionary is optional.
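
The coordinate convention (start counted from 1, end inclusive) is easy to get wrong, so here is a small sketch that builds a domains dictionary and extracts the residues each entry refers to. The sequence, identifiers and domain types are made up, domain_sequence is a hypothetical helper rather than a SHEPHARD function, and the start <= end requirement is an assumption of this sketch.

```python
def domain_sequence(sequence, domain):
    """Return the residues covered by a 1-indexed, end-inclusive domain."""
    start, end = domain['start'], domain['end']
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f'domain {start}-{end} is outside a sequence of '
                         f'length {len(sequence)}')
    # Convert the 1-indexed inclusive range to a Python slice.
    return sequence[start - 1:end]


sequences = {'P12345': 'MKTAYIAKQR'}   # made-up 10-residue sequence

domain_dictionary = {
    'P12345': [
        {'start': 2, 'end': 5, 'domain_type': 'example_domain',
         'attributes': {'source': 'manual'}},
        # 'attributes' is optional, so it is omitted here
        {'start': 8, 'end': 10, 'domain_type': 'example_domain'},
    ],
}

for unique_id, domains in domain_dictionary.items():
    for domain in domains:
        print(unique_id, domain['domain_type'],
              domain_sequence(sequences[unique_id], domain))
```

A domain with end=11 on this 10-residue sequence would fail the bounds check, which corresponds to the out-of-bounds exception described above.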

Parameters:
  • proteome (Proteome object) – Proteome object to which domains will be added

  • domain_dictionary (dict) – Dictionary that maps unique_IDs to a list of one or more domain dictionaries

  • autoname (bool (default = False)) – If autoname is set to True, this function ensures each domain ALWAYS has a unique name - i.e. this allows multiple domains to overlap perfectly in position and type. This is generally not going to be required and/or make sense, but having this feature in place is useful. In general we want to avoid this as it makes it easy to include duplicates, which by default are prevented when autoname = False.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the Domain-adding process are acted on. If set to False, exceptions simply mean the domain in question is skipped. Note if set to False, pre-existing Domains with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True. There are various reasons Domain addition could fail (start/end position outside of the protein limits etc.) and so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding domains.

Returns:

No return value, but domains are added to the Proteome object passed as the first argument.

Return type:

None

add_domain_attributes_from_file(proteome, filename, delimiter='\t', safe=True, add_new=True, skip_bad=True, verbose=True)[source]

Function that takes a correctly formatted domain attributes file and reads all domain attributes, adding them to domains in the passed proteome. If new domains are included, the add_new flag determines whether they are added.

The function expects domain attribute files to have the following format:

One domain defined per line (although the same protein can appear multiple times):

Unique_ID,  domain_name, key1:value1, key2:value2, ..., keyn:valuen

A couple of key points here:

  • The default delimiter is tabs (’\t’) but this can be changed with the delimiter argument

  • Key value must be separated by a ‘:’, as a result, any delimiter (other than ‘:’) can be used, but ‘:’ is reserved for this role.

Parameters:
  • proteome (Proteome Object) – Proteome object to which attributes will be added

  • filename (str) – Name of the shephard domain attributes file to read

  • delimiter (str (default = '\t')) – String used as a delimiter on the input file.

  • add_new (boolean (default = True)) –

    If set to True then any new found domains are added to their associated protein. If False any unfound domains are not added and are skipped over.

    If a new domain is passed that does not have an associated protein in the passed proteome an exception will always be raised regardless of the status of this parameter.

  • safe (bool (default = True)) –

    If set to True then any exceptions raised during the attribute-adding process are acted on. If set to False, exceptions simply mean the domain attribute in question is skipped. Note if set to False, pre-existing domain attributes with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True.

    The only reason attribute addition could fail is if the attribute already exists, so this is effectively a flag to define if pre-existing attributes should be overwritten (False) or not (True).

  • skip_bad (bool (default = True)) – Flag that means if bad lines (lines that trigger an exception) are encountered the code will just skip them. By default this is true, which adds a certain robustness to file parsing, but could also hide errors. Note that if lines are skipped a warning will be printed (regardless of verbose flag).

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding attributes.

Returns:

If return_dictionary is set to False (default) then this function has no return value, but the protein_attributes are added to the Proteome object passed as the first argument. If return_dictionary is set to True the function returns the parsed domains_dictionary without adding the newly-read protein_attributes to the proteome.

Return type:

None or dict

add_domain_attributes_from_dictionary(proteome, domain_dictionary, add_new=True, safe=True, verbose=True)[source]

Function that takes a correctly formatted Domains dictionary and will add the associated attributes to the domains of the proteins in the Proteome.

Domains dictionaries are key-value pairs, where the key is a unique_ID associated with a given protein, and the value is a list of dictionaries. Each subdictionary has the following key-value pairs:

  • ‘protein’ = the unique_ID of the protein with which the domain is associated

  • ‘domain_name’ = domain type (string that names the domain)

  • ‘attributes’ = dictionary of arbitrary key:value pairings (optional)

The start and end positions should be locations within the sequence defined by the unique_ID, and if they are out of the sequence bounds this will throw an exception. Domain type is a string that names the type of domain. The attributes dictionary is an arbitrary key-value pair dictionary where key-values map an arbitrary key to an arbitrary value (read in as strings).

In this way, each domain that maps to a given unique_ID will be added. Note the attributes dictionary is optional.

Parameters:
  • proteome (Proteome object) – Proteome object to which domains will be added

  • domain_dictionary (dict) – Dictionary that maps unique_IDs to a list of one or more domain dictionaries.

  • add_new (boolean (default = True)) – If set to True then any new found domains are added to their associated protein. If False any unfound domains are not added and are skipped over. If a new domain is passed that does not have an associated protein in the passed proteome an exception will always be raised regardless of the status of this parameter.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the Domain-adding process are acted on. If set to False, exceptions simply mean the domain in question is skipped. Note if set to False, pre-existing Domains with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True. There are various reasons Domain addition could fail (start/end position outside of the protein limits etc.) and so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding domains.

Returns:

No return value, but domains are added to the Proteome object passed as the first argument.

Return type:

None

write_domains(proteome, filename, delimiter='\t', domain_types=None)[source]

Function that writes out domains to a SHEPHARD domains file. Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

Parameters:
  • proteome (Proteome object) – Proteome object from which the domains will be extracted

  • filename (str) – Filename that will be used to write the new domains file

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_domains_from_file() function.

  • domain_types (list (default None)) – Lets you define a list of one or more domain types that will be written out. Domain types are passed as strings which should map to named domain types in the Proteome.

Returns:

No return value, but generates a new file containing the domains from this proteome (all domains, or only those matching domain_types if it is provided).

Return type:

None

write_domains_from_list(domain_list, filename, delimiter='\t')[source]

Function that writes out domains to a SHEPHARD domains file from a list of Domain objects. Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

Parameters:
  • domain_list (List of Domain objects) – List of domain objects which will be written

  • filename (str) – Filename that will be used to write the new domains file

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’ which is recommended to maintain compliance with default add_domains_from_file() function

Returns:

No return value, but generates a new file with the domains in domain_list written to disk.

Return type:

None

si_tracks

Functions associated with the si_tracks module enable the reading and writing of SHEPHARD Tracks files.

add_tracks_from_dictionary(proteome, tracks_dictionary, mode, safe=True, verbose=True)[source]

Function that takes a correctly formatted Tracks dictionary and will add those Tracks to the proteins in the Proteome.

Track dictionaries are key-value pairs, where the key is a unique ID and the value is a list of dictionaries. For each sub-dictionary, there are two key-value pairs that reflect:

  • ‘track_name’ : name of the track (str)

  • ‘track_data’ : parsed list of floats (if expecting values) or strings (if expecting symbols) that should equal the length of the associated protein.
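
The sketch below builds a small tracks dictionary in this layout and checks the one structural requirement stated above, namely that each track has one entry per residue. The sequence and track values are made up, and length_mismatches is a hypothetical helper, not a SHEPHARD function.

```python
sequences = {'P12345': 'MKTAY'}   # made-up 5-residue sequence

# unique ID -> list of {'track_name': ..., 'track_data': [...]} dictionaries
tracks_dictionary = {
    'P12345': [
        {'track_name': 'example_score',
         'track_data': [0.1, 0.4, 0.9, 0.3, 0.2]},
        {'track_name': 'too_short',
         'track_data': [0.1, 0.4]},
    ],
}


def length_mismatches(tracks_dictionary, sequences):
    """Return (unique_ID, track_name, n_data, n_residues) for bad tracks."""
    problems = []
    for unique_id, tracks in tracks_dictionary.items():
        n_residues = len(sequences[unique_id])
        for track in tracks:
            n_data = len(track['track_data'])
            if n_data != n_residues:
                problems.append((unique_id, track['track_name'],
                                 n_data, n_residues))
    return problems


mismatches = length_mismatches(tracks_dictionary, sequences)
print(mismatches)
```

Here only the 'too_short' track is reported, because it supplies two values for a five-residue protein.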

Parameters:
  • proteome (Proteome Object) – Proteome object which tracks will be added to

  • tracks_dictionary (dict) –

    Dictionary in which keys are unique IDs for proteins and the value is a list of dictionaries, where each subdictionary has the two key-value pairs:

    • track_name : name of the track (str)

    • track_data : parsed list of floats (if expecting values) or strings (if expecting symbols) that should equal the length of the associated protein.

  • mode (str) – A selector that defines the type of track being added. Must be either ‘symbols’ or ‘values’.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the track-adding process are acted on. If set to False, exceptions simply mean the Track in question is skipped. Note if set to False, pre-existing Tracks with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True. There are various reasons Track addition could fail (length does not match the protein etc.) and so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • verbose (boolean (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding tracks.

Returns:

No return value, but tracks are added to the Proteome object passed as the first argument.

Return type:

None

add_tracks_from_file(proteome, filename, mode, delimiter='\t', return_dictionary=False, safe=True, skip_bad=True, verbose=True)[source]

Function that takes a correctly formatted shephard ‘tracks’ file and reads all Tracks into the passed Proteome.

Expect Track files to have the following format:

One protein per line, where each protein has the following information:

>>> Unique_ID    track_name    res1    res2    res3 .... resn

Where res1, res2, resn are symbols or values to be mapped to the 1st, 2nd, or nth residue. There should be the same number of res1, res2, … resn entries as there are residues in the associated protein.

A couple of key points here:

  • The default delimiter is tabs (’\t’) but this can be changed with the delimiter argument

  • Each track must assign a value or a symbol to EVERY residue in the protein
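
To illustrate how the mode selector changes the interpretation of the same file layout, the following sketch parses one tracks-file line either as values or as symbols. The helper parse_track_line is hypothetical and not part of SHEPHARD; converting values with float() is an assumption of this sketch.

```python
def parse_track_line(line, mode, delimiter='\t'):
    """Illustrative parser for one tracks-file line (not SHEPHARD's own code)."""
    if mode not in ('values', 'symbols'):
        raise ValueError("mode must be either 'symbols' or 'values'")

    fields = line.rstrip('\n').split(delimiter)
    if len(fields) < 3:
        raise ValueError('a track line needs an ID, a name and per-residue data')

    unique_id, track_name, data = fields[0], fields[1], fields[2:]

    # Values tracks are numeric; symbols tracks are kept as strings.
    if mode == 'values':
        data = [float(entry) for entry in data]

    return unique_id, {'track_name': track_name, 'track_data': data}


# made-up lines for a three-residue protein
values_track = parse_track_line('P12345\tdisorder\t0.1\t0.9\t0.5', mode='values')
symbols_track = parse_track_line('P12345\tsecondary_structure\tH\tH\tC',
                                 mode='symbols')
print(values_track)
print(symbols_track)
```

The per-protein dictionaries produced this way match the layout expected by add_tracks_from_dictionary().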

Parameters:
  • proteome (Proteome) – Proteome object

  • filename (str) – Name of the shephard Tracks file to read

  • mode (str) – A selector that defines the type of track file to be read. Must be either ‘symbols’ or ‘values’.

  • delimiter (str (default = '\t')) – String used as a delimiter on the input file.

  • return_dictionary (bool (default = False)) – If set to true, this function will return the tracks dictionary and will NOT add that dictionary to the Proteome - i.e. the function basically becomes a parser for SHEPHARD-compliant tracks files.

  • safe (bool (default = True)) – If set to True then any exceptions raised during the Track-adding process (i.e. after file parsing) are acted on. If set to False, exceptions simply mean the track in question is skipped. Note if set to False, pre-existing tracks with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True. There are various reasons track addition could fail (e.g. the track does not match the length of the protein), so if verbose=True then the cause of an exception is also printed to screen. It is highly recommended that if you choose to use safe=False you also set verbose=True.

  • skip_bad (bool (default = True)) – Flag that means if bad lines (lines that trigger an exception) are encountered the code will just skip them. By default this is true, which adds a certain robustness to file parsing, but could also hide errors. Note that if lines are skipped a warning will be printed (regardless of verbose flag).

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding tracks.

Returns:

If return_dictionary is set to False (default) then this function has no return value, but the tracks are added to the Proteome object passed as the first argument. If return_dictionary is set to True the function returns the parsed tracks dictionary without adding the newly-read tracks to the proteome.

Return type:

None or dict

write_all_tracks_separate_files(proteome, outdirectory='.', value_fmt='%.3f', delimiter='\t')[source]

Function that writes all tracks associated with a proteome out to separate files. This may be preferable in some situations, but in others maybe only a subset of tracks are requested, for which write_tracks() would be good, or alternatively you want all tracks in a single file, in which case write_all_values_tracks_single_file() or write_all_symbols_tracks_single_file() would be the way to go.

The output filenames are defined as:

> shephard_track_<trackname>.tsv

and are written to the outdirectory.
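
As a small illustration of the naming pattern, the sketch below composes the output path for a given track name. The helper track_filename is hypothetical, and joining the directory and filename with os.path.join is an assumption about how the path is built.

```python
import os


def track_filename(track_name, outdirectory='.'):
    """Compose the per-track output path following the documented pattern."""
    return os.path.join(outdirectory, f'shephard_track_{track_name}.tsv')


# made-up track names
for name in ['disorder', 'hydropathy']:
    print(track_filename(name, outdirectory='tracks_out'))
```

Each unique track name therefore maps to exactly one file in the output directory.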

Because track files MUST be written as one per track_name, this function is equivalent to cycling through each unique track name and writing it out sequentially.

Parameters:
  • proteome (Proteome object) – Proteome object from which the tracks will be extracted

  • outdirectory (str (default = '.')) – String that defines the output directory. By default sets to the present working directory.

  • value_fmt (str (default = "%.3f")) – Format string that will be used for values. Default = “%.3f”

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_tracks_from_file() function.

Returns:

No return value, but generates one new file per track name, which together contain the complete set of tracks from this Proteome.

Return type:

None

write_all_values_tracks_single_file(proteome, outfile, value_fmt='%.3f', delimiter='\t')[source]

Function that writes all values tracks associated with a Proteome out to a single file. This may be preferable in some situations, but in others maybe only a subset of tracks are requested, for which write_tracks() would be good, or alternatively you want all tracks in separate files, in which case write_all_tracks_separate_files() would be the way to go.

Parameters:
  • proteome (Proteome object) – Proteome object from which the tracks will be extracted

  • outfile (str) – String that defines the name of the output file.

  • value_fmt (str (default = "%.3f")) – Format string that will be used for values. Default = “%.3f”

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_tracks_from_file() function.

Returns:

No return type, but generates a new file with the complete set of tracks from this Proteome written to disk.

Return type:

None

write_all_symbols_tracks_single_file(proteome, outfile, value_fmt='%.3f', delimiter='\t')[source]

Function that writes all symbols tracks associated with a Proteome out to a single file. This may be preferable in some situations, but in others maybe only a subset of tracks are requested, for which write_tracks() would be good, or alternatively you want all tracks in separate files, in which case write_all_tracks_separate_files() would be the way to go.

Parameters:
  • proteome (Proteome object) – Proteome object from which the tracks will be extracted

  • outfile (str) – String that defines the name of the output file.

  • value_fmt (str (default = "%.3f")) – Format string that will be used for values.

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_tracks_from_file() function.

Returns:

No return type, but generates a new file with the complete set of tracks from this proteome written to disk.

Return type:

None

write_tracks(proteome, filename, track_name, value_fmt='%.3f', delimiter='\t', file_handle=None)[source]

Function that writes out a specific track to file in a standardized format. Note that because track files are inevitably quite big, the default behaviour is to only write out a single track file at a time (i.e. unlike write_domains or write_sites, where ALL domains or all sites are - by default - written out, here ONLY a single type of track, defined by track_name, can be written).

To write ALL the tracks from a Proteome, see si_tracks.write_all_tracks_separate_files().

Parameters:
  • proteome (Proteome object) – Proteome object from which the tracks will be extracted

  • filename (str) – Filename that will be used to write the new tracks file

  • track_name (str) – Name of the track to be written out.

  • value_fmt (str (default = "%.3f")) – Format string that will be used for values. Default = “%.3f”. Note that this is not a smart value, so if the actual values used mean that %.3f loses all meaning this will not trigger a warning, so be careful!

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_tracks_from_file() function.

  • file_handle (filehandle (_io.TextIOWrapper) or None) – If passed, output is written to this handle rather than to a new file. The filename variable is ignored in this case.
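
The warning attached to value_fmt is easiest to see with a few numbers. The sketch below applies the default format string with Python's standard % formatting to made-up values; it illustrates only the formatting step and assumes nothing about how write_tracks() is implemented beyond the documented format string.

```python
value_fmt = '%.3f'

# made-up track values, including one far smaller than three decimals can show
values = [0.5, 12.34567, 0.00004]

formatted = [value_fmt % value for value in values]
print(formatted)    # ['0.500', '12.346', '0.000']

# A scientific-notation format keeps the small value meaningful.
scientific = '%.2e' % 0.00004
print(scientific)   # 4.00e-05
```

The third value is written as 0.000 without any warning, so for tracks with very small magnitudes a format such as "%.2e" may be a better choice.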

Returns:

No return value, but generates a new file (or writes to the passed file handle) with the requested track from this Proteome.

Return type:

None

write_tracks_from_list(track_list, filename, value_fmt='%.3f', delimiter='\t')[source]

Function that writes out tracks to a SHEPHARD tracks file from a list of Track objects.

Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

Note also that a single track file cannot have both values and symbols tracks, and this is checked first.

Parameters:
  • track_list (List of Track objects) – List of track objects which will be written

  • filename (str) – Filename that will be used to write the new tracks file

  • value_fmt (str (default = "%.3f")) – Format string that will be used for values. Default = “%.3f”. Note that this is not a smart value, so if the actual values are such that %.3f loses all meaning this will not trigger a warning. Be careful!

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_tracks_from_file() function.

Returns:

No return type, but generates a new file with the complete set of tracks from the passed list written to disk.

Return type:

None
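
The rule that a single track file cannot mix values and symbols tracks can be sketched as a small pre-check. This is an illustration only and does not reproduce SHEPHARD's own check: DemoTrack is a hypothetical stand-in for a Track object, and track_list_kind is a hypothetical helper. It assumes that each track carries either a list of values or a list of symbols, but not both.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DemoTrack:
    """Stand-in for a Track object (hypothetical, for illustration only)."""
    name: str
    values: Optional[List[float]] = None
    symbols: Optional[List[str]] = None


def track_list_kind(track_list):
    """Return 'values' or 'symbols' if every track is of that one kind.

    Raises ValueError if the list is empty or mixes the two kinds.
    """
    kinds = {'values' if t.values is not None else 'symbols' for t in track_list}
    if len(kinds) != 1:
        raise ValueError('track list must hold only values tracks or only symbols tracks')
    return kinds.pop()


disorder = DemoTrack('disorder', values=[0.1, 0.8, 0.9])
helicity = DemoTrack('helicity', values=[0.7, 0.2, 0.1])
structure = DemoTrack('secondary_structure', symbols=['H', 'C', 'C'])

# A homogeneous list passes the check
print(track_list_kind([disorder, helicity]))

# A mixed list is rejected before anything would be written
try:
    track_list_kind([disorder, structure])
except ValueError as error:
    print('rejected:', error)
```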

si_protein_attributes

Functions associated with the si_protein_attributes module enable the reading and writing of SHEPHARD Protein attribute files.

add_protein_attributes_from_dictionary(proteome, protein_attribute_dictionary, safe=True, verbose=True)[source]

Function that takes a correctly formatted protein_attribute dictionary and will add those attributes to the proteins in the Proteome.

Protein attribute dictionaries are key-value pairs, where the key is a unique ID and the value is a list of dictionaries. For each sub-dictionary, the key-value pair reflects the attribute key-value pairing.

Parameters:
  • proteome (Proteome Object) – Proteome object to which attributes will be added

  • protein_attribute_dictionary (dict) – Dictionary that defines protein attributes. This is slightly confusing, but the keys of this dictionary are unique protein IDs and each value is a list of dictionaries. Each of THOSE sub-dictionaries has one (or more) key:value pairs that will be associated with the protein of interest.

  • safe (boolean (default = True)) –

    If set to True then any exceptions raised during the process of adding a protein_attribute are re-raised. If set to False, exceptions simply mean the protein_attribute in question is skipped. Note that if set to False, pre-existing protein_attributes with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True. Default = True.

    The only reason protein attribute addition could fail is if the attribute already exists, so this is effectively a flag to define if pre-existing attributes should be overwritten (False) or not (True).

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding attributes.

Returns:

No return value, but attributes are added to proteins in the Proteome object passed as the first argument

Return type:

None
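
Because the nested structure described above is easy to get wrong, here is a small sketch of a dictionary in that shape. The unique IDs and attribute names are made up for illustration, and flatten_attributes is a hypothetical helper for inspecting the structure, not a SHEPHARD function.

```python
# Keys are protein unique IDs. Each value is a list of dictionaries, and
# each sub-dictionary holds one or more attribute key:value pairs.
# The IDs and attribute names below are invented for this example.
protein_attribute_dictionary = {
    'P00001': [{'localization': 'nucleus'}, {'essential': 'True'}],
    'P00002': [{'localization': 'cytoplasm', 'copy_number': '1200'}],
}


def flatten_attributes(attribute_dictionary):
    """Collapse each protein's list of sub-dictionaries into one dict.

    Hypothetical helper, not part of SHEPHARD. If a key is repeated for
    the same protein, the last value seen wins.
    """
    flat = {}
    for unique_id, sub_dictionaries in attribute_dictionary.items():
        merged = {}
        for sub_dictionary in sub_dictionaries:
            merged.update(sub_dictionary)
        flat[unique_id] = merged
    return flat


print(flatten_attributes(protein_attribute_dictionary))
```

Following the signature documented above, a dictionary of this shape would be passed as the second argument, e.g. add_protein_attributes_from_dictionary(proteome, protein_attribute_dictionary).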

add_protein_attributes_from_file(proteome, filename, delimiter='\t', return_dictionary=False, safe=True, skip_bad=True, verbose=True)[source]

Function that takes a correctly formatted ‘protein attributes’ file and reads all attributes into the proteins in the passed proteome.

The function expects protein attribute files to have the following format:

One protein defined per line (although the same protein can appear multiple times)

>>> Unique_ID, key1:value1, key2:value2, ..., keyn:valuen

A couple of key points here:

  • The default delimiter is a tab (‘\t’) but this can be changed with the delimiter argument

  • Keys and values must be separated by a ‘:’. As a result any delimiter other than ‘:’ can be used, but ‘:’ is reserved for this role

Parameters:
  • proteome (Proteome Object) – Proteome object to which attributes will be added.

  • filename (str) – Name of the shephard protein attributes file to read.

  • delimiter (str (default = '\t')) – String used as a delimiter on the input file.

  • return_dictionary (bool (default = False)) – If set to True, this function will return the protein_attributes dictionary and will NOT add that dictionary to the proteome - i.e. the function basically becomes a parser for SHEPHARD-compliant protein_attributes files.

  • safe (bool (default = True)) –

    If set to True then any exceptions raised during the protein_attribute-adding process are acted on. If set to False, exceptions simply mean the protein_attribute in question is skipped. Note that if set to False, pre-existing protein_attributes with the same name would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True.

    The only reason protein attribute addition could fail is if the attribute already exists, so this is effectively a flag to define if pre-existing attributes should be overwritten (False) or not (True).

  • skip_bad (bool (default = True)) – Flag that means if bad lines (lines that trigger an exception) are encountered the code will just skip them. By default this is true, which adds a certain robustness to file parsing, but could also hide errors. Note that if lines are skipped a warning will be printed (regardless of verbose flag). skip_bad exclusively influences the file-reading part of the process.

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding attributes.

Returns:

If return_dictionary is set to False (default) then this function has no return value, but the protein_attributes are added to the Proteome object passed as the first argument. If return_dictionary is set to True the function returns the parsed protein_attributes dictionary without adding the newly-read protein_attributes to the proteome.

Return type:

None or dict
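
To make the file format concrete, the following is a minimal sketch of a parser for lines in the layout described above. It is not SHEPHARD's implementation: parse_protein_attribute_lines is a hypothetical helper, and splitting each field on the first ‘:’ only is an assumption made here. The returned structure mirrors the protein attribute dictionary (unique ID mapped to a list of dictionaries).

```python
def parse_protein_attribute_lines(lines, delimiter='\t'):
    """Parse protein attribute lines into {unique_ID: [dict, ...]}.

    Hypothetical helper for illustration. Each non-empty line contributes
    one sub-dictionary, so a protein that appears on several lines ends
    up with several sub-dictionaries. A field without ':' raises
    ValueError, which is this sketch's notion of a bad line.
    """
    parsed = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(delimiter)
        unique_id = fields[0].strip()
        attributes = {}
        for field in fields[1:]:
            # Assumption: split on the first ':' only
            key, value = field.split(':', 1)
            attributes[key.strip()] = value.strip()
        parsed.setdefault(unique_id, []).append(attributes)
    return parsed


example_lines = [
    'P00001\tlocalization:nucleus\tessential:True',
    'P00002\tlocalization:cytoplasm',
    'P00001\tcopy_number:1200',
]

print(parse_protein_attribute_lines(example_lines))
```

Note that every value comes back as a string, which is consistent with attributes being written out as strings.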

write_protein_attributes(proteome, filename, delimiter='\t')[source]

Function that writes out protein attributes to file in a standardized format. Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

Parameters:
  • proteome (Proteome object) – Proteome object from which the protein attributes will be extracted

  • filename (str) – Filename that will be used to write the new protein attributes file

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_protein_attributes_from_file() function.

Returns:

No return type, but generates a new file with the complete set of protein attributes from this proteome written to disk.

Return type:

None

si_proteins

Functions associated with the si_proteins module enable the reading and writing of SHEPHARD Protein files. While we include this for completeness, our general recommendation is to use FASTA files for protein information, and then write protein attributes out as separate protein attributes files. This keeps both the protein sequence information and the protein annotation information easy to read.

add_proteins_from_dictionary(proteome, protein_dictionary, safe=True, verbose=True)[source]

Function that takes a correctly formatted protein dictionary and will add those proteins to the Proteome.

Protein dictionaries are key-value pairs, where the key is a unique ID and the value is itself a dictionary which has the following keys:

  • name - Protein name (uncontrolled vocabulary, but should be a string)

  • sequence - Amino acid sequence for the protein (note that no sanity checking is done here)

  • attributes - Dictionary of arbitrary key:value pairings (optional)

Parameters:
  • proteome (Proteome) – Proteome object to which proteins will be added

  • protein_dictionary (dict) – Dictionary that defines proteins. The keys of this dictionary are unique protein IDs and each value is itself a dictionary containing the key-value pairs described above.

  • safe (bool (default = True)) –

    If set to True then any exceptions raised during the protein-adding process are acted on. If set to False, exceptions simply mean the protein in question is skipped. Note that if set to False, pre-existing proteins with the same unique ID would be silently overwritten (although this is not considered an error), while overwriting will trigger an exception if safe=True.

    The only reason protein addition could fail is if the protein already exists, so this is effectively a flag to define if pre-existing proteins should be overwritten (False) or not (True). Default = True.

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding proteins.

Returns:

No return value, but proteins are added to the Proteome object passed as the first argument.

Return type:

None
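
As a sketch of the structure described above, where each value is itself a dictionary with name, sequence and (optionally) attributes keys, the example below builds a small protein dictionary and checks it for missing keys. The IDs, names and sequences are invented, and missing_required_keys is a hypothetical helper rather than part of SHEPHARD.

```python
# Keys are unique protein IDs. Each value is a dictionary with 'name',
# 'sequence' and, optionally, 'attributes'. All entries are made up.
protein_dictionary = {
    'P00001': {
        'name': 'Example protein 1',
        'sequence': 'MKTAYIAKQR',
        'attributes': {'localization': 'nucleus'},
    },
    'P00002': {
        'name': 'Example protein 2',
        'sequence': 'MSDNELLK',
        # 'attributes' is optional and omitted here
    },
}

REQUIRED_KEYS = ('name', 'sequence')


def missing_required_keys(dictionary):
    """Return {unique_ID: [missing keys]} for incomplete entries.

    Hypothetical helper for illustration. An empty result means every
    entry has both 'name' and 'sequence'.
    """
    problems = {}
    for unique_id, entry in dictionary.items():
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems[unique_id] = missing
    return problems


print(missing_required_keys(protein_dictionary))
```

Since no sanity checking is done on the sequence itself, a pre-check of this kind only confirms that the expected keys are present.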

add_proteins_from_file(proteome, filename, delimiter='\t', return_dictionary=False, safe=True, skip_bad=True, verbose=True)[source]

Function that takes a correctly formatted ‘protein’ file and reads every protein into the passed proteome.

The function expects protein files to have the following format:

>>> Unique_ID name sequence key_1:value_1 key_2:value_2 ... key_n:value_n

One protein is defined per line, where the key:value pairs are optional and there can be between 0 and n of them. NO duplicates are allowed: duplicate entries in the file will trigger an unrecoverable error.

A couple of key points here:

  • The default delimiter is a tab (‘\t’) but this can be changed with the delimiter argument

  • Keys and values must be separated by a ‘:’. As a result any delimiter other than ‘:’ can be used, but ‘:’ is reserved for this role.

  • If a protein with the UID from the file exists in the passed proteome then this will throw an exception unless safe=False

Parameters:
  • proteome (Proteome) – Proteome object to which proteins will be added

  • filename (str) – Name of the SHEPHARD protein file to read

  • delimiter (str (default = '\t')) – String used as a delimiter on the input file.

  • return_dictionary (bool (default = False)) – If set to true, this function will return the protein dictionary and will NOT add that dictionary to the proteome - i.e. the function basically becomes a parser for SHEPHARD-compliant protein files. Default = False

  • safe (bool (default = True)) – If set to True then any exceptions raised during the protein-adding process are acted on. Specifically this becomes relevant if we wish to overwrite duplicates (or throw an exception on duplicates).

  • skip_bad (bool (default = True)) – Flag that means if bad lines (lines that trigger an exception) are encountered the code will just skip them. By default this is true, which adds a certain robustness to file parsing, but could also hide errors. Note that if lines are skipped a warning will be printed (regardless of verbose flag). skip_bad exclusively influences the file-reading part of the process.

  • verbose (bool (default = True)) – Flag that defines how ‘loud’ output is. Will warn about errors on adding proteins.

Returns:

If return_dictionary is set to False (default) then this function has no return value, but the proteins are added to the Proteome object passed as the first argument. If return_dictionary is set to True the function returns the parsed proteins dictionary without adding the newly-read proteins to the proteome.

Return type:

None or dict

write_proteins(proteome, filename, delimiter='\t')[source]

Function that writes out proteins to file in a standardized format. Note that attributes are converted to a string, which for simple attributes is reasonable but is not really a viable strategy for complex objects, although this will not yield an error.

Writes out files with the format:

>>> Unique_ID name sequence key_1:value_1 key_2:value_2 ... key_n:value_n

Parameters:
  • proteome (Proteome) – Proteome object from which the proteins will be extracted

  • filename (str) – Filename that will be used to write the new proteins file

  • delimiter (str (default = '\t')) – Character (or characters) used to separate between fields. Default is ‘\t’, which is recommended to maintain compliance with the default add_proteins_from_file() function.

Returns:

No return type, but generates a new file with the complete set of proteins from this proteome written to disk.

Return type:

None
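
The note about attributes being converted to strings can be illustrated with a short sketch of how one line in the format above might be assembled. This is not SHEPHARD's implementation: format_protein_line is a hypothetical helper, and the IDs, names and attributes are invented for the example.

```python
def format_protein_line(unique_id, name, sequence, attributes=None, delimiter='\t'):
    """Build one line: Unique_ID name sequence key_1:value_1 ...

    Hypothetical helper for illustration. Every attribute value is
    converted with str(), so complex objects are written as their
    string representation.
    """
    fields = [unique_id, name, sequence]
    for key, value in (attributes or {}).items():
        fields.append('%s:%s' % (key, str(value)))
    return delimiter.join(fields)


# Simple attributes convert to readable strings
simple_line = format_protein_line(
    'P00001', 'Example protein 1', 'MKTAYIAKQR',
    {'localization': 'nucleus', 'copy_number': 1200},
)
print(repr(simple_line))

# A list is written without error, but only as its string form
complex_line = format_protein_line(
    'P00002', 'Example protein 2', 'MSDNELLK',
    {'partners': ['P00001', 'P00003']},
)
print(repr(complex_line))
```

In the second case the list is written successfully, but anything reading the file back would recover only a string, not the original list.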