"""
SHEPHARD:
Sequence-based Hierarchical and Extendable Platform for High-throughput Analysis of Region of Disorder
Authors: Garrett M. Ginell & Alex S. Holehouse
Contact: (g.ginell@wustl.edu)
Holehouse Lab - Washington University in St. Louis
"""
import protfasta
from shephard.proteome import Proteome
from shephard.exceptions import APIException
SHEPHARD_ATTRIBUTE_SPLITTER='SHPRD_ATTRIBUTES='
## ------------------------------------------------------------------------
##
def fasta_to_proteome(filename,
                      proteome=None,
                      build_unique_ID=None,
                      build_attributes=None,
                      use_header_as_unique_ID=False,
                      force_overwrite=False,
                      invalid_sequence_action='fail'):
    """
Stand-alone function that builds a Proteome from a standard FASTA file,
or adds the sequences in a FASTA file to an existing Proteome. How
invalid sequences are handled is controlled by the
invalid_sequence_action flag.
The input filename must be a FASTA file without duplicate headers. If
the file has duplicate headers that need further processing, we suggest
first parsing it with the protfasta (https://protfasta.readthedocs.io/)
package to create a sanitized input FASTA file.
The FASTA file is parsed into a set of proteins, each of which has (1)
a unique ID, (2) a name, (3) a sequence, and (4) optionally, a
dictionary of attributes.
The protein name is defined as the full FASTA header, and the
sequence based on the FASTA record sequence. Sequence validation is
also provided at the point of file-parsing. The unique ID and
attributes are discussed below.
Each protein in a Proteome must have a unique_ID associated with it.
There are three ways a FASTA file can be used to generate a unique ID:
1. By using the FASTA header as a unique ID, although this fails
if there are non-unique FASTA headers.
2. By parsing the FASTA header, to extract out a unique ID. For
example, FASTA files generated by other databases often include
unique identifiers in a structured way, which could be extracted
in a consistent manner for every FASTA record.
3. By assigning an automatically incremented numerical ID, removing any
dependence on the FASTA header itself.
These options can be selected using the flags provided in the function
signature. Note that if both using the FASTA header directly and parsing
the FASTA header are selected an exception will be raised as only one
of these two can be requested simultaneously.
By default, a numerically unique value is used. Note that if you are
reading FASTA files generated by UniProt, we recommend the api.uniprot
functions instead of the more generic api.fasta functions.
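As an illustration of option 2 above, the sketch below defines a
header-parsing function of the kind that could be passed as build_unique_ID.
The header layout (pipe-delimited, with the identifier in the second field)
and the function name are assumptions made for this example, not a format
that SHEPHARD requires.

```python
def second_field_as_unique_ID(header):
    # Hypothetical parser: assumes headers look like 'db|ID|description'
    # and returns the second pipe-delimited field as the unique ID.
    fields = header.split('|')
    if len(fields) < 2 or not fields[1].strip():
        raise ValueError('Could not extract a unique ID from header: %s' % header)
    return fields[1].strip()


example_ID = second_field_as_unique_ID('mydb|PROT0001|example protein')
print(example_ID)
```

A function like this would then be supplied as
fasta_to_proteome(filename, build_unique_ID=second_field_as_unique_ID).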
Protein attributes can in principle be parsed out of the FASTA header by
providing a build_attributes() function. However, we would in general
suggest that the better approach is to annotate a Proteome with
attributes and then save those using the associated
interfaces.si_protein_attributes functionality.
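For cases where header-derived attributes are still wanted, a minimal
sketch of a build_attributes-style function follows. It assumes a
hypothetical header layout in which attributes appear as
whitespace-separated KEY=VALUE tokens; the function name and the layout are
illustrative only.

```python
def key_value_attributes(header):
    # Hypothetical parser: collects whitespace-separated KEY=VALUE tokens
    # from the header; tokens without an '=' are skipped.
    attributes = {}
    for token in header.split():
        if '=' in token:
            # split on the first '=' only, so values may contain '='
            key, value = token.split('=', 1)
            attributes[key] = value
    return attributes


example_attributes = key_value_attributes('PROT0001 organism=yeast length=120')
print(example_attributes)
```

Note that all values are returned as strings; any type conversion would
need to happen inside the user-defined function.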
Parameters
------------
filename : string
Name of the FASTA file we're going to parse in.
proteome : Proteome (default = None)
If a Proteome object is provided the FASTA file will be read and
added to the existing proteome, whereas if set to None a new
Proteome will be generated.
build_unique_ID : funct (default = None)
This parameter allows a user-defined function that is used to
convert the FASTA header to a (hopefully) unique string. This can
be useful if the FASTA header is well structured and includes
a specific, useful unique ID that can be used as the unique_ID.
Specifically, the build_unique_ID function should take in
a str (a FASTA header) and return a string which will be used
as a unique ID.
build_attributes : funct (default = None)
This parameter allows a user-defined function that allows
meta-information from the FASTA header to be converted into
protein attributes. Specifically, build_attributes should be a
function which takes in the FASTA header as a string and returns
a dictionary where key:value pairs are assigned as protein attributes.
This can be useful if the FASTA header is well-structured.
use_header_as_unique_ID : bool (default = False)
If this flag is set to true, it means the unique_ID is set to the FASTA
file header. If non-unique headers are found this will trigger an
exception.
force_overwrite : bool (default = False)
If this flag is set to true and we encounter a unique_ID that is
already in the proteome the newer value overwrites the older one.
This is mostly useful if you are adding in a file with known
duplicate entries OR combining multiple FASTA files where you know
there are some duplicates. Important - if we're building unique IDs
based on numerical record indices then EVERY FASTA entry will be given
a unique_ID (meaning force_overwrite is irrelevant in this case).
invalid_sequence_action : str (default = 'fail')
Selector which defines the behaviour if a sequence with a non-
standard amino acid is encountered. Valid options and their meaning
are listed below:
* ``ignore`` - invalid sequences are completely ignored.
* ``fail`` - invalid sequences cause parsing to fail and throw an exception.
* ``remove`` - invalid sequences are removed.
* ``convert`` - invalid residues are converted to valid residues.
* ``convert-ignore`` - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored.
Returns
--------
Proteome
Returns an initialized Proteome object
    """
    # parameter sanity checking
    if use_header_as_unique_ID is True and build_unique_ID is not None:
        raise APIException('Cannot simultaneously set use_header_as_unique_ID = True and build_unique_ID to not None')

    # read in the fasta file using protfasta
    fasta_dictionary = protfasta.read_fasta(filename, invalid_sequence_action=invalid_sequence_action)
    # initialize the record_index, the internal counter used to generate
    # numerical unique IDs
    record_index = 0

    # If we're adding to an existing Proteome, set the record_index to one more than the
    # largest integer unique_ID already present, so that adding multiple FASTA files in
    # succession yields numerically contiguous unique_IDs. Note we only do this if we'll
    # be using the record_index
    if proteome is not None and build_unique_ID is None and use_header_as_unique_ID is not True:
        numeric_record_ids = []
        for uid in proteome.proteins:
            try:
                numeric_record_ids.append(int(uid))
            except ValueError:
                pass
        if len(numeric_record_ids) > 0:
            record_index = max(numeric_record_ids) + 1
    # initialize the empty list
    proteome_list = []

    # for each entry
    for k in fasta_dictionary:
        # create a dictionary describing the protein, with the keys
        #   'sequence'   = amino acid sequence
        #   'name'       = name (the full FASTA header)
        #   'unique_ID'  = unique identifier that can be used to
        #                  cross-reference this entry to other data. If
        #                  build_unique_ID is passed we use it
        #   'attributes' = attribute dictionary (empty unless
        #                  build_attributes is passed)

        # get unique_ID
        if build_unique_ID:
            unique_ID = build_unique_ID(k)
        elif use_header_as_unique_ID is True:
            unique_ID = k
        else:
            unique_ID = record_index

        # build an attributes dictionary using the user-provided custom function
        if build_attributes:
            attributes = build_attributes(k)
        else:
            attributes = {}

        # now create an input dictionary object
        newdict = {}
        newdict['sequence'] = str(fasta_dictionary[k])
        newdict['name'] = k
        newdict['unique_ID'] = unique_ID
        newdict['attributes'] = attributes
        proteome_list.append(newdict)
        record_index = record_index + 1
    # finally, if a proteome was provided then add the new proteins to it
    if proteome is not None:
        proteome.add_proteins(proteome_list, force_overwrite=force_overwrite)
        return proteome
    else:
        # no proteome provided so build a new proteome and return it
        return Proteome(proteome_list, force_overwrite=force_overwrite)
## ------------------------------------------------------------------------
##
def shephard_fasta_to_proteome(filename,
                               proteome=None,
                               force_overwrite=False,
                               invalid_sequence_action='fail'):
    """
Stand-alone function that allows the user to build a Proteome
from a FASTA file generated by SHEPHARD (using the proteome_to_fasta()
function). When SHEPHARD generates a FASTA file it uses a general
convention for encoding the unique ID, protein name, and attributes.
Specifically, the FASTA header has the form
>SHPRD|<UNIQUE_ID>|<PROTEIN_NAME> SHPRD_ATTRIBUTES=\t<ATTRIBUTE_NAME>=<ATTRIBUTE_VALUE>
where an arbitrary number of name/value attribute pairs can be encoded,
each preceded by a tab character.
WARNING: The support for protein attributes in FASTA files is included
mainly for easy sharing of FASTA files that are usable outside of
SHEPHARD. We recommend using interfaces.si_protein_attributes functions
for dealing with Protein attributes.
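To make the header convention described above concrete, the following
self-contained sketch splits a SHEPHARD-style header into its unique ID,
name, and attributes. It is an illustration only, not the code path used by
this function: the helper name parse_shephard_header is hypothetical, and
stripping whitespace around the name is a choice made for this example.

```python
ATTRIBUTE_SPLITTER = 'SHPRD_ATTRIBUTES='


def parse_shephard_header(header):
    # Split into the SHPRD tag, the unique ID, and everything else; the
    # remainder is re-joined in case the protein name itself contains '|'.
    fields = header.split('|')
    if len(fields) < 3 or fields[0] != 'SHPRD':
        raise ValueError('Not a SHEPHARD-style header: %s' % header)
    unique_ID = fields[1]
    remainder = '|'.join(fields[2:]).split(ATTRIBUTE_SPLITTER)
    name = remainder[0].strip()

    # Attribute pairs are tab-delimited NAME=VALUE entries; entries without
    # an '=' (such as the empty string before the first tab) are skipped.
    attributes = {}
    if len(remainder) > 1:
        for pair in remainder[1].split('\t'):
            if '=' in pair:
                key, value = pair.strip().split('=', 1)
                attributes[key.strip()] = value.strip()
    return unique_ID, name, attributes


example = 'SHPRD|001|my protein SHPRD_ATTRIBUTES=\tcolor=red\tsize=big'
print(parse_shephard_header(example))
```

Splitting each pair on the first '=' only means attribute values may
themselves contain '=' characters.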
Parameters
------------
filename : str
Name of the FASTA file we're going to parse in.
proteome : Proteome (default = None)
If a Proteome object is provided the FASTA file will be read and
added to the existing proteome, whereas if set to None a new
Proteome will be generated.
force_overwrite : bool (default = False)
If this flag is set to True and we encounter a unique_ID that is
already in the proteome the newer value overwrites the older one.
This is mostly useful if you are adding in a file with known
duplicate entries OR combining multiple FASTA files where you know
there are some duplicates.
invalid_sequence_action : str (default = 'fail')
Selector which defines the behaviour if a sequence with a non-
standard amino acid is encountered. Valid options and their meaning
are listed below:
* ``ignore`` - invalid sequences are completely ignored.
* ``fail`` - invalid sequences cause parsing to fail and throw an exception.
* ``remove`` - invalid sequences are removed.
* ``convert`` - invalid residues are converted to valid residues.
* ``convert-ignore`` - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored.
Returns
--------
Proteome Object
Returns an initialized Proteome object
    """
    # read in the fasta file using protfasta
    fasta_dictionary = protfasta.read_fasta(filename, invalid_sequence_action=invalid_sequence_action)

    # initialize the empty list
    proteome_list = []

    # for each entry
    for k in fasta_dictionary:
        # because we know what the header format will be we can be definitive about extracting the relevant information
        fasta_split = k.split('|')

        # ensure every header is a valid SHEPHARD-generated header
        if fasta_split[0] != "SHPRD":
            raise APIException('Trying to parse a FASTA file that is expected to be SHEPHARD generated but formatting does not comply [on entry %s in file %s]' % (k, filename))

        # extract out the unique ID, name, and (optional) attributes string
        try:
            # get the unique ID
            unique_ID = fasta_split[1]
            # then take everything after the unique_ID
            tmp = "|".join(fasta_split[2:])
            attributes_string = tmp.split(SHEPHARD_ATTRIBUTE_SPLITTER)
            name = attributes_string[0]
        except IndexError:
            raise APIException('Trying to parse a FASTA file that is expected to be SHEPHARD generated but formatting does not comply [on entry %s in file %s]' % (k, filename))
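        attributes_dict = {}
        if len(attributes_string) > 1:
            attributes_string_s = attributes_string[1].split('\t')
            for a in attributes_string_s:
                # skip empty entries (e.g. from the tab that precedes the first attribute)
                if '=' not in a:
                    continue
                # split on the first '=' only, so values may themselves contain '='
                local_k, local_v = a.strip().split('=', 1)
                attributes_dict[local_k.strip()] = local_v.strip()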
        # now create a protein dictionary object and populate!
        newdict = {}
        newdict['sequence'] = str(fasta_dictionary[k])
        newdict['name'] = name
        newdict['unique_ID'] = unique_ID
        newdict['attributes'] = attributes_dict
        proteome_list.append(newdict)

    # finally, if a proteome was provided then add the new proteins to it
    if proteome is not None:
        proteome.add_proteins(proteome_list, force_overwrite=force_overwrite)
        return proteome
    else:
        # no proteome provided so build a new proteome and return it
        return Proteome(proteome_list, force_overwrite=force_overwrite)
## ------------------------------------------------------------------------
##
def proteome_to_fasta(filename, proteome, include_attributes_in_header=False):
    """
Stand alone function that allows the user to write a SHEPHARD-specific
FASTA file from a Proteome object.
WARNING: The support for protein attributes in FASTA files is included
mainly for easy sharing of FASTA files that are usable outside of
SHEPHARD. We recommend using interfaces.si_protein_attributes functions
for dealing with Protein attributes.
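As a rough illustration of the header that gets written, the sketch below
assembles a SHEPHARD-style header string from a unique ID, a name, and an
optional attribute dictionary. The helper name build_shephard_header is
hypothetical, and the sketch works on plain Python values rather than
Protein objects.

```python
ATTRIBUTE_SPLITTER = 'SHPRD_ATTRIBUTES='


def build_shephard_header(unique_ID, name, attributes=None):
    # Base header: SHPRD|<unique ID>|<name>
    header = 'SHPRD|%s|%s' % (unique_ID, name)
    if attributes is not None:
        header = header + ' ' + ATTRIBUTE_SPLITTER
        for key, value in attributes.items():
            # tabs delimit attribute pairs, so replace any embedded tabs
            key_fixed = str(key).replace('\t', ' ')
            value_fixed = str(value).replace('\t', ' ')
            header = header + '\t' + '%s=%s' % (key_fixed, value_fixed)
    return header


print(build_shephard_header('001', 'my protein', {'color': 'red'}))
```

Replacing embedded tabs before writing is what allows the attributes to be
split unambiguously on tabs when the header is read back in.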
Parameters
------------
filename : str
Name of the FASTA file we're going to write to. We will automatically
overwrite a file if it's there, so be careful! Note that no extension
is added, in part because FASTA files can be .f/.fa/.fasta.
We recommend a .fasta file extension.
proteome : Proteome
The proteome object that will be written to disk
include_attributes_in_header : bool (default = False)
Flag which if set to true means each Protein's attributes will be
included in the FASTA header. We generally do not recommend this
other than times when sharing annotated FASTA files outside of a
SHEPHARD ecosystem would be useful.
Returns
-----------
None
No return object but a new file will be written
    """
    # build output list of [header, sequence] pairs
    outlist = []
    for protein in proteome:
        # this is where we define the FASTA header...
        fasta_header = "SHPRD|%s|%s" % (protein.unique_ID, protein.name)

        # this is where we append the FASTA header with attributes
        if include_attributes_in_header:
            fasta_header = fasta_header + " " + SHEPHARD_ATTRIBUTE_SPLITTER
            for k in protein.attributes:
                # these lines ensure there are no tabs in the attribute names or
                # values before we append them to the header, ensuring that
                # IF we want to read the FASTA header info back in to attributes
                # we can be confident that hidden tabs in the variables won't
                # mess things up! Values are cast to str so that non-string
                # attributes can also be written.
                k_fixed = str(k).replace('\t', ' ')
                i = protein.attribute(k)
                i_fixed = str(i).replace('\t', ' ')
                fasta_header = fasta_header + '\t' + "%s=%s" % (k_fixed, i_fixed)
        outlist.append([fasta_header, protein.sequence])

    # use the protfasta library to write the file to disk
    protfasta.write_fasta(outlist, filename, linelength=80)