Source code for shephard.apis.fasta

"""
SHEPHARD: 
Sequence-based Hierarchical and Extendable Platform for High-throughput Analysis of Region of Disorder

Authors: Garrett M. Ginell & Alex S. Holehouse
Contact: (g.ginell@wustl.edu)

Holehouse Lab - Washington University in St. Louis
"""

import protfasta
from shephard.proteome import Proteome
from shephard.exceptions import APIException

SHEPHARD_ATTRIBUTE_SPLITTER='SHPRD_ATTRIBUTES='

## ------------------------------------------------------------------------
##
[docs]def fasta_to_proteome(filename, proteome=None, build_unique_ID=None, build_attributes=None, use_header_as_unique_ID=False, force_overwrite=False, invalid_sequence_action='fail'): """ Stand alone function that allows the user to build a Proteome from a standard FASTA file, or add sequences in a FASTA file to an existing Proteome. This function can be used to read additional sequences into an existing Proteome object, or create a new Proteome object from a FASTA file. In addition, some control over how invalid sequences should be dealt with are defined by the invalid_sequence_action flag. The input filename must be a FASTA file without duplicate headers. If the file has duplicate headers and these have to be further processed we suggest using the protfasta (https://protfasta.readthedocs.io/) package to parse through the FASTA file first creating a santizied input FASTA. The FASTA file is parsed into a set of proteins, each of which has (1) a unique ID, (2) a name, (3) a sequence, and (4) optionally, a dictionary of attributes. The protein name is defined as the full FASTA header, and the sequence based on the FASTA record sequence. Sequence validation is also provided at the point of file-parsing. The unique ID and attributes are discused below. Each protein in a Proteome must have a unique_ID associated with it. There are three ways a FASTA file can be used to generate a unique ID: 1. By using the FASTA header as a unique ID, although this fails if there are non-unique FASTA headers. 2. By parsing the FASTA header, to extract out a unique ID. For example, FASTA files generated by other databases often include unique identifiers in a structured way, which could be extracted in a consistent manner for every FASTA record. 3. By incrementing an automatically unique ID, removing any dependence on the FASTA header itself. These options can be selected using the flags provided in the function signature. Note that if both using the FASTA header directly and parsing the FASTA header are selected an exception will be raised as only one of these two can be requested simultaneously. By default, a numerically unique value is used. Note that if your are reading FASTA files generated by UniProt, we recommend the api.uniprot functions instead of the more generic api.fasta functions. To build protein attributes, this can in principle be parsed out of the FASTA header by providing a build_attributes() function. However, we would in general suggest that the better way to do this would be annotate a Proteome with attribibutes and the same those using the associated interfaces.si_protein_attributes functionality. Parameters ------------ filename : string Name of the FASTA file we're going to parse in. proteome : Proteome (default = None) If a Proteome object is provided the FASTA file will be read and added to the existing proteome, whereas if set to None a new Proteome will be generated. build_unique_ID : funct (default = None) This parameter allows a user-defined function that is used to convert the FASTA header to a (hopefully) unique string. This can be useful if the FASTA header is well structured and includes a specific, useful unique ID that can be used as the unique_ID. Specifically, the build_unique_ID function should take in the a str (a FASTA header) and return a string which will be used as a unique ID. build_attributes : funct (default = None) This parameter allows a user-defined function that allows meta-information from the FASTA header to be converted into protein attributes. Specifically, build_attributes should be a function which takes in the FASTA header as a string and returns a dictionary where key:value pairs are assigned as protein attributes. This can be useful if the FASTA header is well-structured. use_header_as_unique_ID : bool (default = False) If this flag is set to true, it means the unique_ID is set to the FASTA file header. If non-unique headers are found this will trigger an exception. force_overwrite : bool (default = False) If this flag is set to true and we encounter a unique_ID that is already in the proteome the newer value overwrites the older one. This is mostly useful if you are adding in a file with known duplicate entries OR combining multiple FASTA files where you know there's some duplications. Important - if we're building unique IDs based on numerical record indices then EVERY FASTA entry will be given a unique_ID (meaning force_overwrite is irrelevant in this case). invalid_sequence_action : str (default = 'fail') Selector which defines the behaviour if a sequence with a non- standard amino acid is encountered. Valid options and their meaning are listed below: * ``ignore`` - invalid sequences are completely ignored. * ``fail`` - invalid sequence cause parsing to fail and throw an exception. * ``remove`` - invalid sequences are removed. * ``convert`` - invalid residues are converted to valid residues. * ``convert-ignore`` - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored. Returns -------- Proteome Returns an initialized Proteome object """ # parameter sanity checking if use_header_as_unique_ID is True and build_unique_ID is not None: raise APIException('Cannot simultaneously set use_header_as_unique_ID = True and build_unique_ID to not None') # read in the fasta file using protfasta fasta_dictionary = protfasta.read_fasta(filename, invalid_sequence_action=invalid_sequence_action) # extract the keys (FASTA headers) and initialize the record_index (internal # numbering used for construction. Also initialize the proteom_dict, which is # a dictionary of protein entries we passed to Proteome. record_index = 0 # IF we're adding to a new proteome this bit of code sets the record_index to the largest new integer # such that we can add multiple proteomes in succession and we'll get a proteome where there are numerically # contigous unique_IDs. Note we only do this if we'll be using the record_index if proteome is not None and (build_unique_ID is None or use_header_as_unique_ID is None): numeric_record_ids = [] for uid in proteome.proteins: try: numeric_record_ids.append(int(uid)) except ValueError: pass if len(numeric_record_ids) > 0: record_index = max(numeric_record_ids)+1 # initialize the empty list proteome_list = [] # for each entry for k in fasta_dictionary: # create a key-value pair where # key = the unique record_index (this is only used for internal structure # within this function to assure we never overwrite in this dictionary # # value = a four-position list where the positions reflect the following # [0] = amino acid sequence # [1] = name (this can be anything) # [2] = unique_ID - this should be a unique identifier that can be used # to cross-reference this entry to other data. If extrat_unique_ID # is passed we try to use this # [3] = attribute dictionary (we set this to None) # get unique_ID if build_unique_ID: unique_ID = build_unique_ID(k) elif use_header_as_unique_ID is True: unique_ID = k else: unique_ID = record_index # build an attributes dictionary using the user-provided custom function if build_attributes: attributes = build_attributes(k) else: attributes = {} # now create an input dictionary orbject newdict = {} newdict['sequence'] = str(fasta_dictionary[k]) newdict['name'] = k newdict['unique_ID'] = unique_ID newdict['attributes'] = attributes proteome_list.append(newdict) record_index = record_index + 1 # finally if a proteome was provided then if proteome is not None: proteome.add_proteins(proteome_list, force_overwrite=force_overwrite) return proteome else: # no proteome provided so build a new proteome and return it return Proteome(proteome_list, force_overwrite=force_overwrite)
## ------------------------------------------------------------------------ ## def shephard_fasta_to_proteome(filename, proteome = None, force_overwrite=False, invalid_sequence_action='fail'): """ Stand alone function that allows the user to build a proteome from a FASTA file generated by SHEPHARD (using the proteome_to_fasta() function. When SHEPHARD generates a FASTA file it uses a general convention for encoding the unique ID, protein name, and attributes. Specifically, the FASTA header has the form >SHPRD|<UNIQUE_ID>|<PROTEIN_NAME|SHPRD_ATTRIBUTES=<ATTRIBUTE_NAME>=<ATTRIBUTE_VALUE>\t Where an arbitrary number of name/value attribute paris can be encoded separated by a tab character. WARNING: The support for protein attributes in FASTA files is included mainly for easy sharing of FASTA files that are usable outside of SHEPHARD. We recommend using interfaces.si_protein_attributes functions for dealing with Protein attributes. Parameters ------------ filename : str Name of the FASTA file we're going to parse in. proteome : Proteome (default = None) If a Proteome object is provided the FASTA file will be read and added to the existing proteome, whereas if set to None a new Proteome will be generated. force_overwrite : bool (default = False) If this flag is set to true and we encounter a unique_ID that is already in the proteome the newer value overwrites the older one. This is mostly useful if you are adding in a file with known duplicate entries OR combining multiple FASTA files where you know there's some duplications. Important - if we're building unique IDs based on numerical record indices then EVERY FASTA entry will be given a unique_ID (meaning force_overwrite is irrelevant in this case). invalid_sequence_action : str (default = 'fail') Selector which defines the behaviour if a sequence with a non- standard amino acid is encountered. Valid options and their meaning are listed below: * ``ignore`` - invalid sequences are completely ignored. * ``fail`` - invalid sequence cause parsing to fail and throw an exception. * ``remove`` - invalid sequences are removed. * ``convert`` - invalid residues are converted to valid residues. * ``convert-ignore`` - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored. Returns -------- Proteome Object Returns an initialized Proteome object """ # read in the fasta file using protfasta fasta_dictionary = protfasta.read_fasta(filename, invalid_sequence_action=invalid_sequence_action) # initialize the empty list proteome_list = [] # for each entry for k in fasta_dictionary: # because we know what the header format will be we can be definitive about extracting the relevant information fasta_split = k.split('|') # ENSURE EVERY single line is a valid if fasta_split[0] != "SHPRD": raise APIException('Trying to parse a FASTA file that is expected to be SHEPHARD generated but formatting does not comply [on entry %s in file %s]' % (k, filename)) # extract out try: # get the unique ID unique_ID = fasta_split[1] # then take everything after the unique_ID tmp = "|".join(fasta_split[2:]) attributes_string = tmp.split(SHEPHARD_ATTRIBUTE_SPLITTER) name = attributes_string[0] except IndexError: raise APIException('Trying to parse a FASTA file that is expected to be SHEPHARD generated but formatting does not comply [on entry %s in file %s]' % (k, filename)) attributes_dict = {} if len(attributes_string) > 1: attributes_string_s = attributes_string[1].split('\t') for a in attributes_string_s: local_k = a.strip().split('=')[0].strip() local_v = a.strip().split('=')[0].strip() attributes_dict[local_k] = local_v # now create an protein dictionary object and populate! newdict = {} newdict['sequence'] = str(fasta_dictionary[k]) newdict['name'] = name newdict['unique_ID'] = unique_ID newdict['attributes'] = attributes_dict proteome_list.append(newdict) # finally if a proteome was provided then if proteome is not None: proteome.add_proteins(proteome_list, force_overwrite=force_overwrite) return proteome else: # no proteome provided so build a new proteome and return it return Proteome(proteome_list, force_overwrite=force_overwrite) ## ------------------------------------------------------------------------ ##
[docs]def proteome_to_fasta(filename, proteome, include_attributes_in_header=False): """ Stand alone function that allows the user to write a SHEPHARD-specific FASTA file from a Proteome object. WARNING: The support for protein attributes in FASTA files is included mainly for easy sharing of FASTA files that are usable outside of SHEPHARD. We recommend using interfaces.si_protein_attributes functions for dealing with Protein attributes. Parameters ------------ filename : str Name of the FASTA file we're going to write to. We will automatically overwrite a file if it's there, so be careful! Note that no extension is added in part because FASTA files can be .f/.fa/.fasta. Recommended a .fasta file extension. proteome : Proteome The proteome object that will be written to disk include_attributes_in_header : bool (default = falseFalse) Flag which if set to true means each Protein's attributes will be included in the FASTA header. We generally do not recommend this other than times when sharing annotated FASTA files outside of a SHEPHARD ecosystem would be useful. Returns ----------- None No return object but a new file will be written """ # build output list with or without the unique_ID outlist = [] for protein in proteome: # this is where we define the FASTA header... fasta_header = "SHPRD|%s|%s" % (protein.unique_ID, protein.name) # this is where we append the FASTA header with attributes if include_attributes_in_header: fasta_header = fasta_header + " " + SHEPHARD_ATTRIBUTE_SPLITTER for k in protein.attributes: # these lines ensure there are no tabs in the attribute names of # values before we append them to the header file, ensuring that # IF we want to read the fasta header info back in to attributes # we can be confident that hidden tabs in the variables won't # mess things up! k_fixed = k.replace('\t', ' ' ) i = protein.attribute(k) i_fixed = i.replace('\t', ' ') fasta_header = fasta_header + '\t' + "%s=%s" %(k_fixed, i_fixed) outlist.append([fasta_header, protein.sequence]) # use the protfasta library to write the file to disk protfasta.write_fasta(outlist, filename, linelength=80)