UniProtMapper package
Main Module
Field Querying
Contains all the fields used by the UniProtKB API, wrapped as Python classes compatible with the following boolean operators:
& : AND operation
| : OR operation
~ : NOT operation
For a list of query fields on UniProt’s website, refer to https://www.uniprot.org/help/query-fields
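The operator support described above can be pictured with a small, self-contained sketch. It is not UniProtMapper's implementation: the `Query` class and the `field` helper are hypothetical stand-ins, and the exact query text the package produces may differ. The sketch only shows how Python's `&`, `|` and `~` can be overloaded to assemble a boolean query string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a field object; it only holds query text."""

    text: str

    def __and__(self, other: "Query") -> "Query":
        # a & b  ->  (a AND b)
        return Query(f"({self.text} AND {other.text})")

    def __or__(self, other: "Query") -> "Query":
        # a | b  ->  (a OR b)
        return Query(f"({self.text} OR {other.text})")

    def __invert__(self) -> "Query":
        # ~a  ->  (NOT a)
        return Query(f"(NOT {self.text})")


def field(name: str, value: str) -> Query:
    """Build a `name:value` term (hypothetical helper, not a package function)."""
    return Query(f"{name}:{value}")


# Python evaluates ~ before &, and & before |, so no extra parentheses
# are needed around the negated term.
combined = field("organism_id", "9606") & field("reviewed", "true") & ~field("fragment", "true")
print(combined.text)
```

Wrapping every combination in parentheses keeps the resulting string unambiguous regardless of how many terms are chained.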
- class UniProtMapper.uniprotkb_fields.accession(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain UniProt accession key.
- class UniProtMapper.uniprotkb_fields.accession_id(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain accession ID.
- class UniProtMapper.uniprotkb_fields.active(*args, **kwargs)[source]
Bases:
BooleanField
Boolean field. If set to False, return obsolete entries
- class UniProtMapper.uniprotkb_fields.cc_mass_spectrometry(*args, **kwargs)[source]
Bases:
SimpleField
Check UniProt’s official documentation for the description of this field:
- class UniProtMapper.uniprotkb_fields.cc_webresource(*args, **kwargs)[source]
Bases:
SimpleField
Query all entries with a certain web resource.
E.g.: all proteins described in Wikipedia:
>>> cc_webresource('Wikipedia')
- class UniProtMapper.uniprotkb_fields.chebi(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain ChEBI identifier. For more information, check the following: https://www.uniprot.org/help/chemical_data_search
- class UniProtMapper.uniprotkb_fields.database(*args, **kwargs)[source]
Bases:
SimpleField
List all entries with a cross reference to a certain database
- class UniProtMapper.uniprotkb_fields.date_created(*args, **kwargs)[source]
Bases:
DateRangeField
Query entries created within a time range defined from dates. Date format: YYYY-MM-DD.
The wildcard * can also be used in place of either date to leave the range open at its earliest or latest end.
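As a rough illustration of the date range described above, the sketch below validates two bounds and formats them as a range term. The function names are hypothetical, and the `field:[start TO end]` output form is an assumption based on UniProt's public query syntax rather than on the package's source.

```python
import re
from datetime import date


def date_bound(value: str) -> str:
    """Accept the wildcard '*' or a calendar date written as YYYY-MM-DD."""
    if value == "*":
        return value
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        raise ValueError(f"expected YYYY-MM-DD or '*', got {value!r}")
    date.fromisoformat(value)  # raises ValueError for dates such as 2020-02-30
    return value


def date_range_term(field_name: str, start: str, end: str) -> str:
    """Format a date range term (hypothetical helper, not a package function)."""
    return f"{field_name}:[{date_bound(start)} TO {date_bound(end)}]"


# Entries created on or after 1 January 2020, with an open upper bound.
print(date_range_term("date_created", "2020-01-01", "*"))
```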
- class UniProtMapper.uniprotkb_fields.date_modified(*args, **kwargs)[source]
Bases:
DateRangeField
Query entries modified within a time range defined from dates. Date format: YYYY-MM-DD.
The wildcard * can also be used in place of either date to leave the range open at its earliest or latest end.
- class UniProtMapper.uniprotkb_fields.date_sequence_modified(*args, **kwargs)[source]
Bases:
DateRangeField
Query entries with sequence information modified within a time range defined from dates. Date format: YYYY-MM-DD.
The wildcard * can also be used in place of either date to leave the range open at its earliest or latest end.
- class UniProtMapper.uniprotkb_fields.ec(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain EC number - specific for enzymes
- class UniProtMapper.uniprotkb_fields.existence(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain existence score. Possible values are:
Experimental evidence at protein level
Experimental evidence at transcript level
Protein inferred from homology
Protein predicted
Protein uncertain
For further information, check: https://www.uniprot.org/help/protein_existence
- class UniProtMapper.uniprotkb_fields.family(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries within a certain protein family. Further functionalities include:
Query for “name_1” while excluding entries with “name_2” as family,
>>> family('name_1 - name_2')
Query for “name_1” and “name_2”, in this order,
>>> family('name_1 name_2')
Glob-like search for all entries with family starting with “chemokine”:
>>> family('chemokine*')
For a full list of protein families within UniProt, check: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/similar.txt
- class UniProtMapper.uniprotkb_fields.fragment(*args, **kwargs)[source]
Bases:
BooleanField
Boolean field. If set to True, list entries with an incomplete sequence.
- class UniProtMapper.uniprotkb_fields.gene(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a gene name. Caveat: searching for HPSE using this field will also retrieve entries with HPSE2. For a more specific search, use gene_exact.
- class UniProtMapper.uniprotkb_fields.gene_exact(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with an exact gene name match.
- class UniProtMapper.uniprotkb_fields.go(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain Gene Ontology (GO) term. Further functionalities include:
Query for “name_1” while excluding entries with “name_2” as GO term,
>>> go('name_1 - name_2')
- class UniProtMapper.uniprotkb_fields.inchikey(*args, **kwargs)[source]
Bases:
SimpleField
Query entries associated with the small molecule identified by the input InChIKey
For more information, check: https://www.uniprot.org/help/chemical_data_search
- class UniProtMapper.uniprotkb_fields.interactor(*args, **kwargs)[source]
Bases:
SimpleField
Input should be a UniProt accession key (ID). Query all entries describing interactions with the protein represented by the input.
- class UniProtMapper.uniprotkb_fields.is_isoform(*args, **kwargs)[source]
Bases:
BooleanField
Boolean field. If set to True, return only isoform entries
- class UniProtMapper.uniprotkb_fields.keyword(*args, **kwargs)[source]
Bases:
SimpleField
Query all UniProt entries with a certain keyword. Further functionalities include:
Query for “name_1” while excluding entries with “name_2” as keyword,
>>> keyword('name_1 - name_2')
Query for “name_1” and “name_2”, in this order,
>>> keyword('name_1 name_2')
Glob-like search for all entries with keyword starting with “G-protein”:
>>> keyword('G-protein*')
For a list of keywords, check: https://www.uniprot.org/keywords?query=*
- class UniProtMapper.uniprotkb_fields.length(*args, **kwargs)[source]
Bases:
RangeField
Query entries with sequence length within a certain range.
Arguments should be integers or the wildcard * for the maximum or minimum value.
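The argument rule above (integers or the wildcard `*`) can be sketched as follows. The helper names are hypothetical, and the `field:[lower TO upper]` output is an assumption about the resulting query text, not the package's documented behaviour.

```python
from typing import Union

Bound = Union[int, str]


def check_bound(bound: Bound) -> str:
    """Return the bound as text if it is '*' or a non-negative integer."""
    if bound == "*":
        return "*"
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(bound, int) and not isinstance(bound, bool) and bound >= 0:
        return str(bound)
    raise ValueError(f"expected a non-negative integer or '*', got {bound!r}")


def range_term(field_name: str, lower: Bound, upper: Bound) -> str:
    """Format a numeric range term (hypothetical helper)."""
    low, high = check_bound(lower), check_bound(upper)
    if "*" not in (low, high) and int(low) > int(high):
        raise ValueError("lower bound must not exceed upper bound")
    return f"{field_name}:[{low} TO {high}]"


# Sequences of at least 100 residues, with no upper limit.
print(range_term("length", 100, "*"))
```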
- class UniProtMapper.uniprotkb_fields.lit_author(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with at least one reference co-authored by the specified author. Extra functionalities include:
Query for “name_1” while excluding entries with “name_2” as co-author,
>>> lit_author('name_1 - name_2')
Query for “name_1” and “name_2”, in this order,
>>> lit_author('name_1 name_2')
Glob-like search for all entries with an author name starting with “Cavad”:
>>> lit_author('Cavad*')
- class UniProtMapper.uniprotkb_fields.mass(*args, **kwargs)[source]
Bases:
RangeField
Query entries with mass within a certain range.
Arguments should be integers or the wildcard * for the maximum or minimum value.
- class UniProtMapper.uniprotkb_fields.organelle(*args, **kwargs)[source]
Bases:
SimpleField
Query entries for proteins encoded by a gene within a certain organelle.
E.g.: query entries encoded by the mitochondrial chromosome:
>>> organelle('Mitochondrion')
- class UniProtMapper.uniprotkb_fields.organism_id(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain organism taxonomy ID.
For more information on the taxonomy IDs, see: https://www.uniprot.org/taxonomy?query=*
- class UniProtMapper.uniprotkb_fields.organism_name(*args, **kwargs)[source]
Bases:
QuoteField
Field for organism names. Further functionality:
Query for “name_1” while excluding “name_2”,
>>> organism_name('name_1 - name_2')
Query for “name_1” and “name_2”, in this order,
>>> organism_name('name_1 name_2')
Glob-like search for all entries with organism names starting with “escherichia”:
>>> organism_name('escherichia*')
For more information on the organism names, see: https://www.uniprot.org/taxonomy?query=*
- class UniProtMapper.uniprotkb_fields.plasmid(*args, **kwargs)[source]
Bases:
SimpleField
Query entries for proteins encoded by a gene that is part of a certain plasmid.
For the available plasmid vocabulary, check: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/plasmid.txt
- class UniProtMapper.uniprotkb_fields.protein_name(*args, **kwargs)[source]
Bases:
QuoteField
Field for protein names. Further functionality:
Query for “name_1” while excluding “name_2”,
>>> protein_name('name_1 - name_2')
Query for “name_1” and “name_2”, in this order,
>>> protein_name('name_1 name_2')
Glob-like search for all entries with protein names starting with “anti”:
>>> protein_name('anti*')
- class UniProtMapper.uniprotkb_fields.proteome(*args, **kwargs)[source]
Bases:
SimpleField
Query all proteins belonging to a certain proteome.
For more information, check: https://www.uniprot.org/proteomes
- class UniProtMapper.uniprotkb_fields.proteome_component(*args, **kwargs)[source]
Bases:
SimpleField
Query all proteins belonging to a certain proteome component.
E.g.: list all entries from human chromosome 1:
>>> organism_id('9606') & proteome_component('chromosome:1')
- class UniProtMapper.uniprotkb_fields.reviewed(*args, **kwargs)[source]
Bases:
BooleanField
Boolean field. If set to True, return only reviewed entries
- class UniProtMapper.uniprotkb_fields.scope(*args, **kwargs)[source]
Bases:
SimpleField
Query entries containing a reference that was used to gather information about <field_value>.
E.g.: for entries containing references with information about “mutagenesis”, use:
>>> scope('mutagenesis')
- class UniProtMapper.uniprotkb_fields.sec_acc(*args, **kwargs)[source]
Bases:
SimpleField
Query entries that were created from a merge with a certain UniProt entry.
For more information, check UniProt’s FAQ: https://www.uniprot.org/help/difference_accession_entryname
- class UniProtMapper.uniprotkb_fields.taxonomy_id(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain organism taxonomy ID.
For more information on the taxonomy IDs, see: https://www.uniprot.org/taxonomy?query=*
- class UniProtMapper.uniprotkb_fields.taxonomy_name(*args, **kwargs)[source]
Bases:
SimpleField
Query for entries with a certain organism taxonomy name.
For more information on the taxonomy names, see: https://www.uniprot.org/taxonomy?query=*
E.g.: to search for all mammal proteins:
>>> taxonomy_name('mammal')
- class UniProtMapper.uniprotkb_fields.tissue(*args, **kwargs)[source]
Bases:
SimpleField
Query entries containing a reference describing the protein sequence obtained from a clone isolated from a certain tissue. For a full list of UniProt’s tissue vocabulary, check:
Further functionalities include:
Query for “name_1” while excluding entries with “name_2” as tissue,
>>> tissue('name_1 - name_2')
Query for “name_1” and “name_2”, in this order,
>>> tissue('name_1 name_2')
Glob-like search for all entries with tissue starting with “brain”:
>>> tissue('brain*')
- class UniProtMapper.uniprotkb_fields.virus_host_id(*args, **kwargs)[source]
Bases:
SimpleField
Search for all entries belonging to viruses that infect the query host organism. Input should be the taxonomy ID of the host organism.
For more information on the ID for your organism, see: https://www.uniprot.org/taxonomy?query=*
- class UniProtMapper.uniprotkb_fields.virus_host_name(*args, **kwargs)[source]
Bases:
SimpleField
Search for all entries belonging to viruses that infect the query host organism. Input should be the name of the host organism. Both common and scientific names should work.
For more information on the organism names, see: https://www.uniprot.org/taxonomy?query=*
- class UniProtMapper.uniprotkb_fields.xref(*args, **kwargs)[source]
Bases:
SimpleField
List all entries with a cross-reference to a certain database. Example of extra functionality:
Query all entries with a cross-reference to the PDB database entry 1aut:
>>> xref('pdb-1aut')
This can be especially useful for retrieving all entries involved in a certain pathway, or all entries that share a cross-reference in another database.
For a list of supported databases, check UniProtMapper.utils.read_fields_table()['returned_field'].
- class UniProtMapper.uniprotkb_fields.xref_count(*args, **kwargs)[source]
Bases:
XRefCountField
Query entries with a certain number of cross-references. All fields within UniProtMapper.utils.read_fields_table()['returned_field'] are supported.
E.g.: to query entries with 20 or more cross-references to “PDB”, use:
>>> xref_count('pdb', 20, '*')
Or:
>>> xref_count('xref_pdb', 20, '*')
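Since both spellings of the database name are accepted in the example above, one way to picture this is a normalisation step before the term is built. The sketch below is an assumption, not the package's code: the helper name is hypothetical, the prefix stripping is only one plausible way to treat `pdb` and `xref_pdb` alike, and the `xref_count_<db>:[lower TO upper]` form reflects UniProt's public query syntax as far as I understand it.

```python
def xref_count_term(database: str, lower, upper) -> str:
    """Format a cross-reference count term (hypothetical helper)."""
    name = database.lower()
    # Treat 'xref_pdb' and 'pdb' as the same database name (assumed behaviour).
    if name.startswith("xref_"):
        name = name[len("xref_"):]
    return f"xref_count_{name}:[{lower} TO {upper}]"


# Both spellings yield the same term.
print(xref_count_term("pdb", 20, "*"))
print(xref_count_term("xref_pdb", 20, "*"))
```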
Utilities
Module with utility functions for the package.
- UniProtMapper.utils.decode_results(response, file_format, compressed)[source]
Decodes the response from the UniProt API.
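The parameter names suggest that decoding depends on the requested file format and on whether the payload is compressed. A minimal sketch of that idea follows; it is not the library's implementation. The function name `decode_payload`, the use of gzip, and the handled formats ("json" and "tsv") are all assumptions made for illustration.

```python
import gzip
import json


def decode_payload(raw: bytes, file_format: str, compressed: bool):
    """Decode raw response bytes (hypothetical stand-in, assumptions as noted)."""
    if compressed:
        # Assumes gzip compression of the response body.
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")
    if file_format == "json":
        return json.loads(text)
    if file_format == "tsv":
        # One list of cells per line, split on tabs.
        return [line.split("\t") for line in text.splitlines()]
    # Any other format is returned as plain text.
    return text


# Round trip with a locally built payload; no network access is involved.
payload = gzip.compress(json.dumps({"results": []}).encode("utf-8"))
print(decode_payload(payload, "json", compressed=True))
```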
- UniProtMapper.utils.fetch_cross_referenced_db_details(output_path: str | None = None, save: bool = True) → dict[source]
Downloads the latest details on UniProt cross-references and stores them. The list of cross-references can be found here: https://www.uniprot.org/database?query=*
- Parameters:
output_path – the path to save the downloaded file with the cross references details. If left as None, will update the file stored in the package. Defaults to None.
save – whether to save the retrieved JSON. Defaults to True.
- Returns:
the json with the cross references details.
- Return type:
dict
- UniProtMapper.utils.print_progress_batches(batch_index, size, retrieved, failed)[source]
Prints the progress of a batch process.
- UniProtMapper.utils.read_fields_table()[source]
Return the fields table from the package resources as a DataFrame containing rows as the information available in UniProt and the following columns:
label: the label of the information once retrieved from the API.
returned_field: the name used in the API to return that information.
field_type: the type of information, e.g.: sequence-related, function…
has_full_version: whether the annotated field contains the full version of the dataset or not (in case of cross-references).
type: the type of data. Either “cross_reference” or “uniprot_field”.