easy-entrez#

Python REST API for Entrez E-Utilities, aiming to be easy to use and reliable.

Easy-entrez:

makes common tasks easy thanks to simple Pythonic API,
is typed and integrates well with mypy,
is tested on Windows, Mac and Linux across Python 3.7 to 3.12,
is limited in scope, allowing to focus on the reliability of the core code,
does not use the stateful API as it is error-prone as seen on example of the alternative entrezpy.

Examples#

from easy_entrez import EntrezAPI

entrez_api = EntrezAPI(
    'your-tool-name',
    'e@mail.com',
    # optional
    return_type='json'
)

# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)

# data will be populated with JSON or XML (depending on the `return_type` value)
result.data

See more in the Demo notebook and documentation.

For a real-world example (i.e. used for this publication) see notebooks in multi-omics-state-of-the-field repository.

Fetching genes for a variant from dbSNP#

Fetch the SNP record for rs6311:

rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311

Display the result:

from easy_entrez.parsing import xml_to_string

print(xml_to_string(rs6311))

Find the gene names for rs6311:

namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
    name.text
    for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)

['HTR2A']

Fetch data for multiple variants at once:

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
    'rs' + document_summary.get('uid'): [
        element.text
        for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
    ]
    for document_summary in result.data
}
print(gene_names)

{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}

Obtaining the chromosomal position from SNP rsID number#

from pandas import DataFrame

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')

variant_positions = DataFrame([
    {
        'id': 'rs' + document_summary.get('uid'),
        'chromosome': chromosome,
        'position': position
    }
    for document_summary in result.data
    for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
    for chromosome, position in [chrom_and_position.text.split(':')]
])

variant_positions

id

chromosome

position

0

rs6311

13

46897343

1

rs662138

6

160143444

	id	chromosome	position
0	rs6311	13	46897343
1	rs662138	6	160143444

Converting full variation/mutation data to tabular format#

Parsing utilities can quickly extract the data to a VariantSet object holding pandas DataFrames with coordinates and alternative alleles frequencies:

from easy_entrez.parsing import parse_dbsnp_variants

variants = parse_dbsnp_variants(result)
variants

<VariantSet with 2 variants>

To get the coordinates:

variants.coordinates

rs_id

ref

alts

chrom

pos

chrom_prev

pos_prev

consequence

rs6311

C

A,T

13

46897343

13

47471478

upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant

rs662138

C

G

6

160143444

6

160564476

intron_variant

rs_id	ref	alts	chrom	pos	chrom_prev	pos_prev	consequence
rs6311	C	A,T	13	46897343	13	47471478	upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant
rs662138	C	G	6	160143444	6	160564476	intron_variant

For frequencies:

variants.alt_frequencies.head(5)  # using head to only display first 5 for brevity

rs_id

allele

source_frequency

total_count

study

count

0

rs6311

T

0.44349

2221

1000Genomes

984.991

1

rs6311

T

0.411261

1585

ALSPAC

651.849

2

rs6311

T

0.331696

1486

Estonian

492.9

3

rs6311

T

0.35

14

GENOME_DK

4.9

4

rs6311

T

0.402529

56309

GnomAD

22666

	rs_id	allele	source_frequency	total_count	study	count
0	rs6311	T	0.44349	2221	1000Genomes	984.991
1	rs6311	T	0.411261	1585	ALSPAC	651.849
2	rs6311	T	0.331696	1486	Estonian	492.9
3	rs6311	T	0.35	14	GENOME_DK	4.9
4	rs6311	T	0.402529	56309	GnomAD	22666

Obtaining the SNP rs ID number from chromosomal position#

You can use the query string directly:

results = entrez_api.search(
    '13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])

['59296319', '17076752', '7336701', '4']

Or pass a dictionary (no validation of arguments is performed, AND conjunction is used):

results = entrez_api.search(
    dict(chromosome=13, organism='human', position=31873085),
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])

['59296319', '17076752', '7336701', '4']

The base position should use the latest genome assembly (GRCh38 at the time of writing); you can use the position in previous assembly coordinates by replacing POSITION with POSITION_GRCH37. For more information of the arguments accepted by the SNP database see the entrez help page on NCBI website.

Obtaining amino acids change information for variants in given range#

First we search for dbSNP rs identifiers for variants in given region:

dbsnp_ids = (
    entrez_api
    .search(
        '12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',
        database='snp',
        max_results=100
    )
    .data
    ['esearchresult']
    ['idlist']
)

Then fetch the variant data for identifiers:

variant_data = entrez_api.fetch(
    ['rs' + rs_id for rs_id in dbsnp_ids],
    max_results=10,
    database='snp'
)

And parse the data, extracting the HGVS out of summary:

from easy_entrez.parsing import parse_dbsnp_variants
from pandas import Series


def select_protein_hgvs(items):
    return [
        [sequence, hgvs]
        for entry in items
        for sequence, hgvs in [entry.split(':')]
        if hgvs.startswith('p.')
    ]


protein_hgvs = (
    parse_dbsnp_variants(variant_data)
    .summary
    .HGVS
    .apply(select_protein_hgvs)
    .explode()
    .dropna()
    .apply(Series)
    .rename(columns={0: 'sequence', 1: 'hgvs'})
)
protein_hgvs.head()

rs_id

sequence

hgvs

rs1940853486

NP_006437.3

p.Gly203Ter

rs1940853414

NP_006437.3

p.Glu202Gly

rs1940853378

NP_006437.3

p.Glu202Lys

rs1940853299

NP_006437.3

p.Lys201Thr

rs1940852987

NP_006437.3

p.Asp198Glu

rs_id	sequence	hgvs
rs1940853486	NP_006437.3	p.Gly203Ter
rs1940853414	NP_006437.3	p.Glu202Gly
rs1940853378	NP_006437.3	p.Glu202Lys
rs1940853299	NP_006437.3	p.Lys201Thr
rs1940852987	NP_006437.3	p.Asp198Glu

Fetching more than 10 000 entries#

Use in_batches_of method to fetch more than 10k entries (e.g. variant_ids):

snps_result = (
    entrez.api
    .in_batches_of(1_000)
    .fetch(variant_ids, max_results=5_000, database='snp')
)

The result is a dictionary with keys being identifiers used in each batch (because the Entrez API does not always return the indentifiers back) and values representing the result. You can use parse_dbsnp_variants directly on this dictionary.

Find PubMed ID from DOI#

When searching GWAS catalog PMID is needed over DOI. You can covert one to the other using:

def doi_term(doi: str) -> str:
    """Clean a DOI string by removing URL prefix."""
    doi = (
        doi
        .replace('http://', 'https://')
        .replace('https://doi.org/', '')
    )
    return f'"{doi}"[Publisher ID]'


result = entrez_api.search(
    doi_term('https://doi.org/10.3389/fcell.2021.626821'),
    database='pubmed',
    max_results=1
)
print(result.data['esearchresult']['idlist'])

['33834021']

Installation#

Requires Python 3.6+ (though only 3.7+ is tested). Install with:

pip install easy-entrez

If you wish to enable (optional, tqdm-based) progress bars use:

pip install easy-entrez[with_progress_bars]

If you wish to enable (optional, pandas-based) parsing utilities use:

pip install easy-entrez[with_parsing_utils]

Contributing#

To build the documentation locally:

pip install -e .[docs]
sphinx-build docs docs/_build
open docs/_build/index.html

Alternatives#

You might want to try:

biopython.Entrez - biopython is a heavy dependency, but probably good choice if you already use it
pubmedpy - provides interesting utilities for parsing the responses
entrez - appears to have a comparable scope but quite different API
entrezpy - this one did not work well for me (hence this package), but may have improved since