Frequently Asked Questions
General Questions
What is PyEuropePMC?
PyEuropePMC is a Python library for searching and retrieving scientific literature from the Europe PMC database. It provides a simple, robust interface to access millions of research articles, preprints, and other scholarly content.
Who should use PyEuropePMC?
- Researchers conducting literature reviews
- Data scientists analyzing scientific publications
- Bioinformaticians building literature-based workflows
- Developers integrating publication data into applications
- Students learning about bibliometric analysis
Installation & Setup
How do I install PyEuropePMC?
pip install pyeuropepmc
For development:
pip install -e ".[dev]"
What Python versions are supported?
Python 3.10 and newer versions are supported.
Do I need an API key?
No, Europe PMC provides open access to their search API without requiring registration or API keys.
Usage Questions
How do I perform a basic search?
from pyeuropepmc.search import SearchClient
with SearchClient() as client:
results = client.search("CRISPR", pageSize=10)
for paper in results["resultList"]["result"]:
print(paper["title"])
How can I search for specific authors?
# Search by author
results = client.search('AUTH:"Smith J"')
# Search by author and topic
results = client.search('AUTH:"Smith J" AND cancer')
How do I handle large result sets?
Use the pagination features:
# Automatic pagination
all_results = client.fetch_all_pages("machine learning", max_results=1000)
# Manual pagination
page1 = client.search("query", pageSize=100, offset=0)
page2 = client.search("query", pageSize=100, offset=100)
dc_results = client.search(“query”, format=”dc”)
What output formats are available?
- JSON (default): Structured data, easy to process
- XML: Full metadata, Europe PMC native format
- Dublin Core: Standardized metadata format
Note: RIS and BibTeX support has been removed as of vX.Y.Z. Please use JSON, XML, or Dublin Core formats.
json_results = client.search("query", format="json")
xml_results = client.search("query", format="xml")
dc_results = client.search("query", format="dc")
How do I filter by publication date?
# Specific year
results = client.search("cancer AND PUB_YEAR:2023")
# Date range
results = client.search("cancer AND PUB_YEAR:[2020 TO 2023]")
# Recent articles (last 5 years)
from datetime import datetime
current_year = datetime.now().year
results = client.search(f"cancer AND PUB_YEAR:[{current_year-5} TO {current_year}]")
Advanced Usage
How do I implement rate limiting?
# Built-in rate limiting
client = SearchClient(rate_limit_delay=2.0) # 2 seconds between requests
# Custom throttling
import time
for query in queries:
results = client.search(query)
time.sleep(1) # Additional delay
Can I search multiple databases simultaneously?
# Search specific sources
results_pubmed = client.search("query", source="MED")
results_pmc = client.search("query", source="PMC")
results_preprints = client.search("query", source="PPR")
# Combine results
all_results = results_pubmed + results_pmc + results_preprints
How do I handle errors and retries?
from pyeuropepmc.search import SearchClient, EuropePMCError
try:
with SearchClient() as client:
results = client.search("complex query")
except EuropePMCError as e:
print(f"Search failed: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
The client includes built-in retry logic for network errors.
Performance & Limits
Are there API rate limits?
Europe PMC doesn’t publish specific rate limits, but we recommend:
- Maximum 1-2 requests per second
- Use built-in rate limiting:
SearchClient(rate_limit_delay=1.0) - Implement exponential backoff for errors
How many results can I retrieve?
- Single request: Up to 1,000 results
- Total: No hard limit, but be respectful of the service
- Use pagination for large datasets
How can I optimize performance?
# Use appropriate page sizes
results = client.search("query", pageSize=100) # vs. pageSize=10
# Request only needed fields
results = client.search("query", resultType="lite") # vs. "core"
# Use caching for repeated queries
# (implement your own caching layer)
Data & Results
What information is included in search results?
Standard fields include:
- Title, authors, journal
- Publication date, DOI, PMID
- Abstract (when available)
- Citation counts
- Full-text availability
How do I access full-text articles?
for paper in results["resultList"]["result"]:
# Check if full text is available
if paper.get("isOpenAccess") == "Y":
print(f"Open access: {paper.get('fullTextUrlList')}")
# PMC articles may have full text
if paper.get("pmcid"):
print(f"PMC ID: {paper['pmcid']}")
How accurate are citation counts?
Citation counts are updated regularly but may not be real-time. They include:
- Citations from Europe PMC corpus
- May not include all citation sources
- Updated periodically (not real-time)
Troubleshooting
Why am I getting no results?
- Check query syntax: Ensure proper quoting and operators
- Verify field names: Use
AUTH:notAUTHOR: - Check spelling: Try variations and synonyms
- Broaden search: Remove restrictive filters
# Debug: Check hit count
hit_count = client.get_hit_count("your query")
print(f"Total matches: {hit_count}")
Why am I getting timeout errors?
# Increase timeout
client = SearchClient(timeout=60) # Default is 30 seconds
# Check network connectivity
# Try simpler queries first
Why are some fields missing?
Not all articles have complete metadata:
- Older articles may lack abstracts
- Some journals don’t provide author ORCIDs
- Citation counts vary by article age and source
Use safe access:
title = paper.get("title", "No title available")
authors = paper.get("authorString", "No authors listed")
Getting Help
Where can I find more examples?
How do I report bugs or request features?
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions
How can I contribute to the project?
See our Contributing Guide for:
- Code contributions
- Documentation improvements
- Bug reports
- Feature suggestions
Where is the official Europe PMC documentation?
Best Practices
Query Construction
# Good: Specific, well-structured queries
query = 'TITLE:"machine learning" AND PUB_YEAR:[2020 TO 2023]'
# Avoid: Too broad or ambiguous
query = "data" # Too broad, millions of results
Resource Management
# Good: Use context managers
with SearchClient() as client:
results = client.search("query")
# Good: Implement proper error handling
try:
results = client.search("query")
except EuropePMCError as e:
logger.error(f"Search failed: {e}")
Data Processing
# Good: Process results efficiently
def extract_key_info(results):
return [
{
"title": paper.get("title"),
"year": paper.get("pubYear"),
"citations": paper.get("citedByCount", 0)
}
for paper in results["resultList"]["result"]
]