FullTextClient API Reference
The FullTextClient provides access to full-text content from Europe PMC articles in various formats (PDF, XML, HTML).
Important: Europe PMC Endpoint Architecture
Europe PMC has different endpoint architectures for different content types:
XML Full Text
- Primary Endpoint: Direct REST API (
/PMC{id}/fullTextXML) - Fallback Endpoint: Europe PMC FTP OA bulk archives (
https://europepmc.org/ftp/oa/) - Archive Format: Gzipped XML files organized by PMC ID ranges (e.g.,
PMC1000000_PMC1099999.xml.gz) - Availability Check: Via HEAD/GET request to REST endpoint
- Download:
- First tries REST API for individual articles
- Falls back to bulk FTP archives when REST API fails
- Automatically determines correct archive based on PMC ID
- Unpacks gzipped archives and extracts specific article XML
- Reliability: High - officially supported with robust fallback mechanism
PDF Full Text
- Endpoints: Multiple fallback endpoints
- Render endpoint:
https://europepmc.org/articles/PMC{id}?pdf=render - Backend service:
https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC{id}&blobtype=pdf - ZIP archive:
https://europepmc.org/pub/databases/pmc/pdf/OA/{dir}/PMC{id}.zip
- Render endpoint:
- Availability Check: Via GET request to render endpoint (checks content-type)
- Download: Tries endpoints in order until successful
- Reliability: Medium - depends on article availability and server status
HTML Full Text
- Endpoint: Web interface (
https://europepmc.org/article/MED/{medid}#free-full-text) - Availability Check: Cannot be checked via PMC ID alone (requires MED ID)
- Download: Via web scraping (not currently implemented)
- Reliability: Low - requires additional metadata and web scraping
Class Reference
from pyeuropepmc import FullTextClient
Constructor
FullTextClient(rate_limit_delay: float = 1.0)
Parameters:
rate_limit_delay: Delay between requests to respect API limits
Methods
check_fulltext_availability()
check_fulltext_availability(pmcid: str) -> Dict[str, bool]
Check availability of full text formats for a given PMC ID.
Parameters:
pmcid: PMC ID (with or without ‘PMC’ prefix)
Returns:
- Dictionary with format availability:
{"pdf": bool, "xml": bool, "html": bool}
Note: HTML always returns False as it requires MED ID for checking.
download_pdf_by_pmcid()
download_pdf_by_pmcid(pmcid: str, output_path: Optional[Path] = None) -> Optional[Path]
Download PDF using multiple fallback endpoints.
Parameters:
pmcid: PMC ID (with or without ‘PMC’ prefix)output_path: Where to save the PDF (optional)
Returns:
- Path to downloaded file if successful, None if failed
Download Strategy:
- Try render endpoint (
?pdf=render) - Try backend service (
ptpmcrender.fcgi) - Try ZIP archive from OA collection
download_xml_by_pmcid()
download_xml_by_pmcid(pmcid: str, output_path: Optional[Path] = None) -> Optional[Path]
Download XML full text with automatic fallback to bulk FTP archives.
This method first tries the REST API endpoint, and if that fails (404, 403, or network errors), it automatically falls back to bulk download from Europe PMC FTP OA archives.
Parameters:
pmcid: PMC ID (with or without ‘PMC’ prefix)output_path: Where to save the XML (optional)
Returns:
- Path to downloaded file if successful, None if failed
download_xml_by_pmcid_bulk()
download_xml_by_pmcid_bulk(pmcid: str, output_path: Optional[Path] = None) -> Optional[Path]
Download XML full text directly from Europe PMC FTP OA bulk archives only.
This method downloads from the Europe PMC FTP OA archives (.xml.gz files) without trying the REST API first. Useful when you specifically want to use the bulk download method or when the REST API is unavailable.
Parameters:
pmcid: PMC ID (with or without ‘PMC’ prefix)output_path: Where to save the XML (optional)
Returns:
- Path to downloaded file if successful, None if failed
Note: Archives are organized by PMC ID ranges (e.g., PMC1000000_PMC1099999.xml.gz). The method automatically determines the correct archive and unpacks the gzipped content.
get_fulltext_content()
get_fulltext_content(pmcid: str, format_type: str = "xml") -> str
Get full text content as string (XML only).
Parameters:
pmcid: PMC IDformat_type: Must be “xml” (HTML not supported via PMC ID)
Returns:
- Full text content as string
download_fulltext_batch()
download_fulltext_batch(
pmcids: List[str],
format_type: str = "pdf",
output_dir: Optional[Path] = None,
skip_errors: bool = True
) -> Dict[str, Optional[Path]]
Batch download multiple articles.
Parameters:
pmcids: List of PMC IDsformat_type: “pdf” or “xml”output_dir: Directory to save filesskip_errors: Continue on individual failures
Returns:
- Dictionary mapping PMC IDs to file paths (or None if failed)
Error Handling
The client raises FullTextError for:
- Invalid PMC IDs
- Network failures
- Access restrictions
- File system errors
Usage Examples
Basic Usage
from pyeuropepmc import FullTextClient
with FullTextClient() as client:
# Check availability
availability = client.check_fulltext_availability("3312970")
print(availability) # {"pdf": True, "xml": True, "html": False}
# Download PDF (tries multiple endpoints)
pdf_path = client.download_pdf_by_pmcid("3312970", "article.pdf")
# Download XML (via REST API)
xml_path = client.download_xml_by_pmcid("3312970", "article.xml")
Error Handling Fulltext
from pyeuropepmc import FullTextClient, FullTextError
with FullTextClient() as client:
try:
pdf_path = client.download_pdf_by_pmcid("123456")
if pdf_path:
print(f"Downloaded: {pdf_path}")
else:
print("PDF not available via any endpoint")
except FullTextError as e:
print(f"Error: {e}")
Batch Downloads
from pathlib import Path
pmcids = ["3312970", "4123456", "5789012"]
with FullTextClient() as client:
results = client.download_fulltext_batch(
pmcids,
format_type="pdf",
output_dir=Path("downloads"),
skip_errors=True
)
for pmcid, path in results.items():
if path:
print(f"✓ {pmcid}: {path}")
else:
print(f"✗ {pmcid}: Download failed")
Limitations and Considerations
- PDF Availability: Not all articles have PDFs available through any endpoint
- HTML Content: Requires MED ID and web scraping (not implemented)
- Rate Limiting: Respect API limits with appropriate delays
- Error Tolerance: Always handle potential failures gracefully
- Network Timeouts: Large files may require longer timeout settings
Best Practices
- Always use context managers (
withstatement) for proper cleanup - Handle errors gracefully - not all content is available
- Use batch operations for multiple downloads
- Check availability first for better user feedback
- Respect rate limits to avoid being blocked