# PyEuropePMC Features

**✨ Explore what PyEuropePMC can do** - Comprehensive feature overview and workflows

[🔍 Search](search/) • [📄 Full-Text](fulltext/) • [🔬 Parsing](parsing/) • [⬅️ Back to Docs](/pyEuropePMC/)
## 🚀 Core Features
### Search

Query the Europe PMC database with powerful search capabilities:
- Advanced query syntax support
- Boolean operators (AND, OR, NOT)
- Field-specific searches
- Date range filtering
- Citation count sorting
- Pagination for large result sets
- Multiple output formats (JSON, XML, Dublin Core)
**Quick Example:**

```python
from pyeuropepmc import SearchClient

with SearchClient() as client:
    results = client.search("cancer AND therapy", pageSize=50, sort="CITED desc")
```
### Full-Text Retrieval

Download complete article content in multiple formats:
- PDF downloads from open access articles
- XML full-text retrieval
- HTML content access
- Bulk FTP downloads for large datasets
- Progress tracking with callbacks
- Automatic retry and error handling
**Quick Example:**

```python
from pyeuropepmc import FullTextClient

with FullTextClient() as client:
    pdf_path = client.download_pdf_by_pmcid("PMC1234567")
    xml_content = client.download_xml_by_pmcid("PMC1234567")
```
### XML Parsing

Extract structured data from full-text XML documents:
- **Metadata extraction** - Title, authors, journal, dates, DOI, keywords
- **Table extraction** - Extract tables with headers, captions, and data
- **Reference extraction** - Bibliography with complete citations
- **Format conversion** - Convert to plaintext or Markdown
- **Section extraction** - Get structured body sections
- **Schema coverage validation** - Analyze XML element recognition
- **Flexible configuration** - Customize element patterns
**Quick Example:**

```python
from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)

# Extract metadata
metadata = parser.extract_metadata()

# Extract tables
tables = parser.extract_tables()

# Convert to Markdown
markdown = parser.to_markdown()

# Validate schema coverage
coverage = parser.validate_schema_coverage()
print(f"Coverage: {coverage['coverage_percentage']:.1f}%")
```
## 🔧 Query Builder

Advanced fluent API for building complex search queries with type safety:
- Type-safe field specifications (150+ searchable fields)
- Fluent method chaining with boolean logic (AND/OR/NOT)
- Citation count and date range filtering
- Query validation using CoLRev search-query package
- Cross-platform query translation (PubMed, Web of Science, etc.)
- Load/save queries in standard JSON format
- Query evaluation with recall/precision metrics
- Systematic review integration with PRISMA compliance
**Quick Example:**

```python
from pyeuropepmc import QueryBuilder

qb = QueryBuilder()
query = (
    qb.keyword("cancer", field="title")
    .and_()
    .citation_count(min_count=50)
    .and_()
    .date_range(start_year=2020)
    .build()
)
# Result: "(TITLE:cancer) AND (CITED:[50 TO *]) AND (PUB_YEAR:[2020 TO *])"
```
## 📋 Systematic Review Tracking

PRISMA/Cochrane-compliant search logging and audit trails:
- Complete systematic review workflow support
- Search log integration with the `log_to_search()` method
- Raw results saving for reproducibility
- PRISMA flow diagram data generation
- Audit trails for research transparency
**Quick Example:**

```python
from pyeuropepmc import QueryBuilder
from pyeuropepmc.utils.search_logging import start_search

log = start_search("Cancer Review", executed_by="Researcher")
qb = QueryBuilder().keyword("cancer").and_().field("open_access", True)
qb.log_to_search(log, filters={"open_access": True}, results_returned=100)
```
## 📊 Feature Comparison
| Feature | SearchClient | FullTextClient | FullTextXMLParser | FTPDownloader | QueryBuilder |
|---|---|---|---|---|---|
| Search Europe PMC | ✅ | - | - | - | ✅ |
| Build Complex Queries | - | - | - | - | ✅ |
| Type-Safe Fields | - | - | - | - | ✅ |
| Query Validation | - | - | - | - | ✅ |
| Query Translation | - | - | - | - | ✅ |
| Download PDFs | - | ✅ | - | ✅ | - |
| Download XML | - | ✅ | - | - | - |
| Parse XML | - | - | ✅ | - | - |
| Extract Metadata | - | - | ✅ | - | - |
| Extract Tables | - | - | ✅ | - | - |
| Bulk Downloads | - | - | - | ✅ | - |
| Systematic Review Logging | - | - | - | - | ✅ |
| Caching | ✅ | ✅ | - | - | - |
| Progress Tracking | - | ✅ | - | ✅ | - |
## 🔄 Common Workflows
### Workflow 1: Advanced Query → Search → Parse

```python
from pyeuropepmc import QueryBuilder, SearchClient, FullTextXMLParser

# Step 1: Build a complex query with QueryBuilder
qb = QueryBuilder()
query = (
    qb.keyword("machine learning", field="title")
    .and_()
    .citation_count(min_count=25)
    .and_()
    .date_range(start_year=2020)
    .build()
)

# Step 2: Search with the query
with SearchClient() as client:
    results = client.search(query, pageSize=20, sort="CITED desc")

    # Step 3: Process results
    for paper in results['resultList']['result']:
        if paper.get('pmcid'):
            # Download and parse the full-text XML
            xml_content = client.get_fulltext_xml(paper['pmcid'])
            parser = FullTextXMLParser(xml_content)
            metadata = parser.extract_metadata()
            print(f"High-impact paper: {metadata['title']}")
```
### Workflow 2: Systematic Review with Audit Trail

```python
from pyeuropepmc import QueryBuilder, SearchClient
from pyeuropepmc.utils.search_logging import start_search

# Start the systematic review log
log = start_search("ML in Biology Review", executed_by="Researcher Name")

# Build a comprehensive search strategy
qb = QueryBuilder()
comprehensive_query = (
    qb.keyword("machine learning")
    .and_()
    .keyword("biology")
    .and_()
    .field("open_access", True)
    .and_()
    .date_range(start_year=2019)
    .build()
)

# Execute and log the search
with SearchClient() as client:
    results = client.search(comprehensive_query, pageSize=100)

    # Log for systematic review compliance
    qb.log_to_search(
        search_log=log,
        filters={"open_access": True, "date_range": "2019+"},
        results_returned=len(results['resultList']['result']),
        notes="Comprehensive ML in biology search",
    )

# Save the review log
log.save("systematic_review_log.json")
```
### Workflow 3: Advanced Search → Filter → Extract

```python
from pyeuropepmc import SearchClient, FullTextXMLParser

with SearchClient() as client:
    # Advanced search with filters
    results = client.search(
        query="cancer AND (therapy OR treatment)",
        sort="CITED desc",
        pageSize=100,
        resultType="core",
    )

    # Filter for high-impact papers
    high_impact = [
        paper for paper in results['resultList']['result']
        if paper.get('citedByCount', 0) > 50 and paper.get('pmcid')
    ]

    # Extract detailed information
    for paper in high_impact:
        xml_content = client.get_fulltext_xml(paper['pmcid'])
        parser = FullTextXMLParser(xml_content)
        # Analyze...
```
## 📊 Feature Matrix

### Search Features

| Capability | Supported | Notes |
|---|---|---|
| Keyword search | ✅ | Full-text search across all fields |
| Boolean operators | ✅ | AND, OR, NOT |
| Field-specific | ✅ | Search specific fields (author, title, etc.) |
| Date filtering | ✅ | Publication date ranges |
| Citation sorting | ✅ | Sort by citation count |
| Pagination | ✅ | Handle large result sets |
| Multiple formats | ✅ | JSON, XML, Dublin Core |
### Full-Text Features

| Capability | Supported | Notes |
|---|---|---|
| PDF download | ✅ | Open access articles only |
| XML download | ✅ | JATS/NLM XML format |
| HTML content | ✅ | HTML representation |
| Bulk FTP | ✅ | Efficient for large datasets |
| Progress tracking | ✅ | Real-time progress callbacks |
| Auto-retry | ✅ | Robust error handling |
### Parsing Features

| Capability | Supported | Notes |
|---|---|---|
| Metadata extraction | ✅ | Title, authors, journal, dates, etc. |
| Table extraction | ✅ | Structured table data |
| Reference extraction | ✅ | Complete bibliography |
| Plaintext conversion | ✅ | Full article text |
| Markdown conversion | ✅ | Formatted markdown |
| Schema validation | ✅ | Coverage analysis |
| Custom patterns | ✅ | Flexible configuration |
| Multiple XML schemas | ✅ | JATS, NLM, custom |
## 📚 Learning Resources

### By Feature

- Search → Search Documentation
- Full-Text → Full-Text Documentation
- Parsing → Parsing Documentation
- Caching → Caching Documentation

### By Use Case

- Literature Review → Examples: Literature Review
- Data Mining → Examples: Text Mining
- Meta-Analysis → Examples: Meta-Analysis

### By Skill Level

- Beginner → Getting Started
- Intermediate → Examples
- Advanced → Advanced Guide
## 💡 Best Practices

### Performance
- Use caching for repeated queries
- Implement bulk operations for large datasets
- Set appropriate page sizes (50-100 for most cases)
- Use FTP downloads for bulk PDF retrieval
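The bulk-operations advice above can be sketched with a small batching helper. This is generic Python, not part of the pyeuropepmc API; `batched` is a hypothetical utility for chunking, say, a long list of PMC IDs:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int = 100) -> Iterator[Sequence[T]]:
    """Yield successive chunks of `items` (e.g. PMC IDs to process in bulk)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Processing IDs in chunks of 50-100 keeps individual requests small while avoiding one API round-trip per record.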
### Error Handling

- Always use context managers (`with` statements)
- Implement retry logic for network operations
- Check for PMC ID availability before downloads
- Validate XML before parsing
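`FullTextClient` already retries internally (see the auto-retry feature above); for retry logic around your own surrounding code, a minimal wrapper with exponential backoff might look like this. `with_retries` is a generic sketch, not a pyeuropepmc API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `func`, retrying with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

Usage: `with_retries(lambda: client.download_pdf_by_pmcid("PMC1234567"))`. In production you would catch specific network exceptions rather than bare `Exception`.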
### Rate Limiting
- Respect Europe PMC API rate limits
- Use delays between bulk operations
- Cache results to minimize API calls
- Consider FTP for large-scale downloads
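One way to add delays between bulk operations is a small throttle that enforces a minimum interval between successive calls. This is a generic sketch, not part of pyeuropepmc; check Europe PMC's current rate-limit guidance for an appropriate interval:

```python
import time


class Throttle:
    """Enforce a minimum interval between successive API calls."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        """Sleep just long enough to honour the interval, then record the call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `throttle.wait()` immediately before each `client.search(...)` or download inside a loop.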
## 🚀 What's Next?

Explore each feature in detail:

- Search Features - Master the search API
- Full-Text Retrieval - Download article content
- XML Parsing - Extract structured data
- Caching - Optimize performance

Or jump to:

- 📖 API Reference for complete API documentation
- 🎯 Examples for working code
- ⚙️ Advanced Guide for power user features
## 🔗 Related Sections

| Section | Why Visit? |
|---|---|
| 🚀 Getting Started | Installation and basics |
| 📖 API Reference | Complete method documentation |
| 🎯 Examples | Working code samples |
| ⚙️ Advanced | Power user features |

**[⬆️ Back to Top](#pyeuropepmc-features)** • [⬅️ Back to Main Docs](/pyEuropePMC/)