XML Parsing Features

The FullTextXMLParser provides comprehensive capabilities for extracting structured data from Europe PMC full-text XML documents.

Quick Start

from pyeuropepmc import FullTextXMLParser

# Load XML content
with open("article.xml") as f:
    xml_content = f.read()

parser = FullTextXMLParser(xml_content)

# Extract metadata
metadata = parser.extract_metadata()
print(f"Title: {metadata['title']}")
print(f"Authors: {metadata['authors']}")

# Extract tables
tables = parser.extract_tables()
print(f"Found {len(tables)} tables")

# Convert to markdown
markdown = parser.to_markdown()

Metadata Extraction

Extract comprehensive article metadata:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)
metadata = parser.extract_metadata()

# Available metadata fields
title = metadata['title']                    # Article title
authors = metadata['authors']                # List of author dicts
journal = metadata['journal']                # Journal info dict (title, volume, issue)
pub_date = metadata['publication_date']      # Publication date
doi = metadata['doi']                        # Digital Object Identifier
pmid = metadata['pmid']                      # PubMed ID
pmcid = metadata['pmcid']                    # PMC ID
abstract = metadata['abstract']              # Abstract text
keywords = metadata['keywords']              # List of keywords
affiliations = metadata['affiliations']      # Author affiliations

# Author information
for author in authors:
    print(f"{author['given_names']} {author['surname']}")
    print(f"  Affiliation: {author.get('affiliation', 'N/A')}")
    print(f"  Email: {author.get('email', 'N/A')}")

Metadata Fields Reference

| Field | Type | Description |
| --- | --- | --- |
| title | str | Article title |
| authors | list[dict] | Author information |
| journal | dict | Journal info with 'title', 'volume', 'issue' keys |
| publication_date | str | Publication date (ISO format) |
| doi | str | DOI identifier |
| pmid | str | PubMed ID |
| pmcid | str | PMC ID |
| abstract | str | Abstract text |
| keywords | list[str] | Article keywords |
| affiliations | list[str] | Author affiliations |
| article_type | str | Article type |
| pages | str | Page range |
| copyright | str | Copyright statement |
| license | str | License information |
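Not every article populates every field, so defensive access with dict.get() avoids KeyError surprises. A minimal sketch (format_citation_line is an illustrative helper, not part of the library; it assumes extract_metadata() returns a plain dict shaped like the table above):

```python
def format_citation_line(metadata: dict) -> str:
    """Build a one-line article summary, tolerating missing fields."""
    title = metadata.get('title', 'Untitled')
    journal = (metadata.get('journal') or {}).get('title', 'Unknown journal')
    date = metadata.get('publication_date', 'n.d.')
    doi = metadata.get('doi')
    line = f"{title} ({journal}, {date})"
    if doi:
        line += f" doi:{doi}"
    return line

example = {
    'title': 'Machine Learning in Genomics',
    'journal': {'title': 'Nature', 'volume': '123', 'issue': '4'},
    'publication_date': '2020-05-01',
    'doi': '10.1038/example',
}
print(format_citation_line(example))
# Machine Learning in Genomics (Nature, 2020-05-01) doi:10.1038/example
```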

Table Extraction

Extract structured table data:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)
tables = parser.extract_tables()

for i, table in enumerate(tables):
    print(f"\nTable {i+1}")
    print(f"Caption: {table['caption']}")
    print(f"Label: {table['label']}")

    # Table headers
    headers = table['headers']
    print(f"Headers: {headers}")

    # Table data
    for row in table['data']:
        print(row)

Table Structure

Each table is returned as a dictionary:

{
    'label': 'Table 1',
    'caption': 'Patient demographics and baseline characteristics',
    'headers': ['Parameter', 'Group A', 'Group B', 'P-value'],
    'data': [
        ['Age (years)', '45.3 ± 12.1', '43.8 ± 11.5', '0.23'],
        ['Gender (M/F)', '12/8', '14/6', '0.51'],
        # ...
    ]
}
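Because each table is a plain dict, it is easy to re-render elsewhere. A hypothetical helper (table_to_markdown is illustrative, not a library function) that turns the structure above into a Markdown pipe table:

```python
def table_to_markdown(table: dict) -> str:
    """Render a table dict (headers + rows of strings) as a Markdown table."""
    headers = table['headers']
    lines = [
        '| ' + ' | '.join(headers) + ' |',
        '|' + '|'.join('---' for _ in headers) + '|',
    ]
    for row in table['data']:
        lines.append('| ' + ' | '.join(row) + ' |')
    return '\n'.join(lines)

table = {
    'label': 'Table 1',
    'caption': 'Demo',
    'headers': ['Parameter', 'Group A'],
    'data': [['Age (years)', '45.3']],
}
print(table_to_markdown(table))
```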

Working with Tables

# Convert table to pandas DataFrame
import pandas as pd

for table in tables:
    df = pd.DataFrame(table['data'], columns=table['headers'])
    print(f"\n{table['caption']}")
    print(df)

    # Save to CSV (labels like "Table 1" contain spaces, so sanitize them)
    df.to_csv(f"table_{table['label'].replace(' ', '_')}.csv", index=False)

Reference Extraction

Extract bibliography and citations:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)
references = parser.extract_references()

for i, ref in enumerate(references, 1):
    print(f"\n[{i}] {ref['title']}")
    print(f"    Authors: {ref['authors']}")
    print(f"    Journal: {ref['journal']}")
    print(f"    Year: {ref['year']}")
    print(f"    DOI: {ref.get('doi', 'N/A')}")
    print(f"    PMID: {ref.get('pmid', 'N/A')}")

Reference Structure

{
    'id': 'ref1',
    'title': 'Original research title',
    'authors': 'Smith J, Jones M, Brown L',
    'journal': 'Nature',
    'year': '2020',
    'volume': '123',
    'pages': '45-52',
    'doi': '10.1038/...',
    'pmid': '12345678'
}
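With that structure, formatting a reference list is straightforward. A small sketch (format_reference is an illustrative helper; field names follow the dict above, and missing keys are tolerated):

```python
def format_reference(ref: dict) -> str:
    """Format a reference dict as a human-readable citation string."""
    parts = [f"{ref.get('authors', 'Anonymous')} ({ref.get('year', 'n.d.')})."]
    if ref.get('title'):
        parts.append(f"{ref['title']}.")
    if ref.get('journal'):
        journal = ref['journal']
        if ref.get('volume'):
            journal += f" {ref['volume']}"
        if ref.get('pages'):
            journal += f":{ref['pages']}"
        parts.append(journal + '.')
    return ' '.join(parts)

ref = {'authors': 'Smith J, Jones M', 'year': '2020',
       'title': 'Original research title', 'journal': 'Nature',
       'volume': '123', 'pages': '45-52'}
print(format_reference(ref))
# Smith J, Jones M (2020). Original research title. Nature 123:45-52.
```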

Format Conversion

Convert to Plaintext

Extract clean plaintext from XML:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)

# Get full plaintext
plaintext = parser.to_plaintext()

# Save to file
with open("article.txt", "w", encoding="utf-8") as f:
    f.write(plaintext)

Convert to Markdown

Generate formatted Markdown:

parser = FullTextXMLParser(xml_content)

# Generate markdown
markdown = parser.to_markdown()

# Save to file
with open("article.md", "w", encoding="utf-8") as f:
    f.write(markdown)

The generated Markdown includes the article title, author and affiliation lists, section headings at the appropriate levels, pipe-formatted tables, and a numbered reference list:

Example Markdown Output

# Machine Learning in Genomics: A Review

**Authors:** John Smith¹, Jane Doe², Mary Johnson¹

¹ Department of Computer Science, Stanford University
² Department of Biology, MIT

## Abstract

Machine learning has revolutionized genomics research...

## Introduction

The application of computational methods to biological data...

### Background

Recent advances in sequencing technology...

## Methods

### Data Collection

We collected genomic data from...

| Sample ID | Tissue Type | Read Count |
|-----------|-------------|------------|
| S001      | Brain       | 1,234,567  |
| S002      | Liver       | 987,654    |

## References

1. Smith et al. (2020). "Previous work." *Nature*, 123:45-52.

Section Extraction

Extract specific sections from the article:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)

# Extract introduction
intro = parser.extract_section('introduction')

# Extract methods
methods = parser.extract_section('methods')

# Extract all sections
sections = parser.extract_all_sections()

for section in sections:
    print(f"\n{section['title']}")
    print(f"{section['content'][:200]}...")  # First 200 chars
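Section titles vary between journals ('Methods', 'Materials and Methods', and so on), so matching is best done case-insensitively against substrings. A sketch of such a filter over the list returned by extract_all_sections() (find_sections is an illustrative helper, not part of the library):

```python
def find_sections(sections, keyword):
    """Return sections whose title contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [s for s in sections if keyword in s.get('title', '').lower()]

sections = [
    {'title': 'Introduction', 'content': '...'},
    {'title': 'Materials and Methods', 'content': '...'},
    {'title': 'Results', 'content': '...'},
]
hits = find_sections(sections, 'methods')
print([s['title'] for s in hits])  # ['Materials and Methods']
```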

Schema Coverage Validation

Analyze how well the parser recognizes XML elements:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)

# Validate schema coverage
coverage = parser.validate_schema_coverage()

print(f"Overall coverage: {coverage['coverage_percentage']:.1f}%")
print(f"Recognized elements: {coverage['recognized_count']}")
print(f"Unrecognized elements: {coverage['unrecognized_count']}")

# See unrecognized elements
if coverage['unrecognized_elements']:
    print("\nUnrecognized elements:")
    for elem, count in coverage['unrecognized_elements'].items():
        print(f"  {elem}: {count} occurrences")

Coverage Report Structure

{
    'coverage_percentage': 85.5,
    'recognized_count': 342,
    'unrecognized_count': 58,
    'total_elements': 400,
    'recognized_elements': {
        'article-title': 1,
        'contrib': 5,
        'p': 45,
        'table': 3,
        # ...
    },
    'unrecognized_elements': {
        'custom-meta': 12,
        'inline-formula': 8,
        # ...
    }
}
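The counts in the report are internally consistent (recognized plus unrecognized equals the total), which makes a quick sanity check cheap. A sketch assuming the dict layout shown above (check_coverage_report is an illustrative helper):

```python
def check_coverage_report(report: dict) -> float:
    """Verify the counts add up and recompute the coverage percentage."""
    total = report['recognized_count'] + report['unrecognized_count']
    assert total == report['total_elements'], "counts do not add up"
    return round(100.0 * report['recognized_count'] / total, 1)

report = {'coverage_percentage': 85.5, 'recognized_count': 342,
          'unrecognized_count': 58, 'total_elements': 400}
print(check_coverage_report(report))  # 85.5
```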

Custom Element Patterns

Customize element recognition for specialized XML schemas:

from pyeuropepmc import FullTextXMLParser, ElementPatterns

# Create custom element patterns
custom_patterns = ElementPatterns(
    title_paths=['./front/article-meta/title-group/article-title'],
    author_paths=['./front/article-meta/contrib-group/contrib'],
    abstract_paths=['./front/article-meta/abstract'],
    # Add custom patterns for specialized elements
    custom_patterns={
        'supplementary': './back/app-group/app',
        'data_availability': './back/sec[@sec-type="data-availability"]'
    }
)

# Use custom patterns
parser = FullTextXMLParser(xml_content, element_patterns=custom_patterns)
metadata = parser.extract_metadata()

Available Pattern Groups

ElementPatterns(
    # Metadata patterns
    title_paths=[...],
    author_paths=[...],
    abstract_paths=[...],
    keywords_paths=[...],

    # Content patterns
    body_paths=[...],
    section_paths=[...],
    paragraph_paths=[...],

    # Table patterns
    table_paths=[...],
    table_caption_paths=[...],

    # Reference patterns
    ref_list_paths=[...],
    ref_paths=[...],

    # Custom patterns (dict)
    custom_patterns={}
)

Advanced Examples

Example 1: Extract All Data

Complete extraction workflow:

from pyeuropepmc import FullTextXMLParser

parser = FullTextXMLParser(xml_content)

# Extract everything
data = {
    'metadata': parser.extract_metadata(),
    'tables': parser.extract_tables(),
    'references': parser.extract_references(),
    'plaintext': parser.to_plaintext(),
    'coverage': parser.validate_schema_coverage()
}

# Save to JSON
import json
with open("article_data.json", "w") as f:
    json.dump(data, f, indent=2)

Example 2: Batch Processing

Process multiple XML files:

import os
from pyeuropepmc import FullTextXMLParser

xml_dir = "./xml_files"
output_dir = "./processed"
os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        # Read XML
        with open(os.path.join(xml_dir, filename)) as f:
            xml_content = f.read()

        # Parse
        parser = FullTextXMLParser(xml_content)

        # Extract and save metadata
        metadata = parser.extract_metadata()
        pmcid = metadata.get('pmcid', filename.replace('.xml', ''))

        # Save markdown
        markdown = parser.to_markdown()
        with open(f"{output_dir}/{pmcid}.md", "w", encoding="utf-8") as f:
            f.write(markdown)

        print(f"Processed: {pmcid}")

Example 3: Table Export to Excel

Export all tables to Excel:

from pyeuropepmc import FullTextXMLParser
import pandas as pd

parser = FullTextXMLParser(xml_content)
tables = parser.extract_tables()

# Create Excel writer
with pd.ExcelWriter('article_tables.xlsx') as writer:
    for i, table in enumerate(tables):
        # Convert to DataFrame
        df = pd.DataFrame(table['data'], columns=table['headers'])

        # Write to Excel sheet
        sheet_name = f"Table_{i+1}"
        df.to_excel(writer, sheet_name=sheet_name, index=False)

        # Add caption as a header row (requires the openpyxl engine)
        worksheet = writer.sheets[sheet_name]
        worksheet.insert_rows(1)  # openpyxl rows are 1-indexed; insert above the header
        worksheet['A1'] = table['caption']

Example 4: Citation Network Analysis

Build citation network from references:

from pyeuropepmc import FullTextXMLParser
import networkx as nx

# Parse multiple papers (xml_files: an iterable of paths to full-text XML files)
papers_data = []
for xml_file in xml_files:
    with open(xml_file) as f:
        parser = FullTextXMLParser(f.read())
        papers_data.append({
            'metadata': parser.extract_metadata(),
            'references': parser.extract_references()
        })

# Build citation graph
G = nx.DiGraph()

for paper in papers_data:
    pmid = paper['metadata']['pmid']
    G.add_node(pmid, title=paper['metadata']['title'])

    for ref in paper['references']:
        if ref.get('pmid'):
            G.add_edge(pmid, ref['pmid'])

# Analyze
print(f"Papers: {G.number_of_nodes()}")
print(f"Citations: {G.number_of_edges()}")

# Find most cited papers
in_degree = dict(G.in_degree())
most_cited = sorted(in_degree.items(), key=lambda x: x[1], reverse=True)[:10]

Example 5: Schema Coverage Analysis

Analyze parser coverage across multiple files:

from pyeuropepmc import FullTextXMLParser
from collections import defaultdict

all_unrecognized = defaultdict(int)

for xml_file in xml_files:
    with open(xml_file) as f:
        parser = FullTextXMLParser(f.read())
        coverage = parser.validate_schema_coverage()

        # Aggregate unrecognized elements
        for elem, count in coverage['unrecognized_elements'].items():
            all_unrecognized[elem] += count

# Report most common unrecognized elements
print("Most common unrecognized elements across all files:")
sorted_unrecognized = sorted(all_unrecognized.items(), key=lambda x: x[1], reverse=True)
for elem, count in sorted_unrecognized[:20]:
    print(f"  {elem}: {count} occurrences")

Performance Tips

1. Reuse Parser Instances

# ❌ Creating new parser for each operation
xml_content = load_xml()
parser1 = FullTextXMLParser(xml_content)
metadata = parser1.extract_metadata()
parser2 = FullTextXMLParser(xml_content)
tables = parser2.extract_tables()

# ✅ Reuse parser instance
parser = FullTextXMLParser(xml_content)
metadata = parser.extract_metadata()
tables = parser.extract_tables()
references = parser.extract_references()

2. Extract Only Needed Data

# ❌ Extract everything if you only need metadata
parser = FullTextXMLParser(xml_content)
metadata = parser.extract_metadata()
tables = parser.extract_tables()  # Unnecessary
references = parser.extract_references()  # Unnecessary

# ✅ Extract only what you need
parser = FullTextXMLParser(xml_content)
metadata = parser.extract_metadata()

3. Use Schema Validation Wisely

# Schema validation is computationally expensive
# Only run when needed (e.g., during development)

if debug_mode:
    coverage = parser.validate_schema_coverage()
    if coverage['coverage_percentage'] < 80:
        print("Warning: Low coverage")

Error Handling

from pyeuropepmc import FullTextXMLParser
from xml.etree.ElementTree import ParseError

try:
    parser = FullTextXMLParser(xml_content)
    metadata = parser.extract_metadata()

    if not metadata.get('title'):
        print("Warning: No title found")

except ParseError as e:
    print(f"Invalid XML: {e}")

except Exception as e:
    print(f"Parser error: {e}")
