natudump

Natudump

A tool for scraping naturalisation decrees from the public LegiFrance registry.

Features

Scrapes naturalisation decrees, for research purposes only (naturalisation by marriage, "naturalisation par mariage", is excluded)
Supports scraping over multiple years (2000-2021)
Outputs to a specified directory
Handles PDF files using pdfminer.six
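
The PDF handling step can be sketched roughly as below. This is a minimal illustration that assumes the high-level extract_text function of pdfminer.six; the helper names txt_path_for and pdf_to_txt, and the choice to write the text file next to the PDF, are hypothetical and not taken from natudump.py.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the .txt path next to the given PDF (hypothetical naming scheme)."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Extract the text of one PDF and write it to a sibling .txt file."""
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = txt_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling pdf_to_txt on each downloaded PDF would yield one text file per decree document, which matches the *.txt output described in this README.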

Output Formats

Text files (*.txt) containing extracted text from PDFs
TSV file (natufrance_2000_2021.tsv) for further processing and analysis
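
For further processing, the TSV file can be loaded with Python's standard csv module. The sketch below is only an illustration: the column names of natufrance_2000_2021.tsv are not documented here, so the sample data uses placeholder columns, and read_tsv is a hypothetical helper.

```python
import csv
import io


def read_tsv(handle):
    """Return the header row and the data rows of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty list if the file has no rows
    return header, list(reader)


# Placeholder data standing in for the real file; column names are made up.
sample = io.StringIO("col_a\tcol_b\n1\tx\n2\ty\n")
header, rows = read_tsv(sample)
print(header, len(rows))
```

To read the real file, pass open("natufrance_2000_2021.tsv", newline="", encoding="utf-8") instead of the sample; the UTF-8 encoding is an assumption.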

Dependencies

Selenium for web scraping
Charset Normalizer for handling non-ASCII characters in PDFs
pdfminer.six for extracting text from PDFs
Tabulate for generating TSV files
Pdfrw for concatenating and rewriting PDF files

Usage

Install required dependencies using pip install selenium charset_normalizer pdfminer.six tabulate pdfrw
Run the tool using python3 natudump.py with optional arguments:
- -o specifies output directory
- --years specifies the years to scrape as a space-separated list (default: 2000-2021)
- --output-directory-prefix sets prefix for output directory
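
The options above could be declared with argparse along the following lines. This is a sketch under assumptions, not the parser used by natudump.py: the dest names, the integer year type, and the empty default prefix are all guesses made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the three documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="natudump.py")
    parser.add_argument("-o", dest="output_dir", help="output directory")
    parser.add_argument(
        "--years",
        nargs="+",  # accepts a space-separated list such as $(seq 2000 2021)
        type=int,
        default=list(range(2000, 2022)),  # 2000 through 2021 inclusive
        help="years to scrape",
    )
    parser.add_argument(
        "--output-directory-prefix",
        default="",  # assumed default; the README does not state one
        help="prefix for the output directory",
    )
    return parser


args = build_parser().parse_args(["-o", "jo", "--years", "2000", "2001"])
print(args.output_dir, args.years)
```

Using nargs="+" is what lets a shell expansion like $(seq 2000 2021) be passed directly after --years.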

Examples

python3 natudump.py -o jo --years $(seq 2000 2021) --output-directory-prefix "$PWD/"
mkdir -p jo; python3 natudump.py -o jo --years $(seq 2000 2021) --output-directory-prefix "$PWD/"

Output

The tool generates the following outputs:

Text files (*.txt) containing extracted text from PDFs
TSV file (natufrance_2000_2021.tsv) for further processing and analysis
Concatenated PDF files in catjo directory
Tarballs of scraped files in tarjo directory
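
As an illustration of the tarball output, the sketch below packs one directory of scraped files into a gzip-compressed archive using the standard tarfile module. The per-year directory layout, the .tar.gz naming, and the make_tarball helper are assumptions; the tool's actual archive layout may differ.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(source_dir, tar_dir):
    """Pack source_dir into tar_dir/<name>.tar.gz and return the archive path."""
    source_dir = Path(source_dir)
    tar_dir = Path(tar_dir)
    tar_dir.mkdir(parents=True, exist_ok=True)
    archive = tar_dir / f"{source_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps archive paths relative to the directory name.
        tar.add(source_dir, arcname=source_dir.name)
    return archive


# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    year_dir = Path(tmp) / "2000"
    year_dir.mkdir()
    (year_dir / "decree.txt").write_text("placeholder", encoding="utf-8")
    archive = make_tarball(year_dir, Path(tmp) / "tarjo")
    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())
print(names)
```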

Tips

For WSL systems, ensure the tool is run on an NTFS drive.
Handle errors using try-except blocks or logging mechanisms.
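
One way to combine try-except blocks with logging is to wrap the work for each year so that a single failure is recorded and the run continues. The sketch below is illustrative only: process_years and fake_scrape are hypothetical names, and fake_scrape merely simulates a failure rather than performing any scraping.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("natudump-example")


def process_years(years, scrape_year):
    """Run scrape_year for each year, logging failures instead of aborting."""
    failed = []
    for year in years:
        try:
            scrape_year(year)
        except Exception:
            # log.exception records the traceback alongside the message.
            log.exception("scraping %s failed", year)
            failed.append(year)
        else:
            log.info("scraped %s", year)
    return failed


def fake_scrape(year):
    # Stand-in for real scraping work; fails for one year to show the handling.
    if year == 2001:
        raise RuntimeError("simulated network error")


failed_years = process_years([2000, 2001, 2002], fake_scrape)
print(failed_years)
```

Returning the list of failed years makes it easy to retry only those years in a later run.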
