Natudump

A tool for scraping naturalisation decrees from the public LegiFrance registry.


Features

  • Scrapes naturalisation decrees, for research purposes only (naturalisation by marriage is excluded)
  • Supports scraping over multiple years (2000-2021)
  • Outputs to a specified directory
  • Handles PDF files using pdfminer.six
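
As a rough illustration of the PDF handling mentioned above, the sketch below converts one PDF to a text file with pdfminer.six. The function names (txt_path_for, pdf_to_text) and the choice to write the .txt file next to the PDF are assumptions made for this example, not necessarily how natudump.py does it.

```python
from pathlib import Path


def txt_path_for(pdf_path: Path) -> Path:
    """Return the .txt path that sits next to the given PDF (hypothetical layout)."""
    return pdf_path.with_suffix(".txt")


def pdf_to_text(pdf_path: Path) -> Path:
    """Extract the text of one PDF and write it to a sibling .txt file."""
    # Imported lazily so this module still loads when pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = txt_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")  # UTF-8 output is an assumption
    return out_path
```

Calling pdf_to_text(Path("jo/example.pdf")) would write jo/example.txt, where jo/example.pdf stands for any downloaded file.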

Dependencies

  • selenium
  • charset_normalizer
  • pdfminer.six

Usage

  1. Install the required dependencies: pip install selenium charset_normalizer
  2. Run the tool with python3 natudump.py and the following optional arguments:
    • -o specifies the output directory
    • --years specifies the years to scrape, as a space-separated list (default: 2000-2021)
    • --output-directory-prefix sets a prefix for the output directory
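
The options above could be declared roughly as follows with argparse. This is a sketch only: the destination names and the defaults for -o and --output-directory-prefix are assumptions, and the actual parser in natudump.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="natudump.py")
    # Output directory; the default of None is an assumption.
    parser.add_argument("-o", dest="output_directory", default=None,
                        help="output directory")
    # One or more years, e.g. --years 2000 2001 or --years $(seq 2000 2021).
    parser.add_argument("--years", nargs="+", type=int,
                        default=list(range(2000, 2022)),
                        help="years to scrape (default: 2000-2021)")
    # Prefix for the output directory; the empty default is an assumption.
    parser.add_argument("--output-directory-prefix", default="",
                        help="prefix for the output directory")
    return parser


args = build_parser().parse_args(["-o", "jo", "--years", "2000", "2001"])
print(args.output_directory, args.years)
```

Using nargs="+" is what lets a shell expansion such as $(seq 2000 2021) be passed directly as the list of years.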

Examples

  • Scrape 2000-2021 into the jo directory: python3 natudump.py -o jo --years $(seq 2000 2021) --output-directory-prefix "$PWD/"
  • Same, creating the output directory first: mkdir -p jo; python3 natudump.py -o jo --years $(seq 2000 2021) --output-directory-prefix "$PWD/"

Output

The tool generates the following outputs:

  • Text files (*.txt) containing extracted text from PDFs
  • TSV file (natufrance_2000_2021.tsv) for further processing and analysis
  • Concatenated PDF files in catjo directory
  • Tarballs of scraped files in tarjo directory
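
For further processing, the TSV file can be read with Python's csv module. The sketch below makes no assumption about column names, which are not documented here. It does assume UTF-8 encoding, unquoted fields, and that the file sits in the current directory.

```python
import csv
from pathlib import Path


def read_rows(lines):
    """Parse tab-separated lines (any iterable of strings) into lists of fields."""
    # QUOTE_NONE keeps quote characters literal; this is an assumption about the file.
    return list(csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))


tsv_path = Path("natufrance_2000_2021.tsv")  # location is an assumption
if tsv_path.exists():
    with tsv_path.open(newline="", encoding="utf-8") as fh:
        rows = read_rows(fh)
    print(f"{len(rows)} rows read from {tsv_path}")
```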

Tips

  • On WSL systems, ensure the tool is run on an NTFS drive.
  • Handle errors using try-except blocks or logging mechanisms.
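
One way to read the error-handling tip is to call the tool from a small wrapper that catches failures and logs them. The sketch below takes that reading. The run_command helper and the logger name are hypothetical, and it only relies on the exit status of the command.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("natudump-wrapper")  # hypothetical logger name


def run_command(argv):
    """Run a command; return True on success, log and return False on failure."""
    try:
        subprocess.run(argv, check=True)
    except FileNotFoundError:
        log.error("Executable not found: %s", argv[0])
        return False
    except subprocess.CalledProcessError as exc:
        log.error("Command exited with status %d", exc.returncode)
        return False
    log.info("Command finished successfully")
    return True


# Example call (not executed here):
# run_command([sys.executable, "natudump.py", "-o", "jo", "--years", "2000"])
```

Returning a boolean instead of re-raising lets a calling script decide whether to retry or skip a year.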



