Hi there 👋

StabRise - Document Processing Solutions

Our projects

PDF DataSource for Apache Spark

Source Code: https://github.com/StabRise/spark-pdf

Home page: https://stabrise.com/spark-pdf/

Quick Start Jupyter Notebook: https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb

The project provides a custom data source for Apache Spark that reads PDF files into a Spark DataFrame.

Key features:

Read PDF documents into a Spark DataFrame
Read PDF files lazily, page by page
Handle large files of up to 10k pages
Process scanned PDF files by running OCR on them
Tesseract OCR is included in the package, so no separate installation is needed
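
As a rough illustration of reading PDFs into a DataFrame, here is a minimal PySpark sketch. It assumes the spark-pdf package is already on the Spark classpath and that the data source is registered under the format name "pdf". The option names (imageType, resolution, pagePerPartition, reader) and their values are assumptions recalled from the project's documentation, and the helper functions pdf_reader_options and read_pdf are hypothetical names introduced only for this example. Check the Quick Start notebook linked above for the authoritative usage.

```python
from typing import Any, Dict


def pdf_reader_options(
    resolution: int = 200,
    pages_per_partition: int = 2,
    reader: str = "pdfBox",
    image_type: str = "BINARY",
) -> Dict[str, str]:
    """Build the option map for the PDF data source.

    The option names below are assumptions, not a confirmed API.
    """
    if resolution <= 0 or pages_per_partition <= 0:
        raise ValueError("resolution and pages_per_partition must be positive")
    return {
        "imageType": image_type,                       # assumed: how page images are stored
        "resolution": str(resolution),                 # assumed: render DPI for pages
        "pagePerPartition": str(pages_per_partition),  # assumed: pages per Spark partition
        "reader": reader,                              # assumed: PDF backend name
    }


def read_pdf(spark: Any, path: str, **kwargs: Any):
    """Return a DataFrame for the PDF file(s) at `path`.

    `spark` is an existing SparkSession; the format name "pdf" is an assumption.
    """
    options = pdf_reader_options(**kwargs)
    return spark.read.format("pdf").options(**options).load(path)
```

With a running SparkSession named spark, a call such as read_pdf(spark, "docs/*.pdf", resolution=300) would return a DataFrame. Because Spark evaluates lazily, no pages are read until an action such as show() or count() is invoked on that DataFrame.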

ScaleDP

Source Code: https://github.com/StabRise/scaledp

Home page: https://stabrise.com/scaledp/

Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb

ScaleDP is an open-source library for processing documents with Apache Spark.

Key features:

Load PDF documents and images
Extract text from PDF documents and images
Extract images from PDF documents
Run OCR on images and PDF documents
Run NER on text extracted from PDF documents and images
Visualize NER results

De-Identify

De-Identify is a tool for de-identifying (anonymizing) data.

Supported formats:

Text
Images
PDF documents
DICOM files