pdf2docx is a Python library that converts PDF files to docx. It extracts data from PDF files with the PyMuPDF library, parses the layout, paragraphs, images, tables, and other content with rule-based methods, and then generates the docx file with the python-docx library.

pdf2docx features

- Parse and create page layout
  - Margins
  - Sections and columns (currently supports layouts of up to two columns)
  - Headers and footers [TODO]
- Parse and create paragraphs
  - OCR text [TODO]
  - Horizontal (left to right) or vertical (bottom to top) text
  - Font styles such as font name, font size, bold/italic, and color
  - Text styles such as highlight, underline, and strikethrough
  - List styles [TODO]
  - External hyperlinks
  - Paragraph horizontal alignment (left/right/centered/justified) and spacing before and after
- Parse and create images
  - Inline images
  - Images in grayscale, RGB, CMYK, and other color spaces
  - Images with a transparency channel
  - Floating images (images placed behind text)
- Parse and create tables
  - Border styles such as width and color
  - Cell background color
  - Merged cells
  - Vertical text in cells
  - Tables with partly hidden borders
  - Nested tables
- Support multi-process conversion

pdf2docx parses both the content and the style of tables, so it can also be used as a tool for extracting table content.
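As a rough illustration of the table extraction use case, the sketch below calls the `extract_tables` method of pdf2docx's `Converter` class and serializes one table as CSV. The helper names `extract_tables_from_pdf` and `table_to_csv` are hypothetical, and the handling of `None` cells assumes that merged or empty cells are returned as `None`; check this against your installed version.

```python
import csv
import io


def extract_tables_from_pdf(pdf_path, start=0, end=None):
    """Return the tables found on pages [start, end) of a PDF (zero-based)."""
    # Imported lazily so the CSV helper below also works without pdf2docx.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        # Each table is expected to be a list of rows, each row a list of cell texts.
        return cv.extract_tables(start=start, end=end)
    finally:
        cv.close()


def table_to_csv(table):
    """Serialize one table (a list of rows of cell strings) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: merged or empty cells may come back as None.
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()


# Hypothetical usage:
# for table in extract_tables_from_pdf("/path/to/sample.pdf", start=0, end=1):
#     print(table_to_csv(table))
```

Only the cell text is exported here; border and shading information is not part of the extracted rows.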

Limitations

- Text recognition (OCR) for scanned PDFs is not currently supported
- Only languages written from left to right are supported (so Arabic is not supported)
- Rotated text is not supported
- Rule-based parsing cannot guarantee 100% restoration of PDF styles
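Because scanned PDFs are not supported, it can help to check for a text layer before converting. The following is a minimal sketch, not part of pdf2docx: it uses PyMuPDF directly, and the helper names and the `min_chars` threshold are assumptions chosen for illustration.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yields at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_has_text_layer(pdf_path, min_chars=20):
    """Check whether a PDF appears to be text-based rather than scanned."""
    # Imported lazily so has_text_layer can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        # The generator is consumed inside the with-block, while doc is still open.
        return has_text_layer((page.get_text() for page in doc), min_chars)


# Hypothetical usage:
# if not pdf_has_text_layer("/path/to/sample.pdf"):
#     print("Likely a scanned PDF; run OCR before converting.")
```

This is only a heuristic: a scanned document with a few stamped text annotations could still pass the check.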

Installation

pip install pdf2docx

Example

from pdf2docx import parse

pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'

# convert pdf to docx
parse(pdf_file, docx_file)
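For more control over page ranges and the multi-process conversion listed among the features, the `Converter` class can be used instead of `parse`. The sketch below is one possible wrapper: `build_convert_options` and `convert_pages` are hypothetical helper names, while `start`, `end`, `multi_processing`, and `cpu_count` are keyword arguments of `Converter.convert` as I understand the library; verify them against your installed version.

```python
def build_convert_options(start=0, end=None, workers=None):
    """Build keyword arguments for Converter.convert (zero-based page range)."""
    options = {"start": start, "end": end}
    if workers is not None and workers > 1:
        # Multi-processing applies to a continuous range given by start and end.
        options["multi_processing"] = True
        options["cpu_count"] = workers
    return options


def convert_pages(pdf_path, docx_path, start=0, end=None, workers=None):
    """Convert pages [start, end) of a PDF to docx, optionally in parallel."""
    # Imported lazily so the option helper works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path, **build_convert_options(start, end, workers))
    finally:
        cv.close()


# Hypothetical usage: convert the first three pages with four worker processes.
# convert_pages("/path/to/sample.pdf", "path/to/sample.docx", start=0, end=3, workers=4)
```

Closing the converter in a `finally` block releases the underlying PDF handle even if the conversion raises an error.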
