Commit a9a37dc

Creating docx to md feature
2 parents 0b36cb0 + 1e7754e commit a9a37dc

18 files changed, +601 -144 lines changed

Readme.md

Lines changed: 53 additions & 9 deletions
@@ -2,8 +2,17 @@
 
 ## Overview
 
-Simple and straight forward Python utility that converts a Markdown file (`.md`) to a Microsoft Word document (`.docx`). It supports multiple Markdown elements, including headings, bold and italic text, both unordered and ordered lists, and many more.
+Simple and straightforward Python utility that converts Markdown files (`.md`) to Microsoft Word documents (`.docx`) and vice versa. It supports multiple Markdown elements, including headings, bold and italic text, both unordered and ordered lists, and many more.
 
+## Word to Markdown Conversion Example:
+#### Input .docx file:
+![image](https://github.com/user-attachments/assets/2891ebdf-ff36-4fd5-af2f-b35413264b06)
+
+#### Output .md file:
+![image](https://github.com/user-attachments/assets/e46c096b-762e-4f0c-a0ab-f81c3069a533)
+
+
+## Markdown to Word Conversion Example:
 #### Input .md file:
 ![image](https://github.com/user-attachments/assets/c2325e52-05a7-4e11-8f28-4eeb3d8c06f5)
 

@@ -13,18 +22,22 @@ Simple and straight forward Python utility that converts a Markdown file (`.md`)
 
 ## Features
 
-- Converts Markdown headers (`#`, `##`, `###`) to Word document headings.
-- Supports bold and italic text formatting.
-- Converts unordered (`*`, `-`) and ordered (`1.`, `2.`) lists.
-- Handles paragraphs with mixed content.
+- Bi-directional conversion between Markdown and Word documents
+- Handles code in various programming languages, such as Python and Ruby, found in Word documents
+- Converts Markdown headers (`#`, `##`, `###`) to Word document headings and back
+- Supports bold and italic text formatting
+- Converts unordered (`*`, `-`) and ordered (`1.`, `2.`) lists
+- Handles paragraphs with mixed content
+- Preserves document structure during conversion
 
 ## Prerequisites
 
 You need to have Python installed on your system along with the following libraries:
 
-- `markdown` for converting Markdown to HTML.
-- `python-docx` for creating and editing Word documents.
-- `beautifulsoup4` for parsing HTML.
+- `markdown` for converting Markdown to HTML
+- `python-docx` for creating and editing Word documents
+- `beautifulsoup4` for parsing HTML
+- `mammoth` for converting Word to HTML
 
 Sure, let's enhance your instructions for clarity and completeness:
 

@@ -74,7 +87,33 @@ This code will create a file named `amazon_case_study.docx`, which is the conver
 
 ---
 
-This should make it easier to understand and follow the steps. Let me know if you need any more help or further enhancements!
+#### For Converting Word to Markdown
+Use the `word_to_markdown()` function to convert your Word document to Markdown:
+
+```python
+word_to_markdown(word_file, markdown_file)
+```
+
+- `word_file`: The path to the Word document you want to convert
+- `markdown_file`: The desired path and name for the output Markdown file
+
+
+Here's a complete example:
+
+```python
+from md2docx_python.src.docx2md_python import word_to_markdown
+
+# Define the paths to your files
+word_file = "sample_files/test_document.docx"
+markdown_file = "sample_files/test_document_output.md"
+
+# Convert the Word document to a Markdown file
+word_to_markdown(word_file, markdown_file)
+```
+
+This code will create a file named `test_document_output.md`, which is the conversion of `test_document.docx` to the Markdown format.
+
+---
 
 ## Why this repo and not others ?
 

@@ -108,6 +147,11 @@ Here are some reasons why this repo might be considered better or more suitable
 ### 8. **Privacy**
 - If you work at a company and convert your Markdown files to Word with an online tool, the tool may store your files, which can leak vital company information. With this repo, you can run the conversion on your own system.
 
+### 9. **Bi-directional Conversion**
+- **Complete Workflow**: Convert documents in both directions, allowing for round-trip document processing
+- **Format Preservation**: Maintains formatting and structure when converting between formats
+- **Flexibility**: Easily switch between Markdown and Word formats based on your needs
+
 ### Comparison to Other Scripts
 - **Feature Set**: Some scripts may lack comprehensive support for Markdown features or may not handle lists and text formatting well.
 - **Performance**: Depending on the implementation, performance might vary. This script is designed to be efficient for typical Markdown files.
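
The Features list in the README names headings, lists, and bold and italic text as the elements carried from Word back to Markdown. As a rough illustration of that mapping, here is a minimal sketch that works on plain tuples instead of `python-docx` objects. The helper names `format_run` and `paragraph_to_markdown` are hypothetical and are not part of this repository, and the style names (`Heading N`, `List Bullet`, `List Number`) are assumed to be Word's built-in ones.

```python
def format_run(text, bold=False, italic=False):
    """Wrap one run of text in Markdown emphasis markers."""
    if not text.strip():
        return text  # leave whitespace-only runs unmarked
    if bold and italic:  # the combined case is tested before the single flags
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


def paragraph_to_markdown(style_name, runs):
    """Map a style name plus (text, bold, italic) runs to one Markdown line."""
    style = style_name.lower()
    plain = "".join(text for text, _, _ in runs).strip()
    if style.startswith("heading"):
        # Parse the whole numeric suffix; fall back to level 1 if it is missing
        digits = style[len("heading"):].strip()
        level = int(digits) if digits.isdigit() else 1
        return f"{'#' * level} {plain}"
    if style.startswith("list bullet"):
        return f"* {plain}"
    if style.startswith("list number"):
        return f"1. {plain}"
    # Regular paragraph: format each run and join them in order
    return "".join(format_run(t, b, i) for t, b, i in runs)
```

Testing the bold-and-italic case before the single flags matters in any `if`/`elif` chain of this shape: if `bold` is checked first, a run that is both bold and italic never reaches the combined branch.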
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+from docx import Document
+import re
+
+
+def word_to_markdown(word_file, markdown_file):
+    """
+    Convert a Word document to Markdown format
+
+    Args:
+        word_file (str): Path to the input Word document
+        markdown_file (str): Path to the output Markdown file
+    """
+    # Open the Word document
+    doc = Document(word_file)
+    markdown_content = []
+
+    for paragraph in doc.paragraphs:
+        # Skip empty paragraphs
+        if not paragraph.text.strip():
+            continue
+
+        # Get paragraph style
+        style = paragraph.style.name.lower()
+
+        # Handle code blocks
+        if style.startswith("code block") or style.startswith("source code"):
+            markdown_content.append(f"```\n{paragraph.text.strip()}\n```\n\n")
+            continue
+
+        # Handle headings
+        if style.startswith("heading"):
+            level = style[-1]  # Get heading level from style name
+            markdown_content.append(f"{'#' * int(level)} {paragraph.text.strip()}\n")
+            continue
+
+        # Handle lists
+        if style.startswith("list bullet"):
+            markdown_content.append(f"* {paragraph.text.strip()}\n")
+            continue
+        if style.startswith("list number"):
+            markdown_content.append(f"1. {paragraph.text.strip()}\n")
+            continue
+
+        # Handle regular paragraphs with formatting
+        formatted_text = ""
+        for run in paragraph.runs:
+            text = run.text
+            if text.strip():
+                # Handle inline code (typically monospace font)
+                if run.font.name in [
+                    "Consolas",
+                    "Courier New",
+                    "Monaco",
+                ] or style.startswith("code"):
+                    if "\n" in text:
+                        text = f"```\n{text}\n```"
+                    else:
+                        text = f"`{text}`"
+                # Apply both bold and italic (checked first so it is reachable)
+                elif run.bold and run.italic:
+                    text = f"***{text}***"
+                # Apply bold
+                elif run.bold:
+                    text = f"**{text}**"
+                # Apply italic
+                elif run.italic:
+                    text = f"*{text}*"
+            formatted_text += text
+
+        if formatted_text:
+            markdown_content.append(f"{formatted_text}\n")
+
+        # Add an extra newline after paragraphs
+        markdown_content.append("\n")
+
+    # Write to markdown file
+    with open(markdown_file, "w", encoding="utf-8") as f:
+        f.writelines(markdown_content)
+
+
+def clean_markdown_text(text):
+    """
+    Clean and normalize markdown text
+
+    Args:
+        text (str): Text to clean
+
+    Returns:
+        str: Cleaned text
+    """
+    # Remove multiple spaces (spaces and tabs only, so newlines survive)
+    text = re.sub(r"[ \t]+", " ", text)
+    # Remove multiple newlines
+    text = re.sub(r"\n\s*\n\s*\n", "\n\n", text)
+    return text.strip()
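
Whitespace cleanup of this kind is order-sensitive: a pattern such as `\s+` matches newlines as well as spaces, so collapsing with it first leaves no blank lines for a later rule to find. The following is a minimal sketch of one way to keep paragraph breaks, assuming that only spaces and tabs should be collapsed. The name `normalize_whitespace` is hypothetical and is not a function in this commit.

```python
import re


def normalize_whitespace(text):
    """Collapse spaces and tabs, and cap blank lines, without merging lines."""
    # Collapse runs of spaces and tabs only, so newlines are left alone
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a single space left on either side of a line break
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive newlines to exactly one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "First   paragraph\n\n\n\nSecond  \n line"
    print(repr(normalize_whitespace(sample)))
```

Single newlines inside a paragraph are preserved here. Whether they should instead be joined into one line depends on how the generated Markdown is meant to be rendered.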
6.93 KB
Binary file not shown.