Convert DOCX documents
- Install the DocSharp.Docx package from NuGet
- Use the following code:
var converter = new DocxToRtfConverter();
converter.Convert(inputFile, outputFile); // file paths or streams; inputFile may also be a WordprocessingDocument object
To customize the default font and paragraph formatting used when the document does not specify them, use the DefaultSettings property:
converter.DefaultSettings.FontName = "Calibri";
converter.DefaultSettings.FontSize = 11; // In points (default is 12)
converter.DefaultSettings.SpaceAfterParagraph = 0; // In points (default is 8)
converter.DefaultSettings.LineSpacing = 1; // In lines (default is 1.15)
To produce an RTF string rather than directly saving to a file path or stream:
var converter = new DocxToRtfConverter();
string rtf = converter.ConvertToString(inputFile);
- Install the DocSharp.Docx package from NuGet
- Use the following code:
var converter = new DocxToMarkdownConverter();
converter.Convert(inputFile, outputFile); // file paths or streams; inputFile may also be a WordprocessingDocument object
Since many Markdown processors (e.g. GitHub) don't support base64 images, image conversion is enabled only when you set the ImagesOutputFolder and ImagesBaseUriOverride properties. The first specifies where images are actually saved and should be an absolute directory path; the second is the first part of an offline or online URI, which is combined with the image file name and written to the Markdown file.
For example, to save images in the same folder as the Markdown document:
var converter = new DocxToMarkdownConverter {
    ImagesOutputFolder = Path.GetDirectoryName(inputFilePath),
    ImagesBaseUriOverride = "" // will produce just the image file name, same effect as "./"
};
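To illustrate how a base URI and an image file name might be combined into the link written to the Markdown file, here is a minimal sketch. `BuildImageReference` is a hypothetical helper written only for this example and is not part of DocSharp; the library's actual joining logic may differ, for instance in how it treats trailing slashes. The file name `image1.png` and the `example.com` address are also placeholders.

```csharp
using System;

// Hypothetical helper (an assumption, not DocSharp code): joins a base URI
// and an image file name into the reference that a Markdown image link uses.
static string BuildImageReference(string baseUri, string imageFileName)
{
    // An empty base yields just the file name, i.e. a link relative
    // to the folder that contains the Markdown file.
    if (string.IsNullOrEmpty(baseUri))
        return imageFileName;

    // Avoid a doubled slash when the base already ends with one.
    return baseUri.TrimEnd('/') + "/" + imageFileName;
}

string sameFolder = BuildImageReference("", "image1.png");
string dotSlash = BuildImageReference("./", "image1.png");
string online = BuildImageReference("https://example.com/images", "image1.png");

// Each reference is wrapped in Markdown image syntax.
Console.WriteLine("![](" + sameFolder + ")");
Console.WriteLine("![](" + dotSlash + ")");
Console.WriteLine("![](" + online + ")");
```

With an empty base the link is relative to the Markdown file, which is why the images need to be saved next to it for the link to resolve.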
To produce a Markdown string rather than directly saving to a file path or stream:
var converter = new DocxToMarkdownConverter();
string markdown = converter.ConvertToString(inputFile);
Mathematical formulas in the DOCX document are converted to LaTeX syntax and embedded in a math block.
Please note that not all Markdown processors support math blocks, and that formatting and non-mathematical content are not currently supported when producing the LaTeX syntax.
To extract plain unformatted text from DOCX documents you can refer to the following code:
var converter = new DocxToTxtConverter();
converter.Convert(inputFilePath, "output.txt"); // file paths or streams; inputFile may also be a WordprocessingDocument object
Text will be extracted from most elements, including paragraphs, hyperlinks, text boxes and tables.
Table layout is maintained when converting to plain text. For example, if the table has 2 rows and 3 columns the following output will be produced:
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
Multi-line paragraphs, lists and merged cells are supported, but nested tables are ignored.
It is recommended to use a monospaced font (such as Cascadia Code, Consolas or Courier) in the text editor used to view the result (e.g. Notepad or VS Code), so that the characters are aligned correctly.
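The grid layout described for tables can be reproduced with a few lines of code, which also shows why a monospaced font matters: every column is padded to a fixed character width. The sketch below is only an illustration under simplifying assumptions (single-line cells, no merged cells, at least one row); `RenderTable` is a hypothetical helper and not DocSharp's implementation.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper (an assumption, not DocSharp code): renders rows of
// single-line cell texts as a grid drawn with '+', '-' and '|'.
static string RenderTable(string[][] rows)
{
    int columnCount = rows.Max(r => r.Length);

    // Each column is as wide as its longest cell.
    int[] widths = new int[columnCount];
    foreach (string[] row in rows)
        for (int c = 0; c < row.Length; c++)
            widths[c] = Math.Max(widths[c], row[c].Length);

    // Border line: cell width plus one space of padding on each side.
    string border = "+" + string.Join("+", widths.Select(w => new string('-', w + 2))) + "+";

    var sb = new StringBuilder();
    sb.Append(border).Append('\n');
    foreach (string[] row in rows)
    {
        sb.Append('|');
        for (int c = 0; c < columnCount; c++)
        {
            // Rows shorter than the widest row are filled with empty cells.
            string cell = c < row.Length ? row[c] : "";
            sb.Append(' ').Append(cell.PadRight(widths[c])).Append(" |");
        }
        sb.Append('\n').Append(border).Append('\n');
    }
    return sb.ToString();
}

string[][] table =
{
    new[] { "1", "2", "3" },
    new[] { "4", "5", "6" },
};
Console.Write(RenderTable(table));
```

Because alignment relies on `PadRight` counting characters, the borders line up only when every character occupies the same width on screen.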
The SaveTo extension method can be used to save a WordprocessingDocument object to a separate DOCX, RTF or Markdown document:
using (WordprocessingDocument document = WordprocessingDocument.Create("document.docx", WordprocessingDocumentType.Document))
{
    // Build a minimal document: one paragraph containing one run of text.
    MainDocumentPart mainPart = document.AddMainDocumentPart();
    mainPart.Document = new Document();
    Body body = mainPart.Document.AppendChild(new Body());
    Paragraph paragraph = body.AppendChild(new Paragraph());
    Run run = paragraph.AppendChild(new Run());
    run.AppendChild(new Text("Add some text here."));

    // Export the document to RTF.
    document.SaveTo("document.rtf", SaveFormat.Rtf);
}