Refactoring the thing: #1

Open
wants to merge 1 commit into
base: master

80 changes: 0 additions & 80 deletions document-converter.php

This file was deleted.

57 changes: 57 additions & 0 deletions inc/document-converter.php
@@ -0,0 +1,57 @@
<?php
/**
 * Script to extract text from MS Word documents
 *
 * Based on a StackOverflow answer:
 * @link https://stackoverflow.com/questions/19503653/how-to-extract-text-from-word-file-doc-docx-xlsx-pptx-php
 */


interface DocumentToText
{
    public static function getText(string $name): string;
    public static function readDocument(string $name): string;
    public static function formatText(string $text): string;
}


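class DocxToHTML implements DocumentToText
{
    public static function getText(string $name): string
    {
        $content = self::readDocument($name);

        return self::formatText($content);
    }

    public static function readDocument(string $name): string
    {
        try {
            $zip = new ZipArchive;
            // ZipArchive::RDONLY (value 16) opens the archive read-only; requires PHP 7.4.3+.
            $unzip = $zip->open($name, ZipArchive::RDONLY);

            if ($unzip !== true) {
                throw new Exception("Couldn't open file: $name", 1);
            }

            $content = $zip->getFromName('word/document.xml');
            $zip->close();

            // getFromName() returns false when the archive has no such entry.
            if ($content === false) {
                throw new Exception("Couldn't read word/document.xml in: $name", 2);
            }

            return $content;
        }
        catch (Exception $exception) {
            // On failure, the error message is returned in place of the document text.
            return $exception->getMessage();
        }
    }

    public static function formatText(string $text): string
    {
        $formattedText = $text;
        // Adjacent table cells are separated by a space.
        $formattedText = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $formattedText);
        // The single-quoted '\r\n' is a literal four-character placeholder, not a real
        // line break, so line breaks already present in the XML are not turned into <br>.
        $formattedText = str_replace('</w:r></w:p>', '\r\n', $formattedText);
        $formattedText = strip_tags($formattedText);
        $formattedText = str_replace('\r\n', '<br>', $formattedText);

        return $formattedText;
    }
}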
14 changes: 7 additions & 7 deletions index.php
@@ -5,14 +5,14 @@
* plain text file.
*/

include_once 'document-converter.php';
include_once 'inc/document-converter.php';


dir_digger(__DIR__ . '\documents\\', scandir(__DIR__ . '\documents\\')); // Windows path


/**
* Goes through all the nested directries
* Goes through all the nested directories
*
* @param string $dir_url
* @param array $dir_cont
@@ -29,12 +29,12 @@ function dir_digger($dir_url, $dir_cont) {
dir_digger($nested_dir_url, $nested_dir_cont);
}
else {
$file_url = $dir_url . $dir_cont[$i];
$file_obj = new Docx_Conversion($file_url);
$file_cont = $file_obj->convert_to_text();
// TODO: add the file check around here:
$fileName = $dir_url . $dir_cont[$i];
$fileContent = DocxToHTML::getText($fileName);

if ($file_cont !== 'Invalid file type') {
echo '<div style="padding: 1rem;">' . $file_cont . '</div>';
if ($fileContent !== 'Invalid file type') {
echo '<div style="padding: 1rem;">' . $fileContent . '</div>';
}
}
}
15 changes: 5 additions & 10 deletions readme.md
@@ -3,17 +3,12 @@ The script in this repository crawls through directories, looks for MS Word docu
Remember to change the Windows `\` with `/` in the paths if you're running the script on Linux.
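
As a minimal sketch of a portable alternative, the path can be built with PHP's `DIRECTORY_SEPARATOR` constant so the same code runs on Windows and Linux without editing. The `documents_dir` helper is hypothetical and not part of this repository.

```php
<?php
// Hypothetical helper (not part of this repository): builds the path of the
// documents folder using the separator of the current operating system.
function documents_dir(string $root): string
{
    // Drop any trailing slash or backslash so the separator is not doubled.
    $root = rtrim($root, '/\\');

    return $root . DIRECTORY_SEPARATOR . 'documents' . DIRECTORY_SEPARATOR;
}

echo documents_dir(__DIR__), PHP_EOL;
```

With such a helper, the hard-coded `'\documents\\'` strings in `index.php` could be replaced by a single call, assuming the folder stays in the root dir.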

## Requirements
- folder named `/documetns` that will contain the documents in the root dir.
- folder named `/documents` that will contain the documents in the root dir.

## Known issues
- in Windows, the script can't output `.doc` files properly, outputs a string of random characters (`Y, B8L 1(IzZYrH9pd4n(KgVB,lDAeX)Ly5ot ebW3gp j/gQjZTae9i5j5 fE514g7vnO( ,jV9kvvadVoTAn7jahy@ARhW.GMuO /e5sZWfPtfkA0zUw@tAm4T2j 6Q`).

## Resoruces
- base on a [stackoverflow answer](https://stackoverflow.com/questions/19503653/how-to-extract-text-from-word-file-doc-docx-xlsx-pptx-php)
## Resources
- based on a [StackOverflow answer](https://stackoverflow.com/questions/19503653/how-to-extract-text-from-word-file-doc-docx-xlsx-pptx-php)

## TODO:
- craete interface that allows the upload of multiple forms;
- extract the recursive serach into it's own function;
- refactor the main class to allow scaling;
- add markup parser;
- extract the recursive search into its own function;
- add markup parser; and
- add more supported files.
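
One possible shape for the extracted recursive search, offered as a sketch rather than the planned implementation: the hypothetical `find_docx_files` function below uses PHP's built-in `RecursiveDirectoryIterator` in place of the manual recursion in `dir_digger`, and it also covers the file check mentioned in the `index.php` TODO by keeping only `.docx` files. The function name and the choice to return a sorted array are assumptions.

```php
<?php
/**
 * Hypothetical helper (not part of this repository): collects the paths of
 * all .docx files found in $root and its nested directories.
 *
 * @param string $root
 * @return string[]
 */
function find_docx_files(string $root): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        // SKIP_DOTS leaves out the "." and ".." entries.
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        // Compare the extension case-insensitively so "REPORT.DOCX" also matches.
        if ($file->isFile() && strtolower($file->getExtension()) === 'docx') {
            $files[] = $file->getPathname();
        }
    }

    // Sort the paths so the output order does not depend on the filesystem.
    sort($files);

    return $files;
}
```

Each returned path could then be passed to `DocxToHTML::getText()`, which would keep the directory traversal separate from the conversion.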