Skip to content

ValentinGenev/extract-content-from-ms-docs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scan directories for MS Word documents

The script in this repository crawls through directories, looks for MS Word documents, extracts their content into and prints it into the browser. Remember to change the Windows \ with / in the paths if you're running the script on Linux.

Requirements

  • folder named /documetns that will contain the documents in the root dir.

Known issues

  • in Windows, the script can't output .doc files properly, outputs a string of random characters (Y, B8L 1(IzZYrH9pd4n(KgVB,lDAeX)Ly5ot ebW3gp� j/gQjZTae9i5j5 fE514g7vnO( ,jV9kvvadVoTAn7jahy@ARhW.GMuO /e5sZWfPtfkA0zUw@tAm4T2j 6Q).

Resoruces

TODO:

  • craete interface that allows the upload of multiple forms;
  • extract the recursive serach into it's own function;
  • refactor the main class to allow scaling;
  • add markup parser;
  • add more supported files.

About

Looks for .doc and .docx files and extracts their text content 📄

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages