Releases: TeamMsgExtractor/msg-extractor

Version 0.46.1

10 Nov 02:16
0c4ad35

v0.46.1

  • [TeamMsgExtractor #394] Fix typo in props that caused the wrong number of bytes to be given to a struct.

Version 0.46.0

08 Nov 14:32
8461f10

v0.46.0

  • [TeamMsgExtractor #95] Adjusted the overrideEncoding property of MSGFile to allow automatic encoding detection. Simply set the property to the string "chardet" and, assuming the chardet module is installed, it will analyze a number of the strings to try to form a consensus about the encoding. This will ignore the specified encoding only if it successfully detects one. Otherwise it will log a warning and fall back to the default behavior.
  • [TeamMsgExtractor #387] Changed extract_msg.utils.decodeRfc2047 to not throw decoding errors if the content given is not ASCII.
  • [TeamMsgExtractor #387] Changed header parsing policy to email.policy.compat32 to prevent partial parsing of quoted header fields.
  • [TeamMsgExtractor #388] Updated documentation of MSGFile.export to specify that updated fields on an MSGFile instance (and its subclasses) will not be reflected in the result of the function. Many of the functions do use the newest version of a cached_property, but this is not one of them.
  • Removed methods deprecated in v0.45.0.
  • Changed the base class of EntryID from no base class to abc.ABC.
  • Added position property to EntryID to tell how many bytes were used to create the EntryID.
  • Added a number of properties to MSGFile from [MS-OXCMSG].
  • Moved some properties down to MessageBase from its subclasses.
  • Added support for Journal objects.
  • Changed internal code of PermanentEntryID to correctly parse the data. Previously the distinguished name did not actually end at the null character, instead ending at the end of the bytes provided. If there was trailing data, it would be captured inadvertently.
  • Finished definition for StoreObjectEntryID.
  • Added new keyword arguments for MSG files: dateFormat and datetimeFormat. These allow the user to easily override the strings used to format dates and dates that include a time component, respectively.
    • In unifying all the formats into 2 options, you may notice that some will look a bit different starting from this version, as there was an unfortunately large amount of variation.
  • Fixed code for MessageBase.parsedDate which could have incorrect values.
  • Fixed issues with MessageBase.date and related things either being incorrectly documented or doing things that are not specified by the documentation. It was supposed to have been changed to use datetime objects, but it was still using strings.
  • Removed unused function extract_msg.utils.isEmptyString.
  • Removed unused function extract_msg.utils.properHex.
  • Added a getStream helper function to CustomAttachmentHandler.
  • Added a getStreamAs helper function to CustomAttachmentHandler.
  • Added new custom attachment handler for journal-associated attachments.
  • Changed EntryID.autoCreate to return None if given None or empty bytes.
  • Changed EntryID.autoCreate to raise a FeatureNotImplemented exception if no valid entry ID class is found.
  • Fix typing annotations for CustomAttachmentHandler.
  • Removed unneeded imapclient dependency.
  • Changed getJson to have values be null if they aren't found rather than an empty string.
  • Implemented the getJson method correctly for a number of classes.
  • Changed Task.percentComplete to always return a float.
  • Changed the NotImplementedError for custom attachment handler not being found to FeatureNotImplemented. Additionally, changed the error message to specify the CLSID found on the attachment to better enable people to report issues.
  • Changed code for Recipient and MessageBase that makes it rely on MessageBase.recipientTypeClass to determine the class to use for the recipientType property. Adjusted the typing of Recipient to have it reflect the type that will be used.
  • Correctly changed the returned value for ResponseStatus.fromIter to actually return a List instead of a set.
  • Filled out typing information for a significant portion of the module where variables or functions were missing it. This includes the entirety of the constants submodule.
  • Corrected a number of minor issues.
  • Extended values for DVAspect enum.
  • Added new enums to go with parsing for OLEPresentationStream.
  • Changed NNTPNewsgroupFolderEntryID.newsgroupName to bytes instead of string since it is ANSI.
  • Fixed an issue that would cause headers to fail to parse properly if the header text starts with "Microsoft Mail Internet Headers Version 2.0" which is common on some MSG files. This is fixed by stripping that from the beginning before actually parsing the text. This is to circumvent CPython issue #85329, confirmed to still exist in at least some of the supported Python versions.
  • Added listDir and slistDir as methods to AttachmentBase, Recipient, and Named. These always exclude the prefix, returning as if their directory is the root of the object. This allows the names to be used directly for accessing those files.
  • Numerous spelling fixes in docstrings, comments, and exceptions.
  • Reduced the amount of initialization performed by MessageBase. Much of this initialization was there from before a lot of stuff changed to cached_property and a number of internal variables were being used. Now all of the relevant variables will be initialized by the way they are accessed.
  • Added new exception DependencyError.
  • Changed the errors for missing optional dependencies from ImportError to DependencyError.
  • Removed all instances of the rawData property in favor of the toBytes method. For now, many of these will simply return the raw data used, specifically those that are still unmodifiable. Any whose properties have the ability to be modified will have properly implemented versions. These classes also allow None to be passed as the value for their data, which will be the default if no arguments have been passed to the constructor. If no arguments or None is given as the data, it will create a new instance with default values. This is all in an effort to move towards the ability to create new MSG files and the MSGWriter class. All toBytes methods will either exclusively return bytes or will return None to specify that the structure isn't valid to convert to bytes. Structures that may be invalid will be annotated as Optional[bytes] for the return type.
    • Additionally, these objects will also support the __bytes__ method. If the object returned is not bytes, the method will throw a TypeError.
  • Removed the individual PropBase flag properties and changed the main flags property to return an enum containing the flags.
  • Changed various data structs to allow modification and creation of new instances for writing to an MSG file.
  • Changed TZRule to use unsigned values where applicable.
  • Changed TZRule to require the 14 null bytes (I commented it out completely by accident instead of swapping it to a plain read). It now logs a warning if the bytes are not null.
  • Removed unneeded function windowsUnicode.
  • Moved FixedLengthProperty.parseType to the private API. This was not intended for external use anyway, so leaving it as public API didn't make sense.
  • Fixed check for type in ContactAddressEntryID being the wrong value.
  • Modified inputToBytes to support objects with the __bytes__ method. If the method exists and works then it will be used as a last resort.
  • Modified OleWriter to accept objects with a __bytes__ method for the data to use for an entry.
  • Added __bytes__ method to MSGFile. This is equivalent to calling MSGFile.exportBytes.
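The toBytes and __bytes__ convention described in the notes above can be sketched roughly as follows. This is only an illustration: DemoStruct, its 4-byte layout, and the valid flag are hypothetical and not part of extract_msg; only the Optional[bytes] return type, the acceptance of None as data, and the TypeError behavior come from the notes.

```python
from typing import Optional


class DemoStruct:
    """Hypothetical structure used only for illustration (not in extract_msg)."""

    def __init__(self, data: Optional[bytes] = None):
        # Passing None (or nothing) creates an instance with default values.
        self.value = 0 if data is None else int.from_bytes(data[:4], "little")
        # Assumed flag: marks whether the structure can be serialized.
        self.valid = True

    def toBytes(self) -> Optional[bytes]:
        # Returns bytes, or None when the structure is not valid to convert.
        if not self.valid:
            return None
        return self.value.to_bytes(4, "little")

    def __bytes__(self) -> bytes:
        data = self.toBytes()
        if data is None:
            raise TypeError("DemoStruct is not valid to convert to bytes.")
        return data


default_struct = DemoStruct()
parsed_struct = DemoStruct(b"\x2a\x00\x00\x00")
print(bytes(default_struct), bytes(parsed_struct))
```

With a __bytes__ method in place, an object like this can be handed to anything that calls bytes() on its input, which is the kind of fallback the notes describe for inputToBytes and OleWriter.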

Version 0.45.0

13 Aug 00:03
25db76d

v0.45.0

  • BREAKING: Changed parsing of string multiple properties to remove the trailing null byte. This will cause the output of parsing them to differ.
  • Updated typing information for some functions and classes.
  • Fixed a bug with MessageSignedBase.attachments that would cause it to return None instead of an empty list if the number of normal attachments was 0 and the error behavior was set to ignore violations of the standard.
  • Updated MessageSignedBase.attachments to use functools.cached_property instead of property.
  • Fixed spelling errors in some exception strings.
  • Made NamedPropertyBase a subclass of abc.ABC.
  • Cleaned up some of the code for named properties to remove unused variables and remove inefficient code.
  • Changed PropBase to be a subclass of abc.ABC.
  • Added detailed versioning info to the README.
  • Deprecated many private functions, including methods on many of the classes. Of primary note are _getStream and _getStringStream, which have been moved to the public API as getStream and getStringStream. Any deprecated functions still exist and will forward to a public API function if they are not being removed. Additionally, all internal usage of them has been removed. This change is one of the big preparations that is needed for the 1.0.0 release.
    • As mentioned, a number of these deprecated functions have been moved to the public API. It is recommended that you run tests with your code after enabling deprecation warnings to see what should be changed.
  • Removed items deprecated in or before 0.42.0.
  • Changed the API for the private method _genRecipient. This is not intended for use outside of the module except for subclasses. The change removed the allowance of ints for the second argument, requiring that it be a valid enum type.
  • Convert many enum types to IntEnum.
  • Extended functionality of PropertiesStore to allow for integer property names and getting a property based on just the ID. You can also get a list of all properties that use a given ID.
  • Added new function PropertiesStore.getProperties which gets a list of all properties matching the property ID. Return type is a list of PropBase instances.
  • Added new function PropertiesStore.getValue which looks for the first matching FixedLengthProp and returns the value from it.
  • Improved internal code related to getting a property with a potentially unknown type.
  • Added a number of entirely new functions to the public API on MSGFile, AttachmentBase, PropertiesStore, and Recipient objects:
    • getMultipleBinary: Gets a multiple binary property as a list of bytes objects.
    • getSingleOrMultipleBinary: A combination of getStream and getMultipleBinary which prefers a single binary stream. Returns a single bytes object or a list of bytes objects.
    • getMultipleString: Gets a multiple string property as a list of str objects.
    • getSingleOrMultipleString: A combination of getStringStream and getMultipleString which prefers a single string stream. Returns a single str object or a list of str objects.
    • getPropertyVal: Shortcut for instance.props.getValue that allows new behavior to be added by overriding it.
    • getNamedProp: Shortcut for instance.namedProperties.get((propertyName, guid), default) that allows new behavior to be added by overriding it.
  • Removed Named._getStringStream and Named.sExists. The named properties storage will always use regular streams and not string streams.
  • Changed all Named methods to no longer have a prefix argument. The prefix should always be false since the named property mapping will only exist in the top level directory.
  • Adjusted tryGetMimeType to allow any attachments whose data property would return a bytes instance.
  • Changed internal code to use public API functions wherever possible. This includes making many private API functions use calls to the public API for getting bits of data.
  • Fixed potential issue with AttachmentBase.clsid which had the potential to cause some attachments to fail to generate a CLSID.
  • Outright removed or changed a significant portion of the private API. I have rarely, if ever, seen references to these parts, so this should cause you no issues. Some of these have also been moved to the public API, either identically or with changes, and the mapping is as such:
    • _getNamedAs -> getNamedAs: Changed to always require a conversion argument. If you were previously using it to plainly get a named property or to handle the property being None or a real value, you should use the return value of getNamedProp instead.
    • _getPropertyAs -> getPropertyAs: Same as above, use getPropertyVal instead for None or plain access.
    • _getStreamAs -> getStreamAs, getStringStreamAs: Once again, see above. Use getStream and getStringStream, respectively.
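The forwarding pattern for deprecated private methods, and one way to surface the resulting warnings while testing, might look like the sketch below. DemoFile and its stream storage are hypothetical stand-ins; only the names _getStream and getStream come from the notes above, and the real methods may take different arguments.

```python
import warnings


class DemoFile:
    """Hypothetical stand-in for a class with a deprecated private method."""

    def __init__(self, streams):
        self._streams = dict(streams)

    def getStream(self, name):
        # Public API: returns the stream data, or None if it does not exist.
        return self._streams.get(name)

    def _getStream(self, name):
        # Deprecated private API: warn, then forward to the public function.
        warnings.warn(
            "_getStream is deprecated; use getStream instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.getStream(name)


demo = DemoFile({"example_stream": b"data"})

# DeprecationWarning is hidden in many contexts by default, so enable it
# explicitly and record what gets raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    old_result = demo._getStream("example_stream")

for warning in caught:
    print(warning.category.__name__, warning.message)
```

When running a test suite, starting Python with -W error::DeprecationWarning turns these warnings into exceptions, which makes remaining calls to deprecated functions easy to locate.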

Version 0.44.0

03 Aug 21:39
4f0954e

v0.44.0

  • Fixed a bug that caused MessageBase.headerInit to always return False after the 0.42.0 update.
  • Changed MessageBase.headerInit to a property.
  • Fixed extract_msg.utils.__all__.
  • Minor reorganization within extract_msg/utils.py.
  • Minor changes to docstrings.
  • Minor README updates.
  • Fix issue with folded header fields decoding incorrectly when given to extract_msg.utils.decodeRfc2047.
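To illustrate why folded header fields need care when decoding RFC 2047 encoded words, here is a minimal sketch using only the standard library. It is not the implementation of extract_msg.utils.decodeRfc2047; the helper name decode_folded_header and the choice to unfold before decoding are assumptions made for this example.

```python
import re
from email.header import decode_header, make_header


def decode_folded_header(value: str) -> str:
    """Hypothetical helper: unfold a header value, then decode encoded words."""
    # A folded header continues on the next line with leading whitespace.
    # Collapse each fold into a single space before decoding.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", value)
    # Whitespace between two adjacent encoded words is dropped when decoding,
    # as RFC 2047 requires.
    return str(make_header(decode_header(unfolded)))


folded = "=?utf-8?q?Caf=C3=A9?=\r\n =?utf-8?q?_au_lait?="
print(decode_folded_header(folded))
```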

Version 0.43.0

03 Aug 20:02
aab4c17

v0.43.0

  • [TeamMsgExtractor #56] [TeamMsgExtractor #248] Added new function MessageBase.asEmailMessage which will convert the MessageBase instance, if possible, to an email.message.EmailMessage object. If an embedded MSG file on a MessageBase object is of a class that does not have this function, it will simply be attached to the instance as bytes.
  • Changed imports in message_base.py to help with type checkers.
  • Changed from using email.parser.EmailParser to email.parser.HeaderParser in MessageBase.header.
  • Changed some of the internal code for MessageBase.header. This should improve usage of it, and should not have any noticeable negative changes. You may notice some of the values parse slightly differently, but this effect should be mostly suppressed.
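The fallback of attaching an embedded file as raw bytes can be pictured with the standard library alone. The sketch below does not call MessageBase.asEmailMessage; build_message, the sample addresses, and the application/octet-stream type are assumptions chosen for illustration.

```python
from email.message import EmailMessage


def build_message(subject, sender, recipient, body, embedded):
    """Hypothetical helper: build an EmailMessage with bytes attachments."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    # Each embedded item is attached as opaque bytes with a filename.
    for filename, data in embedded:
        msg.add_attachment(
            data,
            maintype="application",
            subtype="octet-stream",
            filename=filename,
        )
    return msg


message = build_message(
    "Example",
    "alice@example.com",
    "bob@example.com",
    "Body text.",
    [("embedded.msg", b"\xd0\xcf\x11\xe0 raw data")],
)
for part in message.iter_attachments():
    print(part.get_filename(), part.get_content_type())
```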

Version 0.42.2

03 Aug 03:03
8fc4480

v0.42.2

  • Fix bug in AttachmentBase.mimetype that would cause it to throw an error when accessed. This bug was introduced in v0.42.0.

Version 0.42.1

31 Jul 13:37
51746cc

v0.42.1

  • Fixed some constants being accessed with the wrong name (names were changed in reorganization).
  • Removed unused regular expression.

Version 0.42.0

29 Jul 23:16
3896db4

v0.42.0

  • [TeamMsgExtractor #372] Changed the way that the save functions return a value. This makes the return value from all save functions much more informative, allowing a user to distinguish whether a file or a folder (or more than one) was saved by the function. It also guarantees that all classes from this module will return the relevant path(s) if data is actually saved.
  • [TeamMsgExtractor #288] Added feature to allow attachment save functions to simply overwrite existing files of the same name. This can be done with the overwriteExisting keyword argument from code or the --overwrite-existing option from the command line.
  • [TeamMsgExtractor #40] Added new submodule custom_attachments. This submodule provides an extendable way to handle custom attachment types, attachment types whose structure and formatting are not defined in the Microsoft documentation for MSG files. This includes a handler to at least partially cover support for Outlook images.
  • [TeamMsgExtractor #373] Added the encoding submodule for encoding tasks, including proper support for Microsoft's implementation of CP950. This gets added to the codecs list as "windows-950".
    • Added infrastructure to make it easy to add variable-byte (up to two bytes) encodings and single-byte encodings.
    • Added the following encodings:
      • windows-874
      • x-mac-ce
      • x-mac-cyrillic
      • x-mac-greek
      • x-mac-icelandic
      • x-mac-turkish
  • Fixed an issue in the save functions that left the possibility for the zip files to not end up closing if the save function created it and then had an exception.
  • Added new property AttachmentBase.clsid which returns the listed CLSID value of the data stream/storage of the attachment.
  • Changed internal behavior of MSGFile.attachments. This should not cause any noticeable changes to the output.
  • Refactored code significantly to make it more organized.
  • Changed the exports from the main module to only include an important subset of the module. For other items, you'll have to import the submodule that it falls under to access it. Submodules export all important pieces, so it will be easier to find.
    • This includes having many modules be under entirely new paths. Some of these changes have been done with no deprecation, something I generally try to avoid. This is happening at the same time as the public API is significantly changing, which makes it more acceptable.
  • Fixed __main__ using the wrong enum for error behavior.
  • Fixed Named.get being severely out of date (it's not used anywhere by the module which is why it wasn't noticed).
  • Fixed Named.__getitem__ being entirely case-sensitive.
  • Switched much of the internal code (and the treePath property of all classes that have it) to using weakref.ReferenceType to avoid hard cyclic references.
  • Fixed Recipient._getTypedStream never returning a value.
  • Added additional type hints in various places.
  • Modified tests.py to only run if it is run as a file instead of imported.
  • Changed knownMsgClass to a private function since it is explicitly not being exported by any part of the module.
  • Removed unused function getFullClassName.
  • Fixes to the HTML body when saving as HTML will no longer require the preparedHtml/--prepared-html option.
  • Removed unused exceptions.
  • Entirely reorganized the way attachments are initialized, including the class that will be used in various circumstances. Embedded MSG files, custom attachments, and web attachments will all use dedicated classes that are subclasses of AttachmentBase.
    • With this change, the way to specify a new Attachment class is to override the function used when creating attachments. This can be done by passing attachmentInit = myFunction as an option to openMsg. This function MUST return an instance of AttachmentBase.
  • Added first implementation of web attachments. Saving is not currently possible, but basic relevant property access is now possible. Saving will not be stopped by this attachment if skipNotImplemented = True is passed to the save function.
  • Changed the option to suppress RTFDE errors to fall under the ErrorBehavior enum. Usage of the original option will be allowable, but is being marked as deprecated. However, it is still a dedicated option from the command line.
    • Also fixed the option not properly ignoring some RTFDE errors, specifically the ones that it is normal for the module to throw.
  • Removed some constants that are not used by the module.
  • Updated to support RTFDE version 0.1.0. Users encountering random errors from that module should find that those errors have disappeared. If you get errors from it still, bring up the issue on their GitHub.
  • Fixed bug that would cause weird behavior if you gave an empty string as the path for an MSG file.
  • Added support for IPM.StickyNote.
  • Fixed an issue that would cause MSG file to never close if an error happened during any of the __init__ functions for MSG classes.
  • Removed unneeded chardet dependency.
  • Removed Contact.__init__ as it didn't provide any unique behavior.
  • Changed the documentation of openMsg to specify that it accepts all options recognized by MSGFile subclasses, allowing the doc string to not be modified every time one of them is changed.
    • Changed the documentation of various __init__ methods to do the same thing.
  • Added dataType property to AttachmentBase and SignedAttachment for checking the class that the data will be, if accessible. Returns None if the data is inaccessible, including because accessing it would throw an exception.
  • Added new enum InsecureFeatures and option insecureFeatures. This option will allow certain features with security implications to be used for files that you trust. Currently the only feature it supports is the usage of PIL/Pillow to open and modify images. All features like this will be opt-in to reduce possible vulnerabilities.
  • Modified all custom exceptions the module uses to derive from a single base class for better organization.
    • Added new exceptions to handle some of the situations previously handled by base Python exceptions.
  • Changed internal handling of the prefix option for MSGFile.__init__ (and therefore openMsg). If you are not setting this manually, you should notice little difference.
  • Made enums less strict and converted all using fromBits to be IntFlag enums.
  • Fixed CalendarBase.keywords being blatantly incorrect (it was so bad I don't know how it slipped through).
  • Fixed Contact.gender being blatantly incorrect.
  • Fixed sender not being properly decoded in some circumstances.
  • Changed behavior of MSGFile to have olefile raise defects of type DEFECT_INCORRECT and above instead of just DEFECT_FATAL. Uncaught issues of DEFECT_INCORRECT can often cause the module to have parsing issues that may be misleading; this just ensures the issue is clarified. This behavior can be reverted to the previous behavior with ErrorBehavior.OLE_DEFECT_INCORRECT.
  • Fixed potential issues that may have made it possible for certain attachments to ignore filename conflict resolution code.
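For readers unfamiliar with how a module can add a name such as "windows-950" to the codecs list, the sketch below registers a tiny single-byte codec using only the standard library. It is not the encoding submodule's code: the codec name x-demo-sbcs, the mapping table, and the helper names are all hypothetical.

```python
import codecs

# Hypothetical codec name, used only for this sketch.
CODEC_NAME = "x-demo-sbcs"

# Bytes 0x00-0x7F map to ASCII and 0x80 maps to the euro sign. The rest are
# marked undefined with U+FFFE, the convention used by charmap-based codecs.
DECODING_TABLE = "".join(chr(i) for i in range(128)) + "\u20ac" + "\ufffe" * 127
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _search(name):
    # Names are normalized before they reach search functions, and the
    # handling of hyphens differs between Python versions, so compare with
    # hyphens folded to underscores.
    if name.replace("-", "_") == CODEC_NAME.replace("-", "_"):
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


codecs.register(_search)

decoded = b"5 \x80".decode(CODEC_NAME)
encoded = decoded.encode(CODEC_NAME)
print(decoded, encoded)
```

Once a search function is registered, the name works anywhere Python accepts an encoding, including bytes.decode and str.encode.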

Version 0.41.5

11 Jun 17:18
3cffc2e

v0.41.5

  • Fixed an issue from version 0.41.3 where the header being present but missing the From field would cause an exception.

Version 0.41.4

11 Jun 16:42
9138590

v0.41.4

  • Fixed an issue in the last version that would break the decoding function if the contents were not encoded.
  • Updated tzlocal and allow future updates for compressed_rtf and ebcdic.