Releases: TeamMsgExtractor/msg-extractor
Version 0.46.1
v0.46.1
- [TeamMsgExtractor #394] Fix typo in props that caused the wrong number of bytes to be given to a struct.
Version 0.46.0
v0.46.0
- [TeamMsgExtractor #95] Adjusted the overrideEncoding property of MSGFile to allow automatic encoding detection. Simply set the property to the string "chardet" and, assuming the chardet module is installed, it will analyze a number of the strings to try to form a consensus about the encoding. This will ignore the specified encoding only if it successfully detects one. Otherwise it will log a warning and fall back to the default behavior.
- [TeamMsgExtractor #387] Changed extract_msg.utils.decodeRfc2047 to not throw decoding errors if the content given is not ASCII.
- [TeamMsgExtractor #387] Changed header parsing policy to email.policy.compat32 to prevent partial parsing of quoted header fields.
- [TeamMsgExtractor #388] Updated documentation of MSGFile.export to specify that updated fields on an MSGFile instance (and its subclasses) will not be reflected in the result of the function. Many of the functions do use the newest version of a cached_property, but this is not one of them.
- Removed methods deprecated in v0.45.0.
- Changed the base class of EntryID from no base class to abc.ABC.
- Added position property to EntryID to tell how many bytes were used to create the EntryID.
- Added a number of properties to MSGFile from [MS-OXCMSG].
- Moved some properties down to MessageBase from its subclasses.
- Added support for Journal objects.
- Changed internal code of PermanentEntryID to correctly parse the data. Previously the distinguished name did not actually end at the null character, instead ending at the end of the bytes provided. If there was trailing data, it would be captured inadvertently.
- Finished definition for StoreObjectEntryID.
- Added new keyword arguments for MSG files: dateFormat and datetimeFormat. These allow the user to easily override the strings used to format dates and dates that include a time component, respectively.
  - In unifying all the formats into 2 options, you may notice that some will look a bit different starting from this version, as there was an unfortunately large amount of variation.
- Fixed code for MessageBase.parsedDate which could have incorrect values.
- Fixed issues with MessageBase.date and related things either being incorrectly documented or doing things that are not specified by the documentation. It was supposed to have been changed to use datetime objects, but it was still using strings.
- Removed unused function extract_msg.utils.isEmptyString.
- Removed unused function extract_msg.utils.properHex.
- Added a getStream helper function to CustomAttachmentHandler.
- Added a getStreamAs helper function to CustomAttachmentHandler.
- Added new custom attachment handler for journal-associated attachments.
- Changed EntryID.autoCreate to return None if given None or empty bytes.
- Changed EntryID.autoCreate to raise a FeatureNotImplemented exception if no valid entry ID class is found.
- Fix typing annotations for CustomAttachmentHandler.
- Removed unneeded imapclient dependency.
- Changed getJson to have values be null if they aren't found rather than an empty string.
- Implemented the getJson method correctly for a number of classes.
- Changed Task.percentComplete to always return a float.
- Changed the NotImplementedError for custom attachment handler not being found to FeatureNotImplemented. Additionally, changed the error message to specify the CLSID found on the attachment to better enable people to report issues.
- Changed code for Recipient and MessageBase so that it relies on MessageBase.recipientTypeClass to determine the class to use for the recipientType property. Adjusted the typing of Recipient to have it reflect the type that will be used.
- Correctly changed the returned value for ResponseStatus.fromIter to actually return a List instead of a set.
- Filled out typing information for a significant portion of the module where variables or functions were missing it. This includes the entirety of the constants submodule.
- Corrected a number of minor issues.
- Extended values for DVAspect enum.
- Added new enums to go with parsing for OLEPresentationStream.
- Changed NNTPNewsgroupFolderEntryID.newsgroupName to bytes instead of string since it is ANSI.
- Fixed an issue that would cause headers to fail to parse properly if the header text starts with "Microsoft Mail Internet Headers Version 2.0" which is common on some MSG files. This is fixed by stripping that from the beginning before actually parsing the text. This is to circumvent CPython issue #85329, confirmed to still exist in at least some of the supported Python versions.
- Added listDir and slistDir as methods to AttachmentBase, Recipient, and Named. These always exclude the prefix, returning as if their directory is the root of the object. This allows the names to be directly used for accessing those files.
- Numerous spelling fixes in docstrings, comments, and exceptions.
- Reduced the amount of initialization performed by MessageBase. Much of this initialization was there from before a lot of stuff changed to cached_property and a number of internal variables were being used. Now all of the relevant variables will be initialized by the way they are accessed.
- Added new exception DependencyError.
- Changed the errors for missing optional dependencies from ImportError to DependencyError.
- Removed all instances of the rawData property in favor of the toBytes method. For now, many of these will simply return the raw data used, specifically those that are still unmodifiable. Any whose properties have the ability to be modified will have properly implemented versions. These classes also allow None to be passed as the value for their data, which will be the default if no arguments have been passed to the constructor. If no arguments or None is given as the data, it will create a new instance with default values. This is all in an effort to move towards the ability to create new MSG files and the MSGWriter class. All toBytes methods will either exclusively return bytes or will return None to specify that the structure isn't valid to convert to bytes. Structures that may be invalid will be annotated as Optional[bytes] for the return type.
  - Additionally, these objects will also support the __bytes__ method. If the object returned is not bytes, the method will throw a TypeError.
- Removed the individual PropBase flag properties and changed the main flags property to return an enum containing the flags.
- Changed various data structs to allow modification and creation of new instances for writing to an MSG file.
- Changed TZRule to use unsigned values where applicable.
- Changed TZRule to require the 14 null bytes (I commented it out completely by accident instead of swapping it to a plain read). It now logs a warning if the bytes are not null.
- Removed unneeded function windowsUnicode.
- Moved FixedLengthProperty.parseType to the private API. This was not intended for external use anyway, so leaving it as public API didn't make sense.
- Fixed check for type in ContactAddressEntryID being the wrong value.
- Modified inputToBytes to support objects with the __bytes__ method. If the method exists and works then it will be used as a last resort.
- Modified OleWriter to accept objects with a __bytes__ method for the data to use for an entry.
- Added __bytes__ method to MSGFile. This is equivalent to calling MSGFile.exportBytes.
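The toBytes and __bytes__ contract described above can be sketched roughly as follows. This is only an illustration under assumptions: ExampleStruct, its value field, and its valid flag are hypothetical and are not classes or attributes from extract_msg. The sketch shows None as the default constructor data, toBytes returning Optional[bytes], and __bytes__ raising a TypeError when there is nothing valid to return.

```python
from typing import Optional


class ExampleStruct:
    """Hypothetical structure for illustration; not part of extract_msg."""

    def __init__(self, data: Optional[bytes] = None):
        # No data (or None) creates a new instance with default values.
        self.value = int.from_bytes(data, 'little') if data else 0
        # Hypothetical flag marking whether the structure can be serialized.
        self.valid = True

    def toBytes(self) -> Optional[bytes]:
        # Return bytes, or None when the structure is not valid to convert.
        if not self.valid:
            return None
        return self.value.to_bytes(4, 'little')

    def __bytes__(self) -> bytes:
        result = self.toBytes()
        if not isinstance(result, bytes):
            raise TypeError('Structure is not valid to convert to bytes.')
        return result


fresh = ExampleStruct()
parsed = ExampleStruct(b'\x2a\x00\x00\x00')
```

With this shape, bytes(obj) works anywhere a bytes-like input is accepted as a last resort, while callers that need to handle invalid structures can check the toBytes return value for None instead.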
Version 0.45.0
v0.45.0
- BREAKING: Changed parsing of string multiple properties to remove the trailing null byte. This will cause the output of parsing them to differ.
- Updated typing information for some functions and classes.
- Fixed a bug with MessageSignedBase.attachments that would cause it to return None instead of an empty list if the number of normal attachments was 0 and the error behavior was set to ignore violations of the standard.
- Updated MessageSignedBase.attachments to use functools.cached_property instead of property.
- Fixed spelling errors in some exception strings.
- Made NamedPropertyBase a subclass of abc.ABC.
- Cleaned up some of the code for named properties to remove unused variables and remove inefficient code.
- Changed PropBase to be a subclass of abc.ABC.
- Added detailed versioning info to the README.
- Deprecated many private functions, including methods on many of the classes. Of primary note are _getStream and _getStringStream, which have been moved to the public API as getStream and getStringStream. Any deprecated functions still exist and will forward to a public API function if they are not being removed. Additionally, all internal usage of them has been removed. This change is one of the big preparations that is needed for the 1.0.0 release.
  - As mentioned, a number of these deprecated functions have been moved to the public API. It is recommended that you run tests with your code after enabling deprecation warnings to see what should be changed.
- Removed items deprecated in or before 0.42.0.
- Changed the API for the private method _genRecipient. This is not intended for use outside of the module except for subclasses. The change removed the allowance of ints for the second argument, requiring that it be a valid enum type.
- Converted many enum types to IntEnum.
- Extended functionality of PropertiesStore to allow for integer property names and getting a property based on just the ID. You can also get a list of all properties that use a given ID.
- Added new function PropertiesStore.getProperties which gets a list of all properties matching the property ID. Return type is a list of PropBase instances.
- Added new function PropertiesStore.getValue which looks for the first matching FixedLengthProp and returns the value from it.
- Improved internal code related to getting a property with a potentially unknown type.
- Added a number of entirely new functions to the public API on MSGFile, AttachmentBase, PropertiesStore, and Recipient objects:
  - getMultipleBinary: Gets a multiple binary property as a list of bytes objects.
  - getSingleOrMultipleBinary: A combination of getStream and getMultipleBinary which prefers a single binary stream. Returns a single bytes object or a list of bytes objects.
  - getMultipleString: Gets a multiple string property as a list of str objects.
  - getSingleOrMultipleString: A combination of getStringStream and getMultipleString which prefers a single string stream. Returns a single str object or a list of str objects.
  - getPropertyVal: Shortcut for instance.props.getValue that allows new behavior to be added by overriding it.
  - getNamedProp: Shortcut for instance.namedProperties.get((propertyName, guid), default) that allows new behavior to be added by overriding it.
- Removed Named._getStringStream and Named.sExists. The named properties storage will always use regular streams and not string streams.
- Changed all Named methods to no longer have a prefix argument. The prefix should always be false since the named property mapping will only exist in the top level directory.
- Adjusted tryGetMimeType to allow any attachments whose data property would return a bytes instance.
- Changed internal code to use public API functions wherever possible. This includes making many private API functions use calls to the public API for getting bits of data.
- Fixed potential issue with AttachmentBase.clsid which had the potential to cause some attachments to fail to generate a CLSID.
- Outright removed or changed a significant portion of the private API. I have rarely, if ever, seen references to these parts, so this should cause you no issues. Some of these have also been moved to the public API, either identically or with changes, and the mapping is as such:
  - _getNamedAs -> getNamedAs: Changed to always require a conversion argument. If you were previously using it to plainly get a named property or to handle the property being None or a real value, you should use the return value of getNamedProp instead.
  - _getPropertyAs -> getPropertyAs: Same as above, use getPropertyVal instead for None or plain access.
  - _getStreamAs -> getStreamAs, getStringStreamAs: Once again, see above. Use getStream and getStringStream, respectively.
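To find calls to the deprecated private names in your own code, one option is to surface DeprecationWarning while your tests run. The sketch below is an assumption-laden illustration: ExampleMsg is a hypothetical stand-in (not the real MSGFile class) whose _getStream forwards to getStream in the way described above, and the stream name is only sample input.

```python
import warnings


class ExampleMsg:
    """Hypothetical stand-in for illustration; not the real MSGFile."""

    def getStream(self, name: str) -> bytes:
        return b'data for ' + name.encode('ascii')

    def _getStream(self, name: str) -> bytes:
        # Deprecated private name that forwards to the public API.
        warnings.warn('_getStream is deprecated, use getStream.',
                      DeprecationWarning, stacklevel=2)
        return self.getStream(name)


msg = ExampleMsg()

# Record every warning raised inside the block instead of hiding it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always', DeprecationWarning)
    old = msg._getStream('__substg1.0_0037001F')
    new = msg.getStream('__substg1.0_0037001F')

# Messages from deprecated calls that should be migrated.
deprecatedCalls = [str(w.message) for w in caught
                   if issubclass(w.category, DeprecationWarning)]
```

Running a test suite with python -W error::DeprecationWarning turns the same warnings into exceptions, which makes the offending call sites fail loudly.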
Version 0.44.0
v0.44.0
- Fixed a bug that caused MessageBase.headerInit to always return False after the 0.42.0 update.
- Changed MessageBase.headerInit to a property.
- Fixed extract_msg.utils.__all__.
- Minor reorganization within extract_msg/utils.py.
- Minor changes to docstrings.
- Minor README updates.
- Fix issue with folded header fields decoding incorrectly when given to extract_msg.utils.decodeRfc2047.
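For readers unfamiliar with the folding problem, here is a minimal sketch of decoding a folded RFC 2047 header value using only the standard library. It is not the code used by extract_msg.utils.decodeRfc2047; decodeFoldedHeader is a hypothetical helper, and the fallback to ASCII for parts without a charset is a simplification.

```python
import email.header
import re


def decodeFoldedHeader(value: str) -> str:
    """Unfold a header value, then decode any RFC 2047 encoded words."""
    # Unfolding: a line break followed by whitespace is a continuation.
    unfolded = re.sub(r'\r?\n[ \t]+', ' ', value)
    parts = []
    for data, charset in email.header.decode_header(unfolded):
        if isinstance(data, bytes):
            # Simplification: treat parts without a charset as ASCII.
            data = data.decode(charset or 'ascii', errors='replace')
        parts.append(data)
    return ''.join(parts)


folded = '=?utf-8?b?SGVsbG8s?=\r\n =?utf-8?b?IHdvcmxk?='
decoded = decodeFoldedHeader(folded)
```

The whitespace between two adjacent encoded words is not part of the text, so the two words join without an extra space once the fold is removed.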
Version 0.43.0
v0.43.0
- [TeamMsgExtractor #56] [TeamMsgExtractor #248] Added new function MessageBase.asEmailMessage which will convert the MessageBase instance, if possible, to an email.message.EmailMessage object. If an embedded MSG file on a MessageBase object is of a class that does not have this function, it will simply be attached to the instance as bytes.
- Changed imports in message_base.py to help with type checkers.
- Changed from using email.parser.EmailParser to email.parser.HeaderParser in MessageBase.header.
- Changed some of the internal code for MessageBase.header. This should improve usage of it, and should not have any noticeable negative changes. You may notice some of the values parse slightly differently, but this effect should be mostly suppressed.
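As a rough picture of what a converted message with a bytes attachment looks like, the sketch below builds an email.message.EmailMessage by hand with the standard library. It does not call asEmailMessage; the addresses, the file name, the MIME type, and the placeholder bytes are all assumptions made for illustration.

```python
from email.message import EmailMessage

# Placeholder bytes standing in for an embedded MSG file (hypothetical data).
embeddedMsgBytes = b'\xd0\xcf\x11\xe0 placeholder'

message = EmailMessage()
message['Subject'] = 'Example subject'
message['From'] = 'sender@example.com'
message['To'] = 'recipient@example.com'
message.set_content('Plain text body.')

# Attach the embedded file as raw bytes with a generic binary type.
message.add_attachment(embeddedMsgBytes, maintype='application',
                       subtype='octet-stream', filename='embedded.msg')

attachments = list(message.iter_attachments())
```

Each item from iter_attachments exposes get_filename and get_content, so the original bytes can be recovered unchanged from the resulting message.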
Version 0.42.2
v0.42.2
- Fix bug in AttachmentBase.mimetype that would cause it to throw an error when accessed. This bug was introduced in v0.42.0.
Version 0.42.1
v0.42.1
- Fixed some constants being accessed with the wrong name (names were changed in reorganization).
- Removed unused regular expression.
Version 0.42.0
v0.42.0
- [TeamMsgExtractor #372] Changed the way that the save functions return a value. This makes the return value from all save functions much more informative, allowing a user to separate if a file or folder (or if more than one) was saved from the function. It also guarantees that all classes from this module will return the relevant path(s) if data is actually saved.
- [TeamMsgExtractor #288] Added feature to allow attachment save functions to simply overwrite existing files of the same name. This can be done with the overwriteExisting keyword argument from code or the --overwrite-existing option from the command line.
- [TeamMsgExtractor #40] Added new submodule custom_attachments. This submodule provides an extendable way to handle custom attachment types, attachment types whose structure and formatting are not defined in the Microsoft documentation for MSG files. This includes a handler to at least partially cover support for Outlook images.
- [TeamMsgExtractor #373] Added the encoding submodule for encoding tasks, including proper support for Microsoft's implementation of CP950. This gets added to the codecs list as "windows-950".
  - Added infrastructure to make it easy to add variable-byte (up to two bytes) encodings and single-byte encodings.
  - Added the following encodings:
    - windows-874
    - x-mac-ce
    - x-mac-cyrillic
    - x-mac-greek
    - x-mac-icelandic
    - x-mac-turkish
- Fixed an issue in the save functions that left the possibility for the zip files to not end up closing if the save function created it and then had an exception.
- Added new property AttachmentBase.clsid which returns the listed CLSID value of the data stream/storage of the attachment.
- Changed internal behavior of MSGFile.attachments. This should not cause any noticeable changes to the output.
- Refactored code significantly to make it more organized.
- Changed the exports from the main module to only include an important subset of the module. For other items, you'll have to import the submodule that it falls under to access it. Submodules export all important pieces, so it will be easier to find.
- This includes having many modules be under entirely new paths. Some of these changes have been done with no deprecation, something I generally try to avoid. This is happening at the same time as the public API is significantly changing, which makes it more acceptable.
- Fixed __main__ using the wrong enum for error behavior.
- Fixed Named.get being severely out of date (it's not used anywhere by the module which is why it wasn't noticed).
- Fixed Named.__getitem__ being entirely case-sensitive.
- Switched much of the internal code (and the treePath property of all classes that have it) to using weakref.ReferenceType to avoid hard cyclic references.
- Fixed Recipient._getTypedStream never returning a value.
- Added additional type hints in various places.
- Modified tests.py to only run if it is run as a file instead of imported.
- Changed knownMsgClass to a private function since it is explicitly not being exported by any part of the module.
- Removed unused function getFullClassName.
- Fixes to the HTML body when saving as HTML will no longer require the preparedHtml/--prepared-html option.
- Removed unused exceptions.
- Entirely reorganized the way attachments are initialized, including the class that will be used in various circumstances. Embedded MSG files, custom attachments, and web attachments will all use dedicated classes that are subclasses of AttachmentBase.
  - With this change, the way to specify a new Attachment class is to override the function used when creating attachments. This can be done by passing attachmentInit = myFunction as an option to openMsg. This function MUST return an instance of AttachmentBase.
- Added first implementation of web attachments. Saving is not currently possible, but basic relevant property access is now possible. Saving will not be stopped by this attachment if skipNotImplemented = True is passed to the save function.
- Changed the option to suppress RTFDE errors to fall under the ErrorBehavior enum. Usage of the original option will be allowable, but is being marked as deprecated. However, it is still a dedicated option from the command line.
  - Also fixed the option not properly ignoring some RTFDE errors, specifically the ones that it is normal for the module to throw.
- Removed some constants that are not used by the module.
- Updated to support RTFDE version 0.1.0. Users encountering random errors from that module should find that those errors have disappeared. If you get errors from it still, bring up the issue on their GitHub.
- Fixed bug that would cause weird behavior if you gave an empty string as the path for an MSG file.
- Added support for IPM.StickyNote.
- Fixed an issue that would cause an MSG file to never close if an error happened during any of the __init__ functions for MSG classes.
- Removed unneeded chardet dependency.
- Removed Contact.__init__ as it didn't provide any unique behavior.
- Changed the documentation of openMsg to specify that it accepts all options recognized by MSGFile subclasses, allowing the doc string to not be modified every time one of them is changed.
  - Changed the documentation of various __init__ methods to do the same thing.
- Added dataType property to AttachmentBase and SignedAttachment for checking the class that the data will be, if accessible. Returns None if the data is inaccessible, including because accessing it would throw an exception.
- Added new enum InsecureFeatures and option insecureFeatures. This option will allow certain features with security implications to be used for files that you trust. Currently the only feature it supports is the usage of PIL/Pillow to open and modify images. All features like this will be opt-in to reduce possible vulnerabilities.
- Modified all custom exceptions the module uses to derive from a single base class for better organization.
- Added new exceptions to handle some of the situations previously handled by base Python exceptions.
- Changed internal handling of the prefix option for MSGFile.__init__ (and therefore openMsg). If you are not setting this manually, you should notice little difference.
- Made enums less strict and converted all using fromBits to be IntFlag enums.
- Fixed CalendarBase.keywords being blatantly incorrect (it was so bad I don't know how it slipped through).
- Fixed Contact.gender being blatantly incorrect.
- Fixed sender not being properly decoded in some circumstances.
- Changed behavior of MSGFile to have olefile raise defects of type DEFECT_INCORRECT and above instead of just DEFECT_FATAL. Uncaught issues of DEFECT_INCORRECT can often cause the module to have parsing issues that may be misleading; this just ensures the issue is clarified. This behavior can be reverted back to the previous with ErrorBehavior.OLE_DEFECT_INCORRECT.
- Fixed potential issues that may have made it possible for certain attachments to ignore filename conflict resolution code.
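The switch to weak references mentioned above (for treePath and related internals) follows a common pattern: a child keeps only a weak reference to its parent, so parent and child do not form a hard reference cycle. The sketch below is an assumption, not the module's code; Node and its attributes are hypothetical, and only the name treePath is borrowed from the notes.

```python
import weakref
from typing import List, Optional


class Node:
    """Hypothetical tree node for illustration; not part of extract_msg."""

    def __init__(self, name: str, parent: Optional['Node'] = None):
        self.name = name
        # Hold the parent weakly so the child never keeps it alive.
        self._parent = weakref.ref(parent) if parent is not None else None

    @property
    def parent(self) -> Optional['Node']:
        # Dereference the weak reference; None if there is no live parent.
        return self._parent() if self._parent is not None else None

    @property
    def treePath(self) -> List[weakref.ReferenceType]:
        # Path of weak references from the root down to this node.
        parent = self.parent
        path = parent.treePath if parent is not None else []
        return path + [weakref.ref(self)]


root = Node('root')
child = Node('child', parent=root)
pathNames = [ref().name for ref in child.treePath]
```

Callers dereference an entry by calling it, and should be ready for None once the referenced object has been garbage collected.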
Version 0.41.5
v0.41.5
- Fixed an issue from version 0.41.3 where the header being present but missing the From field would cause an exception.
Version 0.41.4
v0.41.4
- Fixed an issue in the last version that would break the decoding function if the contents were not encoded.
- Updated tzlocal and allow future updates for compressed_rtf and ebcdic.