Documentation parser fixes #4853

Merged
merged 7 commits into from
Oct 23, 2017

@chenriksson chenriksson commented Oct 17, 2017

Fixes #4783 and #4816, and restricts markdown links to http/https.

var encoded = HttpUtility.HtmlEncode(readMeMd);
var encodedLines = encoded.Replace("\r\n", "\n").Split('\n');

var blockQuotePattern = new Regex("^ {0,3}>");

Contributor

Would it be better if blockQuotePattern was a static readonly property?

Contributor

@loic-sharma loic-sharma Oct 17, 2017

Also, this HtmlEncodeExceptBlockquotes method seems rather expensive. Would it be worthwhile to check if readMeMd has the > substring, and if it doesn't, simply return encoded?

Member Author

Will play with this to see if I can optimize it.

Member Author

Changed to a single Multiline regex applied to the entire markdown.
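
For readers following the thread, here is a minimal sketch of what a single Multiline regex pass could look like. It is not the code merged in this PR: the class name `ReadMeEncodingSketch` is hypothetical, and `WebUtility.HtmlEncode` is swapped in for `HttpUtility.HtmlEncode` so the snippet needs no reference to System.Web. The pattern string is the `EncodedBlockQuotePattern` shown later in this conversation.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not the ReadMeService implementation.
public static class ReadMeEncodingSketch
{
    // Matches an HTML-encoded '>' at the start of any line, after at most
    // three leading spaces (the indentation allowed for a blockquote marker).
    private static readonly Regex EncodedBlockQuotePattern =
        new Regex("^ {0,3}&gt;", RegexOptions.Multiline);

    public static string HtmlEncodeExceptBlockquotes(string readMeMd)
    {
        // Encode the whole document first, so raw HTML in the markdown is neutralized.
        var encoded = WebUtility.HtmlEncode(readMeMd);

        // Restore only the line-leading "&gt;" so the markdown parser still
        // recognizes blockquotes; every other "&gt;" stays encoded.
        return EncodedBlockQuotePattern.Replace(
            encoded, m => m.Value.Replace("&gt;", ">"));
    }
}
```

Because the regex runs once over the whole string, there is no need to split the markdown into lines and re-join it, which was the cost concern raised above.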

{
CommonMarkConverter.ProcessStage3(document, htmlWriter, settings);

var regex = new Regex("<a href=([\"\']).*?\\1");

Contributor

Consider making this regex also a static readonly.

Contributor

@loic-sharma loic-sharma Oct 17, 2017

Also, is CommonMarkConverter guaranteed to always generate links with the format <a href=? If not, this regex may not match all links. Would a simple string replace of <a with <a rel="nofollow" work too?

Member Author

Yes, the current version always emits the href attribute first, followed by an optional title attribute. There's an open issue to add support for other link attributes, but it won't be added until the commonmark spec includes it.
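
As an illustration of the approach under discussion, the sketch below shows one way the link regex could be used to append `rel="nofollow"`. The replacement string and the names `NoFollowSketch` and `AddNoFollow` are assumptions for this example; the PR only shows the pattern itself. It relies on the behavior described above, namely that the generated anchor always starts with `<a href=`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the actual replacement used in the PR is not shown.
public static class NoFollowSketch
{
    // Matches '<a href=' followed by a quoted URL. Group 1 captures the
    // opening quote and \1 requires the same quote character to close it.
    private static readonly Regex CommonMarkLinkPattern =
        new Regex("<a href=([\"']).*?\\1");

    public static string AddNoFollow(string html)
    {
        // $0 is the whole match, so rel="nofollow" lands right after the
        // href attribute and before any title attribute.
        return CommonMarkLinkPattern.Replace(html, "$0 rel=\"nofollow\"");
    }
}
```

If a future CommonMark.NET release changed the attribute order, the pattern would silently stop matching, which is the risk the reviewer is pointing at.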

{
// Block javascript in links.
if ((inline.Tag == InlineTag.Link) &&
(inline.TargetUrl.IndexOf("javascript", StringComparison.InvariantCultureIgnoreCase) >= 0))

Member Author

Will change this to only allow target URLs that use http/https.
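
A rough sketch of such an allow-list check follows, assuming the target URL is available as a string. The names `LinkTargetSketch` and `IsAllowedLinkTarget` are hypothetical, and the merged code may do this differently. Unlike the substring test for "javascript" in the diff above, this accepts only absolute URLs whose scheme is http or https, so other schemes are rejected as well.

```csharp
using System;

// Hypothetical helper for illustration; not the code merged in this PR.
public static class LinkTargetSketch
{
    public static bool IsAllowedLinkTarget(string targetUrl)
    {
        // Uri.Scheme is normalized to lower case, so "HTTP://..." also passes.
        // Relative URLs fail the UriKind.Absolute parse and are rejected.
        return Uri.TryCreate(targetUrl, UriKind.Absolute, out var uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

An allow-list is easier to reason about than a block-list here, since it does not depend on enumerating every dangerous scheme.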

@chenriksson
Member Author

@blowdart Can you have a look?

@blowdart
Member

LGTM


namespace NuGetGallery
{
internal class ReadMeService : IReadMeService
{
private static readonly Regex EncodedBlockQuotePattern = new Regex("^ {0,3}&gt;", RegexOptions.Multiline);
private static readonly Regex CommonMarkLinkPattern = new Regex("<a href=([\"\']).*?\\1");

Contributor

I feel uneasy about using regexes to insert the rel=nofollow. Are you sure CommonMark always generates links that start with <a href=?

Member Author

This regex was added in PR #4841. I discussed it with @ryuyu and we agreed the regex was probably better than forking CommonMark.NET.

For CommonMark.NET link formatting see HtmlFormatterSlim.

CommonMark.NET had a request to support link attributes, but it won't be added unless attribute extensions are added to the underlying commonmark spec.

Member Author

I should add a timeout for these regexes, as pointed out in other PRs.
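
For reference, a regex timeout in .NET can be set through the `Regex` constructor overload that takes a `TimeSpan`. The sketch below is an assumption about how that might look for the blockquote pattern: the one-second value, the capture group added to the pattern, and the names `RegexTimeoutSketch` and `TryRestoreBlockquotes` are all illustrative and not taken from the PR.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the timeout value is an assumption.
public static class RegexTimeoutSketch
{
    private static readonly TimeSpan RegexTimeout = TimeSpan.FromSeconds(1);

    // Same idea as EncodedBlockQuotePattern, with the leading spaces captured
    // so a plain replacement string can put them back.
    private static readonly Regex EncodedBlockQuotePattern =
        new Regex("^( {0,3})&gt;", RegexOptions.Multiline, RegexTimeout);

    public static bool TryRestoreBlockquotes(string encoded, out string result)
    {
        try
        {
            result = EncodedBlockQuotePattern.Replace(encoded, "$1>");
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            // Matching took longer than RegexTimeout; the caller decides
            // whether to fall back to the fully encoded text or fail.
            result = null;
            return false;
        }
    }
}
```

Returning a flag rather than swallowing the exception keeps the fallback policy with the caller.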

@chenriksson chenriksson merged commit f5ab583 into dev Oct 23, 2017
@chenriksson chenriksson deleted the chenriks-doc-bugs branch October 23, 2017 19:00
6 participants