Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backfill repo metadata tool #6288

Merged
merged 5 commits into from
Aug 15, 2018
Merged

Backfill repo metadata tool #6288

merged 5 commits into from
Aug 15, 2018

Conversation

skofman1
Copy link
Contributor

@skofman1 skofman1 commented Aug 9, 2018

/// This tool collect repository metadata for all packages in the DB from nuspec files in V3 flat container and updates DB with this data.
/// Usage:
/// 1. To collect repository metadata:
/// a. Configure app.config with DB information and service index url
/// b. Run this tool with: GalleryTools.exe -c
/// This will create a file repositoryMetadata.txt with all collected data. You can stop the job anytime and restart. cursor.txt contains current position.
///
/// 2. To update DB:
/// a. Run GalleryTools.exe -u
/// This will update DB from file repositoryMetadata.txt. You can stop the job anytime and restart.


<add key="Gallery.SqlServer" value="Data Source=(localdb)\mssqllocaldb; Initial Catalog=NuGetGallery; Integrated Security=True; MultipleActiveResultSets=True"/>
<add key="Gallery.SupportRequestSqlServer" value="Data Source=(localdb)\mssqllocaldb; Initial Catalog=NuGetGallery; Integrated Security=True; MultipleActiveResultSets=True"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Gallery.SupportRequestSqlServer added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultDependenciesModule can't be created without it.

@joelverhagen
Copy link
Member

Spoke offline about this, but we should have a README.md that is next to the csproj that describes how this command works. Granted, the other commands don't have docs but we can improve this as we go on.

CommandOption fileName = config.Option("-f | --file", "File to use", CommandOptionType.SingleValue);

config.HelpOption("-? | -h | --help");
config.OnExecute(() =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OnExecute can accept a Func<Task<int>>. This is means the whole method can be async and you can avoid .GetAwaiter().GetResult().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{
Log.LogMessage("Initializing");

_context = new EntitiesContext(connectionString, readOnly: true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, readonly. nice.


Log.LogMessage($"Start time: {startTime.ToString("u")}");

var packagesRepository = new EntityRepository<Package>(_context);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have DI above, why no use it? Makes this more flexible for the future (future backfills...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain?


var packagesRepository = new EntityRepository<Package>(_context);

var allPackages = packagesRepository.GetAll().Where(p => p.Created < lastCreateTime && p.Created > startTime && p.PackageStatusKey == PackageStatus.Available);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you can miss validating packages?

Copy link
Member

@joelverhagen joelverhagen Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or less importantly failed validation packages that we reflow?

Paging over Edited will mean you never miss a package.

Also, consider adding a 5-30 minute subtract when first starting due to clock skew.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LastEdited is null for ~1M packages. Will stay with Created. Added support for validating packages. Don't see how clock skew is a factor if we compare always to Created that's DB generated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those can come in out of order. There is no guarantee that a sequence of Created timestamps are committed in the same order that they are generated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can order by LastEdited, PackageKey then. But it's fine if you include validating.


private static async Task<RepositoryMetadata> GetRepositoryMetadata(string id, string normalizedVersion)
{
string nuspecUri = $"{_flatContainerUri}{id.ToLowerInvariant()}/{normalizedVersion.ToLowerInvariant()}/{id.ToLowerInvariant()}.nuspec";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: trim trailing '/' and include '/' in the string format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_flatContainerUri might have trailing / -- or might not. It's an easy config mistake to make.

just a nit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private static string GetFlatContainerUri(Uri serviceDiscoveryUri)
{
var serviceDiscoveryClient = new ServiceDiscoveryClient(serviceDiscoveryUri);
return serviceDiscoveryClient.GetEndpointsForResourceType("PackageBaseAddress/3.0.0").GetAwaiter().GetResult().First().AbsoluteUri;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be async methods. .GetAwaiter().GetResult() should only be in one place per app. (In most cases)


private static async Task WriteMetadata(DateTime creationDate, string packageId, string packageVersion, RepositoryMetadata repositoryMetadata)
{
await _metadataFileStreamWriter.WriteLineAsync(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider one JSON object per line to avoid escaping problems. Or use a CSV library. Repository metadata could have a comma (type in particular).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using CsvHelper

}

private static async Task CommitBatch(DateTime cursorTime)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might run into memory problems since the EF context will hold onto all of the old entities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can recreate the EF context each batch to mitigate this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is indeed an issue for the "collect" part. However, introducing batching there is too much work. Will leave as is. Thanks for the cursor the job can continue after crash.


CommandOption lastCreateTimeOption = config.Option("-l | --lastcreatetime", "The latest creation time of package we should check", CommandOptionType.SingleValue);
CommandOption collectData = config.Option("-c | --collect", "Collect Repository metadata and save in file", CommandOptionType.NoValue);
CommandOption updateDB = config.Option("-u | --update", "Update DB with Repository metadata", CommandOptionType.NoValue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a less aggressive starting point for how quickly updateDB updates batches. This could be done with batchSize and sleepSeconds parameters. Then you can do some batches slowly on PROD before kicking it into high gear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the same batching config as I did for "hash" Gallery tool. It saved a 100 and had no sleep. There was not DB impact.

counter = 0;
}

metadata = await TryReadNextMetadata();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do ... while should be able to eliminate the duplicated metadata = await TryReadNextMetadata(); line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will leave as is...

Copy link
Member

@joelverhagen joelverhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⌚️

Copy link
Member

@joelverhagen joelverhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@skofman1 skofman1 merged commit 1b3707e into dev Aug 15, 2018
@skofman1 skofman1 deleted the repotool branch August 15, 2018 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants