
Add cdp support for xpath scrapers #625

Merged 18 commits into stashapp:develop on Aug 4, 2020

Conversation

bnkai (Collaborator) commented Jun 20, 2020

Adds an extra driver option to the scraper config. Usage:

driver:
  useCDP: true
  remote: true
  sleep: 3

makes loadURL use Chrome CDP to fetch the URL and thus partially render it.
If useCDP is set to false, or the driver option is absent, the plain HTTP client is used instead.
If remote is set to true, CDP looks for a remote Chrome instance at the address 127.0.0.1:9222, compatible with the headless Chrome docker image https://hub.docker.com/r/chromedp/headless-shell/
If remote is unset or set to false, it defaults to looking for google-chrome in the $PATH.
sleep is the time in seconds to wait before actually getting the page from the DOM. This is needed because some sites (bang.com, for example) need more time for their loading scripts to finish. If unset, it defaults to 2 seconds.
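
The option semantics above can be sketched roughly as follows; the struct and function names here are illustrative only, not the actual stash code:

```go
package main

import "fmt"

// driverOptions mirrors the driver: block of a scraper config.
// The type and field names are illustrative, not stash's own types.
type driverOptions struct {
	UseCDP bool // use Chrome CDP instead of the plain http client
	Remote bool // look for a remote Chrome at 127.0.0.1:9222
	Sleep  int  // seconds to wait before reading the DOM
}

// applyDefaults fills in the documented default: when sleep is
// unset (zero), fall back to 2 seconds.
func applyDefaults(o *driverOptions) {
	if o.Sleep <= 0 {
		o.Sleep = 2
	}
}

func main() {
	// A config that sets useCDP/remote but omits sleep.
	opts := driverOptions{UseCDP: true, Remote: true}
	applyDefaults(&opts)
	fmt.Printf("useCDP=%v remote=%v sleep=%d\n", opts.UseCDP, opts.Remote, opts.Sleep)
}
```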

Having this option allows us to create scrapers for sites that pull data from js.

The following puretaboo scraper becomes possible (performers and tags can't be retrieved without CDP):

name: puretaboo
sceneByURL:
  - action: scrapeXPath
    url: 
      - https://www.puretaboo.com/en/video
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //div[@class="styles_Rms8xL5kg6"]/h2
      Image: //video[@class="vjs-tech"]/@poster
      Date:
        selector: //span[@class="Text ScenePlayer-ReleaseDate-Text styles_3tU3Z2sLeO"]/text()
        parseDate: 2006-01-02
      Details:
        selector: //meta[@name="twitter:description"]/@content
        replace:
          - regex: </br>|<br>
            with: "\n"
      Tags:
        Name:
          selector: //div[@class="BackgroundBox ScenePlayer-SceneCategories-BackgroundBox styles_1khKtnnA8W"]/a
      Performers:
        Name:
          selector: //div[@class="component-ActorThumb-List"]/a/@title
      Studio:
        Name:
          selector: //div[@class="styles_1oxPFmiuVp"]/a/@title
debug:
  printHTML: true
driver:
  useCDP: true
  remote: true
#Last Updated June 23, 2020

Use of the debug option is advised so that you can look at the actual page text that is returned.

bnkai (Collaborator, Author) commented Jun 20, 2020

Another example of a CDP-only scraper:

name: Bang
sceneByURL:
  - action: scrapeXPath
    url:
      - https://www.bang.com/video
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //meta[@name="og:title"]/@content
      Details: //meta[@name="description"]/@content
      Image: //meta[@name="og:image"]/@content        
      Date:
        selector: //span[@class="hidden-xs fa-text-right fa-text-left"]/text()
        replace:
          - regex: (\w+\s)(\d+)\w+,(\s\d{4})
            with: $1$2$3
        parseDate: Jan 2 2006
      Tags:
        Name:
          selector: //div[@class="genres bottom-buffer10"]/a
      Performers:
        Name:
          selector: //div/span[@class="fa-text-right" and contains(text(),"With:")]/../span[@class="comma-list-container"]/span/a/text()
      Studio:
        Name: //div/span[@class="fa-text-right" and contains(text(),"Studio:")]/a/text()
debug:
  printHTML: true
driver:
  useCDP: true
  remote: true
  sleep: 5

bnkai added the "improvement" label Jun 20, 2020
bnkai (Collaborator, Author) commented Jun 21, 2020

Added a clearCookies option, so we now have the options below, listed with their default values:

driver:
  useCDP: false
  remote: false
  clearCookies: false

clearCookies clears **ALL** browser cookies and then fetches the URL. It is mainly needed for managing the remote CDP instance, as by default cookies are enabled and stored in the docker container. For users running their local Chrome browser it is not really needed or recommended. I added some helper functions for setting individual cookies, but from what I tested they are not needed, since CDP with the remote instance stores all cookies. If finer control over cookies is needed later, I can revisit the issue.

WithoutPants (Collaborator):
I don't think allowing scraper configs to wipe Chrome cookies is safe or reasonable. Further, because it is in the config, there is no way of tailoring it to docker and non-docker instances.

I think another way is needed to achieve this. I'm not very clear on the protocols, but maybe this might help?

bnkai (Collaborator, Author) commented Jun 22, 2020

The clearCookies option was supposed to be for users who write scrapers, not end users. I needed to delete some cookies while testing a CDP scraper and didn't know how to. I agree, as stated above, that it isn't recommended though.
I think it might be better to add support for deleting a specific cookie (name, value, domain) later on if needed, and just remove the clearCookies option.

I'll have a look at what you referenced; if it enables something like an incognito mode, I'll implement it in a later PR.

EDIT: merged with upstream code. Removed clearCookies code. Basic functionality should be complete.

WithoutPants (Collaborator):
I'm not sure that the remote option should be a scraper-specific option. Is there a legitimate reason that two different scrapers will use different values for remote? Seems like it should be a stash configuration option.

bnkai (Collaborator, Author) commented Jun 25, 2020

Yes, I was also reluctant about this. I was either going to put it there or make it a scraper option beneath the user-agent setting.
My thinking was that the remote option is only available for the CDP driver, and in some scrapers users may want to use their local Chrome instance to get past authentication (session) issues, as I haven't yet implemented any set-cookie functionality. It doesn't matter to me though, as I can add it in the UI.
Was the remote: false option working for you in Windows, though? I couldn't try that.

WithoutPants (Collaborator):
Tried the following:

  • copied the puretaboo scraper as-is and tested without changing any paths or docker stuff. Scraped, but did not set performers/tags
  • set remote to false and retested. Same results as above.
  • added my chrome directory to the path environment variable and retried. Same results as above.

FWIW chrome isn't named google-chrome on Windows. I'm not sure how else to get this to work.

bnkai (Collaborator, Author) commented Jul 1, 2020

That's weird.
If you use CDP and set remote correctly, it should fetch the site. If the performers/tags are not shown, you might need to increase the sleep option from its default of 2 (probably the scripts loading the performers/tags weren't finished in time).

If there is an error locating a Chrome instance, you should get a red GraphQL message like this in the UI:
Error: GraphQL error: exec: "google-chrome": executable file not found in $PATH
That was my error when I set remote to false without having a local instance.

I don't know how CDP locates the Chrome binary in Windows, but since you got some results I am fairly certain it got detected and used. Having a look at the process manager in Windows while running a stash CDP query might verify that. (You can increase the sleep option to 5 to force it to wait a little longer.)

Running against the following urls:

https://www.puretaboo.com/en/video/puretaboo/mommy-monster/132502
https://www.puretaboo.com/en/video/puretaboo/family-vacation/144431
https://www.puretaboo.com/en/video/puretaboo/nerd's-breaking-point/148252
https://www.puretaboo.com/en/video/puretaboo/resisting-arrest/149296
https://www.puretaboo.com/en/video/puretaboo/half-his-age---part-2/126933
https://www.puretaboo.com/en/video/puretaboo/half-his-age%3A-bts-featurette/127740
https://www.puretaboo.com/en/video/puretaboo/swapping-daughters%3A-the-other-family/144420
https://www.puretaboo.com/en/video/puretaboo/dibs-on-mom/147694

I get everything scraped as expected from puretaboo:
https://pastebin.com/9eHZK4re

WithoutPants (Collaborator):
My test system already had a puretaboo scraper (under a different filename) which was interfering with this test.

I can confirm that the scraper appears to give the correct results when I add the chrome directory to the PATH. I haven't tried the remote configuration.

I think the chrome configuration needs to be put into the general configuration. A free text field that accepts either a remote chrome instance URL, or a path to the chrome executable. This way the behaviour is clear and deterministic. remote can then be removed from the scraper configuration.

bnkai (Collaborator, Author) commented Aug 2, 2020

@WithoutPants your PR changes work fine using the headless-shell docker as a remote URL (/json/version needs to be present in the URL; I added that as an example), thanks.
Adding it as a path also worked (I used /usr/bin/google-chrome, to be exact).

WithoutPants added this to the Version 0.3.0 milestone Aug 4, 2020
WithoutPants merged commit 4373f9b into stashapp:develop Aug 4, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>