Scraping own Next.js site that is protected by Auth0

Hi everyone.

I am working on a Next.js project where most of the pages are only visible to logged-in users. However, I also need to create a script that periodically scrapes text from these pages to populate an index for a site search engine.

I can perform a manual scrape using Puppeteer, either by connecting to a browser that is already logged in or by creating a new browser instance, passing in an email and password, and accessing the pages that way. However, when I try to run Chromium headlessly this method fails because the page goes to a captcha after I enter my credentials, so it is unsuitable for running as a periodic cron job.

I was wondering if there is a way to use the Auth0 Management API to get a token that I can insert into my headless browser as a cookie, so that I can bypass the login screen and access the pages directly?

Thanks in advance, I’m only a basic user of auth0 so apologies if there is an obvious answer staring me in the face.

Hi there @jon.pitans, welcome to the community!

Thanks for the detailed description :slight_smile:

You could enable the Resource Owner Password Grant (ROPG) for your application and get a token using the Authentication API directly. I believe this would allow you to get around the captcha issue.
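For anyone following along, a password-grant request to the Authentication API's `/oauth/token` endpoint might look roughly like the sketch below. This is only an illustration under assumptions: the names `PasswordGrantConfig`, `buildPasswordGrantBody` and `fetchTokens` are made up for the example, the domain and credentials are placeholders, the default scope is a guess at what a typical app wants, and it assumes a runtime with a global `fetch` (for example Node 18+). The grant also has to be enabled for the application in the dashboard.

```typescript
// Hypothetical helper types and names; nothing here comes from an Auth0 SDK.
interface PasswordGrantConfig {
  domain: string;        // tenant domain, e.g. "your-tenant.auth0.com" (placeholder)
  clientId: string;
  clientSecret?: string; // only needed for confidential clients
  username: string;
  password: string;
  audience?: string;     // optional API identifier
  scope?: string;        // assumed default below
}

interface TokenResponse {
  access_token: string;
  id_token?: string;
  token_type?: string;
  expires_in?: number;
}

// Build the form fields for a Resource Owner Password Grant request.
function buildPasswordGrantBody(cfg: PasswordGrantConfig): Record<string, string> {
  const body: Record<string, string> = {
    grant_type: "password",
    client_id: cfg.clientId,
    username: cfg.username,
    password: cfg.password,
    scope: cfg.scope ?? "openid profile email",
  };
  if (cfg.clientSecret) body.client_secret = cfg.clientSecret;
  if (cfg.audience) body.audience = cfg.audience;
  return body;
}

// POST the form to https://<domain>/oauth/token and return the parsed tokens.
async function fetchTokens(cfg: PasswordGrantConfig): Promise<TokenResponse> {
  const res = await fetch(`https://${cfg.domain}/oauth/token`, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams(buildPasswordGrantBody(cfg)).toString(),
  });
  if (!res.ok) {
    throw new Error(`Token request failed with status ${res.status}`);
  }
  return (await res.json()) as TokenResponse;
}
```

Keeping the body builder separate from the network call makes it easy to check the request fields without hitting the tenant. In a cron job the credentials would normally come from environment variables rather than being hard-coded.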

Thanks for the quick reply. I think I am already partway down that path: I have a machine-to-machine application set up and am using a password grant, passing a valid user's details and receiving an id_token and access_token, which seems to be working correctly.

What I don’t understand (and can’t find a clear answer for in the docs) is how to use these tokens to bypass the login page. Is one of these a session cookie, or do I need to do a further token exchange? Is the login process checking for a session cookie, a bearer token in the headers, or both? Also, when setting the cookie in the browser, what name should it take? I have found various options: auth0, auth0_compat, legacy_auth0, auth0SSO, and the list goes on. And what should the cookie domain be set to? Currently I am using the domain from the basic information panel on the Auth0 dashboard. Is this right, or should I be using the domain of the actual site?

If it is any help, in a standard logged-in browser there are two cookies, both with the domain of my site, called appSession.0 and appSession.1. They are long strings that look similar to the access_token and id_token I am getting.
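To make the cookie question concrete, here is a rough sketch of what I mean by inserting cookies into the headless browser: it copies the appSession.* cookies from a page that is already logged in onto a headless page. It assumes, without confirmation, that those cookies are what the site actually checks, and it does not use the tokens from the password grant at all. The names `SimpleCookie`, `CookieSource`, `CookieTarget`, `pickSessionCookies` and `copySessionCookies` are made up for illustration; the two interfaces are meant to match the `cookies()` and `setCookie()` methods on Puppeteer's `Page`, which may be deprecated in newer Puppeteer versions.

```typescript
// Hypothetical minimal cookie shape; Puppeteer's own cookie type has more fields.
interface SimpleCookie {
  name: string;
  value: string;
  domain: string;
  path?: string;
  httpOnly?: boolean;
  secure?: boolean;
}

// Minimal slices of the page API this sketch relies on (assumed to match Puppeteer's Page).
interface CookieSource {
  cookies(): Promise<SimpleCookie[]>;
}
interface CookieTarget {
  setCookie(...cookies: SimpleCookie[]): Promise<void>;
}

// Keep only the session cookie and its numbered chunks (appSession, appSession.0, appSession.1, ...).
function pickSessionCookies(all: SimpleCookie[], prefix = "appSession"): SimpleCookie[] {
  return all.filter((c) => c.name === prefix || c.name.startsWith(prefix + "."));
}

// Copy the session cookies from a logged-in page to another page; returns how many were copied.
async function copySessionCookies(from: CookieSource, to: CookieTarget): Promise<number> {
  const session = pickSessionCookies(await from.cookies());
  if (session.length > 0) {
    await to.setCookie(...session);
  }
  return session.length;
}
```

The cookies keep the domain they had in the logged-in browser, which here is the site's own domain rather than the Auth0 tenant domain. The copied session would still expire on its usual schedule, so this alone would not cover a long-running cron job.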

Again thanks in advance for any pointers.