AI Tutorial: Creating a Custom ChatGPT That Extracts Data from Websites

Zachary Paul
5 min readNov 28, 2023

--

I’ve just created a GPT that extracts data from websites.

I named it ScrapeGPT. That said, it doesn’t actually go to the website and scrapes its data, or extracts the data from an HTML file as I showed you in a previous article. In this case, we’re going to use a different approach that will help us extract data from websites by just telling what data we want to get. Once we see how this works, we’re gonna quickly create a GPT to automate all this.

Just note that this method isn’t going to be appropriate for all websites out there, but it’s still powerful and easy to use. With that in mind, let’s build ScrapeGPT.

How this GPT works

Say, we want to scrape all the data related to “web scraping courses” from Udemy, so we go to Udemy, type “web scrapping” and get the search results.

In a previous article, I showed you how to extract all this data by saving this file as HTML and then giving it to ChatGPT. That was good, but we had to provide one by one the HTML elements of the data we wanted to extract (and it sometimes could fail because some HTML files are complex)

With this new method, we only need to save the page as a PDF file and specify the data to extract. That’s it.

How to save a PDF file?

  • Go to the page you want to extract data from
  • If you’re on Chrome or Edge, click on “File” and select “Print”, on “Layout” select “Landscape” and then click on “Save” to save the page as a PDF
  • If you’re on Safari, click on “File” and then select “Export as PDF”

Note that on some websites Chrome exports the PDF with incomplete sections of the page. If that’s the case use Safari or another browser (Safari has never failed me so far).

Once we have the PDF, we upload it to ChatGPT and type the following prompt.

Extract data from all the items listed in the file. Here’s the data you should extract from the first item. You have to extract the same data for the rest items

Course: Web Scraping in Python BeautifulSoup, Selenium & Scrapy 2023
Instructor: Frank Andrade
Rating: 4.4
Number of ratings: 1,087
Hours: 10 total hours

You have to put the data extracted on a table and export it to a CSV file.

- If one item has more than one price, just take the first price
- In case there’s missing data, leave that data as a null data.

In the prompt, we give as a sample the first item to extract. ChatGPT is going to extract the same data for the remaining items

and then put it into a CSV file.

Easy, right? This approach works wonders when you want to extract data from a list of items. You can even provide the second or third page from the search result to extract more data.

Let’s build the GPT

Now let’s build the GPT, so we can automate this prompt a little bit and easily reuse it.

Open the sidebar, click on Explore, and select “Create a GPT”. Go to the “Configure” options and paste the prompt in “Instructions.”

We’re going to slightly tweak the prompt in the instructions field. Here’s the final prompt.

You have to extract data from all the items listed in the file. To guide you, the user will give you the data you should extract from the first item. You have to extract the same data for as many items as you can.

You have to put the data extracted on a table and export it to a CSV file.

Notes:
- If one item has more than one price, just take the first price
- In case there’s missing data, leave that data as a null data.

Some pages might have a different structure, so feel free to add more notes, in case you need it.

I’ll name the GPT as ScrapeGPT and add a quick description. I’ll only check the Code Interpreter box in capabilities.

Here’s how the GPT looks.

Now we only need to upload the PDF and type the first sample. Let’s try with Amazon now.

Here’s what I got after typing “play station” on Amazon.

In the prompt, I’ll specify the data to extract from the first item and let ScrapeGPT get the rest of the items.

Here’s the CSV file exported.

After double-checking the CSV file, I realized there were some missing items, which was weird because when I tested the prompt without ScrapeGPT, GPT-4 successfully extracted the data from all the items.

I’ve tested different prompts in GPT-4 and GPT (created with GPT Builder) and it looks like the latter isn’t as smart. When the prompt is a bit more complex, some of my GPTs even get stuck on “searching my knowledge” or “Message in conversation not found” Have you noticed this issue with your GPTs? Hopefully, OpenAI will fix this issue soon.

--

--

Zachary Paul
Zachary Paul

No responses yet