
Web scraping experiment with AI (Parsing HTML with GPT-4)

I've always been amazed at how good ChatGPT by OpenAI is at answering questions, and at the power of DALL·E 3 to create beautiful images. Now, with the new model, let's see how AI can handle our web scraping tasks, specifically parsing search engine results. We all know the drill: parsing data from raw HTML can often be cumbersome. But what if there were a way to turn this painstaking process into a breeze?

Recently (November 2023), the OpenAI team held their first developer conference: DevDay (feel free to watch it first). One of the exciting announcements is a larger context for GPT-4. The new GPT-4 Turbo model is more capable, cheaper, and supports a 128K context window.



Our little experiment

In the past, we compared some open-source and paid LLMs' ability to scrape "clean text" data into a simple format and developed an AI-powered parser.

This time, we'll level up the challenge:

  • Scrape directly from raw HTML data.
  • Turn it into a specific JSON format that we need.
  • Use little development time.

Our Goals:

  • Scrape a nicely structured website (as a warm-up).
  • Return organic results from the Google search results page.
  • Return the people-also-ask (related questions) section from Google SERP.
  • Return local results data from Google Maps.

Keep in mind that the AI is only tasked with parsing the raw HTML data, not doing the web scraping itself.



TL;DR

If you don't want to read the whole post, here is a summary of the pros and cons of our experiment using the OpenAI API (new GPT-4 model) for web scraping:

Pros

  • The new gpt-4-1106-preview model is able to scrape raw HTML data perfectly. The larger token window makes it possible to simply pass the raw HTML for parsing.
  • OpenAI "function calling" can return the exact response format that we need.
  • OpenAI "multiple function calling" can return data from multiple data points.
  • The ability to scrape raw HTML is definitely a huge plus compared to the development time of parsing manually.

Cons

  • The cost is still high compared to using other SERP API providers.
  • Watch out for the cost when passing the whole raw HTML. We still need to trim it to scrape only the relevant parts; otherwise, you'll pay a lot for token usage.
  • The speed is still too slow for production use.
  • For "hidden data" that is often found in a script tag, an extra AJAX request, or upon performing an action (e.g., clicking, scrolling), we still need to handle things manually.

  • Since we'll use OpenAI's API, make sure to register and get your api_key first. You might need your OpenAI org ID as well.
  • I'm using Python for this experiment, but feel free to use any programming language you want.
  • Since we want to return a consistent JSON format, we'll be using the new function calling feature from OpenAI, where we can define the response's keys and values in a nice format.
  • We'll use the model gpt-4-1106-preview.
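
As a side note, here is a minimal sketch of initializing the client from environment variables instead of hardcoding credentials. The variable names OPENAI_API_KEY and OPENAI_ORG_ID are conventions the openai Python library also reads on its own, but verify against the library version you use:

import os
from openai import OpenAI

# Read credentials from the environment rather than hardcoding them.
# organization is optional; leave it unset if you don't use an org ID.
client = OpenAI(
    organization=os.environ.get("OPENAI_ORG_ID"),
    api_key=os.environ["OPENAI_API_KEY"],
)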

Basic Code

Make sure to install the OpenAI library first. Since I'm using Python, I need:

pip install openai

I'll also install the requests package to get the raw HTML:

pip install requests

Here's what our code base will look like:

import json
import requests
from openai import OpenAI

client = OpenAI(
  organization='YOUR-OPENAI-ORG-ID',
  api_key='YOUR-OPENAI-API-KEY'
)

target_url = "https://books.toscrape.com/" # Target URL; it changes for each level
response = requests.get(target_url)
html_text = response.text



Level 1: Scraping a nice/simple structured web page with AI

Let's warm up first. We'll target the https://books.toscrape.com/ site first, since it has a very clean structure that makes it easy to read.

Screenshot of books.toscrape.com, our first web scraping target

Here's what our code looks like (with explanations below):

# Chat Completion API from OpenAI
completion = client.chat.completions.create(
  model="gpt-4-1106-preview", # Feel free to change the model to gpt-3.5-turbo-1106
  messages=[
    {"role": "system", "content": "You are a master at scraping and parsing raw HTML."},
    {"role": "user", "content": html_text}
  ],
  tools=[
          {
            "type": "function",
            "function": {
              "name": "parse_data",
              "description": "Parse raw HTML data nicely",
              "parameters": {
                'type': 'object',
                'properties': {
                    'data': {
                        'type': 'array',
                        'items': {
                            'type': 'object',
                            'properties': {
                                'title': {'type': 'string'},
                                'rating': {'type': 'number'},
                                'price': {'type': 'number'}
                            }
                        }
                    }
                }
              }
          }
        }
    ],
   tool_choice={
       "type": "function",
       "function": {"name": "parse_data"}
   }
)

# Collect the data results
argument_str = completion.choices[0].message.tool_calls[0].function.arguments
argument_dict = json.loads(argument_str)
data = argument_dict['data']

# Print in a nice format
for book in data:
    print(book['title'], book['rating'], book['price'])

  • We're using the Chat Completions API from OpenAI.
  • Use model: gpt-4-1106-preview.
  • Use the prompt "You are a master at scraping and parsing raw HTML." and pass the raw HTML to be analyzed.
  • In the tools parameter, we define our imaginary function to parse the raw data. Don't forget to adjust the properties of parameters to return the exact format you want.

Here is the result:

We're able to scrape the title, rating, and price (exactly the data we defined in the function parameters above) of each book.

Time to finish: ~15s

Compare web scraping results

Using gpt-3.5

When switching to gpt-3.5-turbo-1106, I have to adjust the prompt to be more specific:

messages: {"position": "system", "content material": "You're a grasp at scraping and parsing uncooked HTML. Scrape ALL the e book knowledge outcomes"},

# And the operate description
"operate": {
              "title": "parse_data",
              "description": "Get all books knowledge from uncooked HTML knowledge",
}

Without mentioning "scrape ALL book data," it will just return the first few results.

Time to finish: ~9s



Level 2: Parse organic results from Google SERP with AI

The Google search results page is not like the previous site. It has a more complicated structure, unclear CSS class names, and includes a lot of unknown data in the raw HTML.

Target URL: https://www.google.com/search?q=coffee&gl=us

WARNING! At first, I just parsed everything from Google's raw HTML. It turns out it contains too many characters, which means more tokens and more cost!

Watch out for your OpenAI billing usage

So, after a few tries, I decided to trim only the body part and remove the style and script tag contents.
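
Before sending anything, it helps to sanity-check how many tokens the trimmed HTML will consume. Here is a minimal sketch using the tiktoken library with the cl100k_base encoding (the one used by the GPT-4 family); the price constant is an assumption based on the gpt-4-1106-preview input rate published at the time (~$0.01 per 1K tokens), so verify current pricing yourself:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def estimate_input_cost(text, usd_per_1k_tokens=0.01):
    # Count tokens the same way the model will, then apply the assumed rate
    num_tokens = len(encoding.encode(text))
    print(f"{num_tokens} tokens -> ~${num_tokens / 1000 * usd_per_1k_tokens:.2f}")

estimate_input_cost(html_text)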

I adjusted the prompt and function parameters like this:

import re # extra import for regex

response = requests.get('https://www.google.com/search?q=coffee&gl=us')
html_text = response.text

# Remove unnecessary parts to prevent a HUGE TOKEN cost!
# Remove everything between <head> and </head>
html_text = re.sub(r'<head.*?>.*?</head>', '', html_text, flags=re.DOTALL)
# Remove all occurrences of content between <script> and </script>
html_text = re.sub(r'<script.*?>.*?</script>', '', html_text, flags=re.DOTALL)
# Remove all occurrences of content between <style> and </style>
html_text = re.sub(r'<style.*?>.*?</style>', '', html_text, flags=re.DOTALL)

completion = client.chat.completions.create(
  model="gpt-4-1106-preview",
  messages=[
    {"role": "system", "content": "You are a master at scraping Google results data. Scrape top 10 organic results data from Google search result page."},
    {"role": "user", "content": html_text}
  ],
  tools=[
          {
          "type": "function",
          "function": {
            "name": "parse_data",
            "description": "Parse organic results from Google SERP raw HTML data nicely",
            "parameters": {
              'type': 'object',
              'properties': {
                  'data': {
                      'type': 'array',
                      'items': {
                          'type': 'object',
                          'properties': {
                              'title': {'type': 'string'},
                              'original_url': {'type': 'string'},
                              'snippet': {'type': 'string'},
                              'position': {'type': 'integer'}
                          }
                      }
                  }
              }
            }
          }
        }
    ],
   tool_choice={
       "type": "function",
       "function": {"name": "parse_data"}
   }
)

argument_str = completion.choices[0].message.tool_calls[0].function.arguments
argument_dict = json.loads(argument_str)
data = argument_dict['data']

for result in data:
    print(result['title'])
    print(result['original_url'] or '')
    print(result['snippet'] or '')
    print(result['position'])
    print('---')

  • First, we trim only the selected parts.
  • Adjust the prompt to "You are a master at scraping Google results data. Scrape top 10 organic results data from Google search result page."
  • Adjust the function parameters to any format you like.

basic web scraping on Google SERP with AI

Ta-da! We get exactly the data we need despite the complicated format of Google's raw HTML.

Time to finish: ~28s

Note: My original prompt was "Parse organic results from Google SERP raw HTML data nicely." It only returned the first 3-5 results, so I adjusted the prompt to get a more precise number of results.

Using the gpt-3.5 model

I wasn't able to try this, since the raw HTML data volume exceeds the token limit.



Level 3: Parse local place results from Google Maps with AI

Now, let's scrape another Google product: Google Maps. This is our target page: https://www.google.com/maps/search/coffee/@40.7455096,-74.0083012,14z?hl=en&entry=ttu

Google Maps screenshot

As you can see, each of the items includes a lot of information. We'll scrape:

  • Name
  • Rating average
  • Total rating
  • Price
  • Address
  • Extras
  • Hours
  • Extra service
  • Thumbnail image

Warning! It turns out Google Maps loads this data via JavaScript, so I have to change my strategy for getting the raw HTML from requests to Selenium for Python.

Code

Install Selenium for Python. More instructions on installation are here.

pip install selenium

Import Selenium

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

Create a headless browser instance to surf the web:

target_url="https://www.google.com/maps/search/espresso/@40.7455096,-74.0083012,14z?hl=en"
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(choices=op)
driver.get(target_url)

driver.implicitly_wait(1) # seconds

# get uncooked html
html_text = driver.page_source

# You'll be able to proceed like earlier technique, the place we trim solely the physique half first

I wait 1 second with implicitly_wait to make sure the data is already there to scrape.
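
Note that implicitly_wait only controls how long Selenium polls when looking up elements; it doesn't by itself pause for JavaScript-rendered content. A more reliable pattern is an explicit wait for a known element. Here is a minimal sketch; the div[role="feed"] selector is an assumption about Google Maps' current markup and may change at any time:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the results list exists in the DOM,
# then grab the page source once it has actually rendered.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[role="feed"]'))
)
html_text = driver.page_source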

Now, here's the OpenAI API call:

completion = client.chat.completions.create(
  model="gpt-4-1106-preview",
  messages=[
    {"role": "system", "content": "You are a master at scraping Google Maps results. Scrape all local places results data"},
    {"role": "user", "content": html_text}
  ],
  tools=[
          {
          "type": "function",
          "function": {
            "name": "parse_data",
            "description": "Parse local results detail from Google MAPS raw HTML data nicely",
            "parameters": {
              'type': 'object',
              'properties': {
                  'data': {
                      'type': 'array',
                      'items': {
                          'type': 'object',
                          'properties': {
                              'position': {'type': 'integer'},
                              'title': {'type': 'string'},
                              'rating': {'type': 'string'},
                              'total_reviews': {'type': 'string'},
                              'price': {'type': 'string'},
                              'type': {'type': 'string'},
                              'address': {'type': 'string'},
                              'phone': {'type': 'string'},
                              'hours': {'type': 'string'},
                              'service_options': {'type': 'string'},
                              'image_url': {'type': 'string'},
                          }
                      }
                  }
              }
            }
          }
        }
    ],
   tool_choice={
       "type": "function",
       "function": {"name": "parse_data"}
   }
)


argument_str = completion.choices[0].message.tool_calls[0].function.arguments
argument_dict = json.loads(argument_str)
data = argument_dict['data']

print(data)

Result:

local maps API scraping with AI results

It looks good! I can get the exact data for each of these local_results.

Time to finish with Selenium: ~47s

Time to finish (excluding Selenium time): ~34s



Level 4: Parsing two different data points (organic results and the people-also-ask section) from Google SERP with AI

As you might know, Google SERP doesn't just display organic results, but also other data like ads, people-also-ask (related questions), knowledge graphs, and so on.

Let's see how to target multiple data points with the multiple function calling feature from OpenAI.

Here's the code:


completion = client.chat.completions.create(
  model="gpt-4-1106-preview",
  messages=[
    {"role": "system", "content": "You are a master at scraping Google results data. Scrape two things: 1st. Scrape top 10 organic results data and 2nd. Scrape people_also_ask section from Google search result page."},
    {"role": "user", "content": html_text}
  ],
  tools=[
          {
          "type": "function",
          "function": {
            "name": "parse_organic_results",
            "description": "Parse organic results from Google SERP raw HTML data nicely",
            "parameters": {
              'type': 'object',
              'properties': {
                  'data': {
                      'type': 'array',
                      'items': {
                          'type': 'object',
                          'properties': {
                              'title': {'type': 'string'},
                              'original_url': {'type': 'string'},
                              'snippet': {'type': 'string'},
                              'position': {'type': 'integer'}
                          }
                      }
                  }
              }
            }
          }
        },
          {
          "type": "function",
          "function": {
            "name": "parse_people_also_ask_section",
            "description": "Parse `people also ask` section from Google SERP raw HTML",
            "parameters": {
              'type': 'object',
              'properties': {
                  'data': {
                      'type': 'array',
                      'items': {
                          'type': 'object',
                          'properties': {
                              'question': {'type': 'string'},
                              'original_url': {'type': 'string'},
                              'answer': {'type': 'string'},
                          }
                      }
                  }
              }
            }
          }
        }
    ],
    tool_choice="auto"
)


# Organic results
argument_str = completion.choices[0].message.tool_calls[0].function.arguments
argument_dict = json.loads(argument_str)
organic_results = argument_dict['data']

print('Organic results:')

for result in organic_results:
    print(result['title'])
    print(result['original_url'] or '')
    print(result['snippet'] or '')
    print(result['position'])
    print('---')

# People also ask
argument_str = completion.choices[0].message.tool_calls[1].function.arguments
argument_dict = json.loads(argument_str)
people_also_ask = argument_dict['data']

print('People also ask:')
for result in people_also_ask:
    print(result['question'])
    print(result['original_url'] or '')
    print(result['answer'] or '')
    print('---')


Code Explanation:

  • Adjust the prompt to include specific information on what to scrape: "You are a master at scraping Google results data. Scrape two things: 1st. Scrape top 10 organic results data and 2nd. Scrape the people_also_ask section from the Google search result page."
  • Add and separate the functions: one for organic results and one for the people-also-ask section.
  • Test the output in two different formats.
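
One caveat with tool_choice="auto": the model decides which functions to call and in what order, so indexing tool_calls[0] and tool_calls[1] directly can break if the order changes or one call is skipped. A safer pattern (my own addition, not part of the original experiment) is to dispatch on each tool call's function name:

# Group returned arguments by the function the model chose to call,
# instead of assuming a fixed order in tool_calls.
results_by_function = {}
for tool_call in completion.choices[0].message.tool_calls:
    args = json.loads(tool_call.function.arguments)
    results_by_function[tool_call.function.name] = args.get('data', [])

organic_results = results_by_function.get('parse_organic_results', [])
people_also_ask = results_by_function.get('parse_people_also_ask_section', [])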

Here's the result:

Scraping multiple data points with AI

Success:

I can scrape both the organic_results and the people_also_ask separately. Kudos to OpenAI!

Problem:

I'm not able to scrape the answer and original URL for the people_also_ask section. The reason is that this information is hidden somewhere in a script tag. We could fix this by providing that specific part of the script content, but I'd consider it cheating for this experiment, since we want to pass the raw HTML without pinpointing or giving a hint.

Time to finish: ~30s

If you want to learn how to scrape this data cheaper, faster, and more accurately, you can read these posts:



Table comparison with SerpApi

Here's the time comparison of using OpenAI's new GPT-4 model for web scraping vs. SerpApi. We're comparing against the 'normal speed'; SerpApi is faster (roughly twice as fast) when using Ludicrous Speed.

Subject                                    gpt-4-1106-preview    SerpApi
Organic results                            15s                   2.4s
Organic results with related questions     30s                   2.4s
Maps local results                         47s                   2.7s



Conclusion

OpenAI has definitely improved a lot over time. It's now possible to scrape a website and collect relevant data with the API. But based on the time it takes, it's not yet ready for production, commercial purposes, or scale. While the accuracy of the data and the response format look good, it's still far behind in terms of cost and speed.

Let me know if you have any thoughts, spot any errors, or have other experiments relevant to this post that you'd like to add, at hilman(at)serpapi(dot)com.

Thanks for reading this post!
