Mar 22, 2023

The Secret to Training Large Language Models

Web scraping is a critical skill for training large language models. Here are the basics of web scraping and how to use it to collect and store data in a structured form.

AI ALGORITHM DATA

Santiago

Machine Learning. I run https://t.co/iZifcK7n47 and write @0xbnomial.

Member of Software Developers

Web scraping is a critical skill, and yet nobody talks about it.

How do you think companies are training their Large Language Models? Where do you think the data comes from?

While most people worry about better prompts, here is what you need to know about building the engine:
— Santiago (@svpino) March 22, 2023
Web scraping allows you to get a lot of data from websites at scale.

This data is unstructured. You use web scraping to collect and store it in a structured form like CSV or JSON.
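To make "structured form" concrete, here is a minimal sketch that serializes scraped records as JSON and as CSV using only Python's standard library. The records, field names, and function names are made up for illustration; none of them come from the original thread.

```python
import csv
import io
import json

# Hypothetical records, shaped like what a scraper might produce after parsing pages.
records = [
    {"url": "https://example.com/a", "title": "First page", "word_count": 120},
    {"url": "https://example.com/b", "title": "Second page", "word_count": 87},
]


def to_json(rows):
    """Serialize the rows as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the rows as CSV text, with a header row taken from the first record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(records)` or `to_json(records)` returns a string you can write to a file. JSON keeps nested fields intact, while CSV is easier to open in a spreadsheet.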

This is a critical skill that will open many doors for you.

Here is a summary of how the process works:

• Send a request to the URL you want to scrape.
• The server sends the HTML back.
• Your code parses the HTML and collects the data.
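The three steps above can be sketched with nothing but Python's standard library. This is one possible shape, not the only one: the class and function names, the User-Agent string, and the choice to collect the page title and links are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class PageParser(HTMLParser):
    """Collects the page title and every link target from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a link target
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Steps 1 and 2: send the request and read the HTML the server sends back."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_page(html):
    """Step 3: parse the HTML and collect the data as a structured record."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


def scrape(urls):
    """Run the same three steps for every URL in the list."""
    return [{"url": url, **parse_page(fetch_html(url))} for url in urls]
```

Keeping `fetch_html` and `parse_page` separate means you can test the parsing logic on a saved HTML string without touching the network.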

Rinse and repeat for every URL you want to scrape data from.

Libraries you can use for web scraping:

• Selenium: A web testing library used to automate browser activities.

• Playwright: Another library used for testing modern web apps.

• BeautifulSoup: A library used to parse HTML documents.

But there's only one problem:

Most people quickly find out the biggest problem with web scraping:

Websites often block your IP, making it difficult to access public web data.

It quickly becomes a cat-and-mouse game and a royal waste of time.

Fortunately, there's a solution:

I work with @bright_data. They gave me access to their new Scraping Browser API.

This is a game-changer!

Their global proxy network allows you to collect data from unique IPs, making the data collection process fast, easy to manage, and scalable.

Here is one example:

I wrote some code using Playwright and @bright_data to read the contents of my website.

It uses one of @bright_data's 72M proxies to visit my site.

To the website, this looks like a unique visitor rather than a Python web scraper. pic.twitter.com/Je0b2G6FrV
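The code in the linked image is not reproduced here, so what follows is not the original script. It is a rough sketch of how Playwright can drive a remote browser, assuming the service exposes a Chrome DevTools Protocol (CDP) WebSocket endpoint that accepts a username and password in the URL. The host, port, and function names are placeholders; take the real connection details from your own account.

```python
from urllib.parse import quote

# Placeholder connection details (assumptions, not values from the original post).
HOST = "scraping-browser.example.com"
PORT = 9222


def build_endpoint(username, password, host=HOST, port=PORT):
    """Build a WebSocket URL with the credentials percent-encoded."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"wss://{user}:{secret}@{host}:{port}"


def read_page(url, endpoint):
    """Open the URL in the remote browser and return the rendered HTML."""
    # Imported here so build_endpoint works even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # connect_over_cdp attaches to an already-running Chromium-based browser.
        browser = p.chromium.connect_over_cdp(endpoint)
        try:
            page = browser.new_page()
            page.goto(url, timeout=60_000)
            return page.content()
        finally:
            browser.close()
```

With real credentials, `read_page("https://example.com", build_endpoint(username, password))` would return the page's HTML, which you can then parse like any other HTML string.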

If you need structured web data, try this:

1. Create an account: https://t.co/OGiIObWSCz

2. Grab your username and password, and use them here: https://t.co/kPPmTwxC17

3. Start collecting web data. It's that easy!

Thanks to @bright_data for helping me bring this to you!