
Let's Brew the Soup aka BeautifulSoup - A web scraping journey

Python is a versatile programming language that has gained widespread popularity for its diverse applications, including web scraping and data extraction. It was during the COVID-19 pandemic in 2020 that I developed a keen interest in Robotic Process Automation (RPA). Intrigued by its potential, I considered acquiring a license for UiPath, a leading RPA platform. However, due to financial constraints, I was unable to proceed with the purchase.

Initially, I felt disheartened by the realization that web scraping through traditional means might not be as efficient as the automated bots offered by RPA solutions. These bots not only possess robust web scraping capabilities but also offer additional features such as email notifications and seamless file conversion to formats like CSV and XLS.

It was then that I discovered the BeautifulSoup module in Python, a library for parsing HTML and XML that is widely used for web scraping. Installation is straightforward, making it accessible to a wide range of users. BeautifulSoup has proven to be a valuable resource, enabling me to extract data from websites with relative ease, albeit without the advanced functionalities provided by RPA bots.

Despite the initial setback, my exploration of Python's web scraping capabilities has been an enriching journey, showcasing the language's versatility and the vast array of tools available to developers and enthusiasts alike. 

pip3 install beautifulsoup4 requests pandas

The process of web scraping with Python's BeautifulSoup module involves several steps. First, it is necessary to import the required libraries: bs4 for the BeautifulSoup parser and requests for fetching the HTML content of a webpage. Once the libraries are imported, the URL of the target website is passed to the requests library to retrieve the HTML code, which is then parsed by BeautifulSoup.
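To make this concrete, here is a minimal sketch of those first steps; the URL is a placeholder, so substitute the page you actually want to scrape.

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with the page you want to scrape
url = "https://example.com/table-page"

# Fetch the raw HTML, then hand it to BeautifulSoup for parsing
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")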

The parsed HTML is typically assigned to variables for further manipulation. The next step involves identifying the relevant HTML tags that contain the desired data, such as headers (<th>), rows (<tr>), and table cells (<td>). This can be achieved by leveraging the browser's inspection tool, which allows you to inspect the HTML structure of the webpage and locate the specific table or data elements you wish to scrape.
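Continuing that sketch, the snippet below assumes the first <table> element on the page holds the data; use the inspector to confirm that is the right table for your page.

# Locate the first table on the page
table = soup.find("table")

# Column names come from the <th> header cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Each <tr> row holds <td> data cells; rows without <td> cells
# (such as the header row) are skipped
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)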

If the target data is nested within complex HTML structures, indexing techniques may be employed to navigate and extract the desired information accurately. An example code snippet is provided below to illustrate the concept and facilitate a better understanding of the web scraping process using BeautifulSoup.

Ultimately, this simple web scraping approach, which utilizes Python modules, aims to extract data from web pages, organize it into a structured format (such as a table or a list), and optionally save it as a CSV file or any other desired format for further analysis or processing.
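The snippet below is a minimal end-to-end sketch of that approach; the URL is a placeholder, and the page is assumed to contain a single, simple table whose header cells match its data columns.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder URL -- point this at the table page you want to scrape
url = "https://example.com/fortune-list"

# Fetch and parse the page
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find("table")

# Column names from the header cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Data rows: keep only <tr> elements that contain <td> cells
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

# Organize the scraped data into a structured table
df = pd.DataFrame(rows, columns=headers)
print(df.head())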




You can then save the resulting DataFrame as a CSV file or in other formats:
df.to_csv('/......../...../...../Fortune.csv', index=False)

