Basics of Web Scraping and practical exercise

Useful links for self-learning:

https://www.youtube.com/watch?v=3xQTJi2tqgk

http://docs.python-requests.org

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Installations:
1. Install requests (make sure it's 'requests', not 'request')
    pip install requests --upgrade

2. Install beautifulsoup4
    pip install bs4
    or,
    pip install beautifulsoup4
    If an older version is already installed and you want to update the package, do the following:
    pip install beautifulsoup4 --upgrade
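Once both packages are installed, a quick sanity check (assuming a standard Python 3 setup) is to import them and print their versions:

```python
# Sanity check: both imports should succeed after the pip installs above.
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)
```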

Coding with Python:

Initial setup when you start coding:

# import libraries
import requests
from bs4 import BeautifulSoup

url = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA"

r = requests.get(url)

# r.content  # careful with this step: it dumps the whole page's raw HTML in one unstructured blob

# parse the HTML using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(r.content, 'html.parser')  # the page is now searchable with Beautiful Soup

# Initial setup: everything up to this step.

"""If you use, in the above step, soup = BeautifulSoup(r.content), i would say:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")
"""

print(soup.prettify())  # prints the markup in an easy-to-read, indented format

==================

soup.find_all("a") // this will return a list of ALL <a>

======================
# to narrow down the search:

for link in soup.find_all("a"):
    print(link)
# this prints each <a> tag of the page, one per line

====================

for link in soup.find_all("a"):
    print(link.get("href"))
# this will return only the URLs (the href attribute) of all the links

============================
for link in soup.find_all("a"):
    print(link.text)
# this will return the link text of every <a> tag

=================================

for link in soup.find_all("a"):
    print(link.text, link.get("href"))

# this will return the link text first, followed by the associated URL
=======================================
for link in soup.find_all("a"):
    print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
# this will return the link and link text within a defined pattern

==============================================
links = soup.find_all("a")

for link in links:
    if link.get("href") and "http" in link.get("href"):
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
# only links whose href contains 'http' produce output in the defined pattern
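Note that link.get("href") returns None when an &lt;a&gt; tag has no href attribute at all, so it is worth guarding before the substring test. A self-contained sketch using made-up markup in place of the scraped page:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the scraped page.
demo_html = """
<a href="https://example.com">External</a>
<a href="/search">Relative</a>
<a>No href at all</a>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")
for link in demo_soup.find_all("a"):
    href = link.get("href")          # may be None
    if href and "http" in href:
        print("<a href='%s'>%s</a>" % (href, link.text))
# prints only: <a href='https://example.com'>External</a>
```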
=====================================================
g_data = soup.find_all("div", {"class": "info"})

for item in g_data:
  print(item.text)

# this will return the text of every <div> element with class "info"
=====================================
for item in g_data:
  print(item.contents)
# lists the direct children (.contents) of each item matched by g_data
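.contents can be explored offline as well; it returns a list of an element's direct children, tags and bare strings alike. A small illustration with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup: a div with two child elements.
snippet = "<div class='info'><a>Name</a><p>Address</p></div>"
div = BeautifulSoup(snippet, "html.parser").find("div", {"class": "info"})
print(div.contents)          # [<a>Name</a>, <p>Address</p>]
print(div.contents[0].text)  # -> Name
```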

================================
# for counting the ratings:
for item in soup.find_all("span", {"class": "count"}):
    print(item.text)

# this will return the rating counts of the coffee shops

============================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].text)

# this will return a list of the names of the coffee shops on the page

========================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[1].text)

# this will return the text of the second child of each <div class="info">

=====================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].text)
    print(item.contents[1].text)

========================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
# list of coffee shop names, again

====================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
    print(item.contents[1].find_all("p", {"class": "adr"})[0].text)

# coffee shop name, then the address on the next line

=================

for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
    print(item.contents[1].find_all("p", {"class": "adr"})[0].text)
    print(item.contents[1].find_all("div", {"class": "primary"})[0].text)

# coffee shop name, then the address, and then the phone number
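The steps above can be folded into one script. This is a sketch, not a guaranteed-working scraper: the class names ("info", "business-name", "adr", "primary") match the Yellow Pages markup at the time of writing and may change, so each lookup is guarded:

```python
import requests
from bs4 import BeautifulSoup

URL = ("https://www.yellowpages.com/search"
       "?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA")

def scrape(html):
    """Return a list of dicts with name, address and phone per listing."""
    soup = BeautifulSoup(html, "html.parser")
    shops = []
    for item in soup.find_all("div", {"class": "info"}):
        name = item.find("a", {"class": "business-name"})
        addr = item.find("p", {"class": "adr"})
        phone = item.find("div", {"class": "primary"})
        shops.append({
            "name": name.text if name else None,  # guard: tag may be absent
            "address": addr.text if addr else None,
            "phone": phone.text if phone else None,
        })
    return shops

# To run against the live page (network access required):
# r = requests.get(URL)
# for shop in scrape(r.content):
#     print(shop)
```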

===================



















