Basics of Web Scraping and practical exercise
Useful links for self-learning:
https://www.youtube.com/watch?v=3xQTJi2tqgk
http://docs.python-requests.org
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Installations:
1. Install requests (make sure it's 'requests', not 'request')
pip install requests --upgrade
2. Install beautifulsoup4
pip install bs4
or,
pip install beautifulsoup4
If an older version is installed and you want to update the package, do the following:
pip install beautifulsoup4 --upgrade
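After installing, a quick sanity check (not part of the original exercise) is to confirm both packages import and that Beautiful Soup can parse a trivial snippet:

```python
# verify that requests and beautifulsoup4 are installed and importable
import requests
from bs4 import BeautifulSoup

print(requests.__version__)  # the installed version string (varies by system)
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.text)           # prints "ok"
```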
Coding with Python:
Initial setup when you start coding:
# import libraries
import requests
from bs4 import BeautifulSoup
url = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA"
r = requests.get(url)
# r.content  # careful with this step: it is the raw page HTML, i.e. all the data in an unstructured form
# parse the HTML using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(r.content, 'html.parser')  # the parsed document is now usable with Beautiful Soup
# Initial setup: everything up to this step.
"""If you use soup = BeautifulSoup(r.content) in the above step, Beautiful Soup would warn:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html.parser")
"""
print(soup.prettify())  # prints the markup in an indented, easy-to-read format
==================
soup.find_all("a")  # returns a list of ALL <a> tags on the page
======================
# to narrow down the search:
for link in soup.find_all("a"):
    print(link)
# this prints every <a> element on the page
====================
for link in soup.find_all("a"):
    print(link.get("href"))
# this prints only the URLs (href attributes) of the links
============================
for link in soup.find_all("a"):
    print(link.text)
# this prints only the link texts
=================================
for link in soup.find_all("a"):
    print(link.text, link.get("href"))
# this prints the link text first, followed by the associated URL
=======================================
for link in soup.find_all("a"):
    print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
# this prints each URL and link text in a defined pattern
# (note: without print(), the original version built the string and discarded it)
==============================================
links = soup.find_all("a")
for link in links:
    if "http" in link.get("href"):
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
# this prints the pattern only for links whose href contains 'http'
=====================================================
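One caveat with the snippet above: `link.get("href")` returns `None` for anchors that have no href attribute, and `"http" in None` raises a TypeError. A safer version of the same filter, shown on a small inline HTML sample (an illustration only; the real page content comes from `requests.get(url).content`):

```python
from bs4 import BeautifulSoup

# hand-made sample: one normal link, one anchor with no href at all
html = "<a href='http://example.com'>Example</a><a name='anchor-only'>No href</a>"
soup = BeautifulSoup(html, "html.parser")

filtered = []
for link in soup.find_all("a"):
    href = link.get("href")
    if href and "http" in href:  # the `href and` guard skips None safely
        filtered.append("<a href='%s'>%s</a>" % (href, link.text))

print("\n".join(filtered))  # prints only the Example link
```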
g_data = soup.find_all("div", {"class": "info"})
for item in g_data:
    print(item.text)
# this prints the text of every <div class="info"> element
=====================================
for item in g_data:
    print(item.contents)
# this prints the list of direct children of each element matched in g_data
================================
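To make `.contents` concrete, here is what it returns on a small hand-made fragment (an illustration only; the real yellowpages markup is more complex):

```python
from bs4 import BeautifulSoup

# a minimal stand-in for one search-result listing
html = "<div class='info'><a class='business-name'>Cafe A</a><p class='adr'>1 Main St</p></div>"
soup = BeautifulSoup(html, "html.parser")

item = soup.find("div", {"class": "info"})
print(item.contents)       # the div's direct children: the <a> and the <p>
print(len(item.contents))  # prints 2
```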
# to count the ratings:
for item in soup.find_all("span", {"class": "count"}):
    print(item.text)
# this prints the rating counts of the coffee shops
# (note: the attribute filter must be a dict; "class": "count" without braces is a syntax error)
============================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].text)
# this prints the names of the coffee shops on that page
========================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[1].text)
# this prints the text of the second child of each <div class="info">
=====================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].text)
    print(item.contents[1].text)
========================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
# the list of coffee shop names, again
====================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
    print(item.contents[1].find_all("p", {"class": "adr"})[0].text)
# coffee shop name, then its address on the next line
=================
for item in soup.find_all("div", {"class": "info"}):
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
    print(item.contents[1].find_all("p", {"class": "adr"})[0].text)
    print(item.contents[1].find_all("div", {"class": "primary"})[0].text)
# coffee shop name, then the address, and then the phone number
===================
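The three-field loop above raises an IndexError as soon as one listing is missing an address or phone number. A more defensive sketch of the same idea, run against a small inline sample (the class names `info`, `business-name`, `adr`, and `primary` mirror the selectors used above; the real yellowpages markup may differ):

```python
from bs4 import BeautifulSoup

# inline sample mimicking the listing structure; the second shop has no phone
html = """
<div class="info">
  <h2><a class="business-name">Cafe A</a></h2>
  <div><p class="adr">1 Main St</p><div class="primary">555-0100</div></div>
</div>
<div class="info">
  <h2><a class="business-name">Cafe B</a></h2>
  <div><p class="adr">2 Side St</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

shops = []
for item in soup.find_all("div", {"class": "info"}):
    # item.find(...) returns None instead of raising when a field is missing
    name = item.find("a", {"class": "business-name"})
    addr = item.find("p", {"class": "adr"})
    phone = item.find("div", {"class": "primary"})
    shops.append({
        "name": name.text if name else None,
        "address": addr.text if addr else None,
        "phone": phone.text if phone else None,
    })

for shop in shops:
    print(shop)
```

Searching with `item.find(...)` instead of indexing into `item.contents` also makes the code immune to stray whitespace nodes between tags, which count as children in `.contents`.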