This is a quick note on BeautifulSoup4.
beautifulsoup4 is a Python module commonly used to traverse document trees and extract data, which makes it helpful for data scientists.
On Termux, first install some helper modules and parsers. For HTML, use the lxml parser, which is very fast.
pip3 install requests lxml html5lib eng_to_ipa
eng_to_ipa converts English text to IPA (pronunciation).
requests is for sending HTTP requests, etc.
Scraping data from a web page is sometimes not allowed, or even illegal.
So far, extracting data from wikipedia.com seems to be OK.
Then install beautifulsoup4:
pip3 install beautifulsoup4
For example with HTML:
from bs4 import BeautifulSoup
import requests
import eng_to_ipa as IPA

def getIpa(word):
    # Wrap the IPA transcription in slashes
    return '/' + IPA.convert(word) + '/'

def P(t):
    # Escape double quotes so the text can sit inside a JSON string
    return str(t).replace('"', '\\"')

def main(f):
    with open(f, 'r') as fh:
        html_content = fh.read()
    soup = BeautifulSoup(html_content, "lxml")
    # print(soup.prettify())
    resJSon = '''
"list":
['''
    # Keep only <li> tags that carry a data-hw attribute
    li = soup.find_all("li", attrs={"data-hw": True})
    ...
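The elided part of the script is not reproduced here. As a separate, minimal sketch of the same idea (collecting the `data-hw` values into a JSON list), the snippet below lets the standard `json` module do the quoting instead of escaping by hand. The sample markup and the function name `headwords_to_json` are made up for illustration, and it uses Python's built-in "html.parser" so it runs without lxml.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real input file is not shown in this note.
SAMPLE_HTML = """
<ul>
  <li data-hw="apple">apple: a fruit</li>
  <li>no headword here</li>
  <li data-hw="pear">pear: another fruit</li>
</ul>
"""

def headwords_to_json(html):
    # "html.parser" ships with Python; swap in "lxml" if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    # attrs={"data-hw": True} matches any <li> that has the attribute,
    # whatever its value is.
    items = soup.find_all("li", attrs={"data-hw": True})
    # json.dumps takes care of quoting and escaping.
    return json.dumps({"list": [li["data-hw"] for li in items]})

print(headwords_to_json(SAMPLE_HTML))
```

The `<li>` without `data-hw` is skipped, so only the two headwords end up in the list.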
XML and namespace
Pass "xml" as the parser, as in BeautifulSoup(html_content, "xml"), and use a CSS selector. The "xml" parser requires lxml to be installed.
Here is a code snippet from the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
# from https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
from bs4 import BeautifulSoup
xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
<ns1:child>I'm in namespace 1</ns1:child>
<ns2:child>I'm in namespace 2</ns2:child>
</tag> """
soup = BeautifulSoup(xml, "xml")
soup.select("child")
# [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]
soup.select("ns1|child", namespaces=soup.namespaces)
# [<ns1:child>I'm in namespace 1</ns1:child>]
One of the most used methods is find_all().
Method signature: find_all(name, attrs, recursive, string, limit, **kwargs)
- each argument acts as a filter, so you can pass a string, a regex, or even a function
- name is the HTML tag name, like p, div, span…
- attrs is a dictionary of attributes to match: class, id, data-.. etc.
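The filters above can be sketched with a small example. The markup and variable names are invented for illustration, and "html.parser" is used so the snippet does not depend on lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstrating the filters.
html = """
<div id="main">
  <p class="intro">Hello</p>
  <p>World</p>
  <span data-id="7">seven</span>
  <b>bold</b>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name as a plain string: every <p> tag
paragraphs = soup.find_all("p")

# name as a compiled regex: tags whose name starts with "s"
s_tags = soup.find_all(re.compile(r"^s"))

# name as a function: tags that have a data-id attribute
with_data_id = soup.find_all(lambda tag: tag.has_attr("data-id"))

# attrs as a dictionary: <p> tags with class "intro"
intro = soup.find_all("p", attrs={"class": "intro"})

# limit stops the search after the first match
first_p = soup.find_all("p", limit=1)

# string matches text nodes instead of tags
world = soup.find_all(string="World")

print([t.get_text() for t in paragraphs])
print([t.name for t in s_tags])
```

Every call returns a list, so an empty result is `[]` rather than `None`.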
It also supports CSS selectors through .select(). This means we can select the nth tag counting from a particular tag, select child tags, etc.
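A short sketch of what that looks like in practice, assuming a recent Beautiful Soup 4 where the CSS support comes from the bundled soupsieve package. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstrating selectors.
html = """
<ul id="menu">
  <li>one</li>
  <li class="hot">two</li>
  <li>three</li>
</ul>
<p>after the list</p>
"""
soup = BeautifulSoup(html, "html.parser")

# nth tag: the second <li> directly inside #menu
second = soup.select("#menu > li:nth-of-type(2)")

# child tags: every <li> that is a direct child of a <ul>
children = soup.select("ul > li")

# select_one returns the first match (or None) instead of a list
hot = soup.select_one("li.hot")

# adjacent sibling: the <p> that comes right after the <ul>
after_list = soup.select("ul + p")

print(second[0].get_text())
print([li.get_text() for li in children])
```

`.select()` always returns a list, while `.select_one()` returns a single tag or `None`.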
Reading the documentation alone is good enough to start using it: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
A tutorial: