Extract data from HTML and XML documents with Python BeautifulSoup4

This is a quick note on BeautifulSoup4. beautifulsoup4 is a Python module that is commonly used to traverse document trees and extract data from them.

It is especially helpful for data scientists.

On Termux, first install some helper modules and parsers. For HTML, use the lxml parser, which is very fast.

pip3 install requests lxml html5lib eng_to_ipa

eng_to_ipa converts English text to IPA (pronunciation).
requests sends HTTP requests, for example to download a page.

Scraping data from a web page is sometimes not allowed, or even illegal, so check the site's terms before doing it.

So far, extracting data from Wikipedia seems to be OK.

Then install beautifulsoup4:

pip3 install beautifulsoup4

For example with HTML:

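from bs4 import BeautifulSoup
import requests
import eng_to_ipa as IPA


# Wrap the IPA transcription of a word in slashes
def getIpa(word): return '/' + IPA.convert(word) + '/'
# Escape double quotes so the text can be embedded in a JSON string
def P(t): return str(t).replace('"', '\\"')

def main(f):
    # Read the HTML file and close it once done
    with open(f, 'r') as fh:
        html_content = fh.read()
    soup = BeautifulSoup(html_content, "lxml")
    # print(soup.prettify())

    # Start of the hand-built JSON output
    resJSon = '''
    "list":
    ['''

    # All <li> tags that carry a data-hw attribute, whatever its value
    li = soup.find_all("li", attrs={"data-hw": True})

...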
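
As a self-contained illustration of the attribute filter used above, here is a minimal sketch that runs the same find_all() call on a small inline HTML string. The sample markup, the data-hw values, the helper name extract_headwords, and the use of json.dumps (instead of assembling the JSON string by hand) are all assumptions made for illustration; this is not the elided part of the script above. The built-in "html.parser" is used only so that the sketch runs without lxml.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample markup; the real input file is not shown in this note.
SAMPLE_HTML = """
<ul>
  <li data-hw="apple">apple: a round fruit</li>
  <li>no headword attribute here</li>
  <li data-hw="banana">banana: a long yellow fruit</li>
</ul>
"""


def extract_headwords(html):
    """Return the data-hw value of every <li> that carries that attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={"data-hw": True} matches any <li> that has a data-hw attribute,
    # whatever its value is.
    items = soup.find_all("li", attrs={"data-hw": True})
    return [li["data-hw"] for li in items]


headwords = extract_headwords(SAMPLE_HTML)
# json.dumps takes care of quoting and escaping, so no manual
# replace('"', ...) step is needed.
result_json = json.dumps({"list": headwords})
print(result_json)
```

Letting the json module do the serialization avoids escaping mistakes that are easy to make when building the string piece by piece.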
    

XML and namespace

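Pass "xml" as the parser, as in BeautifulSoup(xml_content, "xml"), and use a CSS selector to pick tags by namespace. The "xml" parser is provided by lxml, so lxml must be installed.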

Here is a code snippet from the doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

# from https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors  
  
from bs4 import BeautifulSoup  
xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">  
 <ns1:child>I'm in namespace 1</ns1:child>  
 <ns2:child>I'm in namespace 2</ns2:child>  
</tag> """  
soup = BeautifulSoup(xml, "xml")  
  
soup.select("child")  
# [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]  
  
# Prefixes declared in the document are picked up automatically,
# so no namespaces argument is needed here.
soup.select("ns1|child")
# [<ns1:child>I'm in namespace 1</ns1:child>]
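
When the prefixes in the selector should not depend on the ones written in the document, select() also accepts an explicit mapping from prefix to namespace URI. The following is a minimal sketch of that variant; the prefixes "first" and "second" are hypothetical names chosen for this example, and only the URIs have to match the document.

```python
from bs4 import BeautifulSoup

xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
 <ns1:child>I'm in namespace 1</ns1:child>
 <ns2:child>I'm in namespace 2</ns2:child>
</tag>"""

# The "xml" parser is provided by lxml, which must be installed.
soup = BeautifulSoup(xml, "xml")

# Hypothetical prefixes for this sketch, mapped to the URIs declared above.
namespaces = {"first": "http://namespace1/", "second": "http://namespace2/"}

# Match <child> tags by namespace URI, using our own prefix.
matches = soup.select("second|child", namespaces=namespaces)
texts = [m.get_text() for m in matches]
print(texts)
```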

One of the most used methods is find_all().

Method signature: find_all(name, attrs, recursive, string, limit, **kwargs)

Each argument of this method acts as a filter, so it can be a string, a regular expression, or even a function.
name is the HTML tag name, such as p, div, span…
attrs is a dictionary of attributes to match, such as class, id, data-… etc.
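
The different kinds of filters can be sketched as follows. The sample markup and the helper function has_class_but_no_id are assumptions made for illustration, and the built-in "html.parser" is used only so that the sketch runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<div id="main">
  <h1>Title</h1>
  <h2 class="sub">Subtitle</h2>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
  <span data-id="42">A span</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# name as a plain string: every <p> tag.
paragraphs = soup.find_all("p")

# name as a regular expression: tags whose name is h1, h2, ... h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))


# name as a function: it is called with each tag and keeps those
# for which it returns True.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


classed = soup.find_all(has_class_but_no_id)

# attrs as a dict: handy for attribute names such as data-* that are
# not valid Python keyword arguments.
spans = soup.find_all(attrs={"data-id": "42"})

# limit stops the search after the given number of matches.
first_paragraph = soup.find_all("p", limit=1)

print([t.name for t in headings])
print([t.name for t in classed])
```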

It also supports CSS selectors through .select(). This means we can select the nth tag of a kind among its siblings, select the children of a particular tag, etc…
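
A few common selectors, as a minimal sketch. The sample markup is an assumption made for illustration, and the built-in "html.parser" is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<div id="menu">
  <ul>
    <li>one</li>
    <li class="hot">two</li>
    <li>three</li>
  </ul>
  <p>outside <span>nested</span></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# :nth-of-type(n) picks the n-th <li> among its siblings (counting from 1).
second_item = soup.select("li:nth-of-type(2)")

# "parent > child" matches direct children only.
direct_children = soup.select("#menu > p")

# A space (descendant selector) matches at any depth.
nested_spans = soup.select("#menu span")

# select_one returns the first match, or None if nothing matches.
hot = soup.select_one("li.hot")

print(second_item[0].get_text())
print(hot.get_text())
```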

Reading the official documentation alone is enough to start using it: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

A tutorial:

https://zetcode.com/python/beautifulsoup/