Parse HTML

Goal

This post introduces how to parse HTML with BeautifulSoup, both from a plain string and from a webpage fetched with requests.

Library

In [12]:
from bs4 import BeautifulSoup
import requests
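BeautifulSoup needs a parser backend. This post uses the built-in html.parser; a minimal sketch of the common alternatives (lxml and html5lib are optional third-party installs, shown here only for illustration):

soup = BeautifulSoup('<p>hi</p>', 'html.parser')  # built-in, no extra install
# soup = BeautifulSoup('<p>hi</p>', 'lxml')       # faster; needs `pip install lxml`
# soup = BeautifulSoup('<p>hi</p>', 'html5lib')   # most lenient; needs `pip install html5lib`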

Simple HTML from string

In [24]:
html_simple = '<h1>This is Title</h1>'
html_simple
Out[24]:
'<h1>This is Title</h1>'
In [25]:
soup = BeautifulSoup(html_simple, 'html.parser')
In [26]:
soup.text
Out[26]:
'This is Title'
In [21]:
[c for c in dir(soup) if not c.startswith('_')]
Out[21]:
['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'HTML_FORMATTERS',
 'NO_PARSER_SPECIFIED_WARNING',
 'ROOT_TAG_NAME',
 'XML_FORMATTERS',
 'append',
 'attrs',
 'builder',
 'can_be_empty_element',
 'childGenerator',
 'children',
 'clear',
 'contains_replacement_characters',
 'contents',
 'currentTag',
 'current_data',
 'declared_html_encoding',
 'decode',
 'decode_contents',
 'decompose',
 'descendants',
 'encode',
 'encode_contents',
 'endData',
 'extend',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAll',
 'findAllNext',
 'findAllPrevious',
 'findChild',
 'findChildren',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findParent',
 'findParents',
 'findPrevious',
 'findPreviousSibling',
 'findPreviousSiblings',
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',
 'find_next_sibling',
 'find_next_siblings',
 'find_parent',
 'find_parents',
 'find_previous',
 'find_previous_sibling',
 'find_previous_siblings',
 'format_string',
 'get',
 'getText',
 'get_attribute_list',
 'get_text',
 'handle_data',
 'handle_endtag',
 'handle_starttag',
 'has_attr',
 'has_key',
 'hidden',
 'index',
 'insert',
 'insert_after',
 'insert_before',
 'isSelfClosing',
 'is_empty_element',
 'is_xml',
 'known_xml',
 'markup',
 'name',
 'namespace',
 'new_string',
 'new_tag',
 'next',
 'nextGenerator',
 'nextSibling',
 'nextSiblingGenerator',
 'next_element',
 'next_elements',
 'next_sibling',
 'next_siblings',
 'object_was_parsed',
 'original_encoding',
 'parent',
 'parentGenerator',
 'parents',
 'parse_only',
 'parserClass',
 'parser_class',
 'popTag',
 'prefix',
 'preserve_whitespace_tag_stack',
 'preserve_whitespace_tags',
 'prettify',
 'previous',
 'previousGenerator',
 'previousSibling',
 'previousSiblingGenerator',
 'previous_element',
 'previous_elements',
 'previous_sibling',
 'previous_siblings',
 'pushTag',
 'recursiveChildGenerator',
 'renderContents',
 'replaceWith',
 'replaceWithChildren',
 'replace_with',
 'replace_with_children',
 'reset',
 'select',
 'select_one',
 'setup',
 'string',
 'strings',
 'stripped_strings',
 'tagStack',
 'text',
 'unwrap',
 'wrap']
In [10]:
soup.h1
Out[10]:
<h1>This is Title</h1>
In [11]:
soup.h2  # returns None: the document has no <h2> tag, so nothing is printed
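The attribute listing above is long, but a handful of methods cover most day-to-day work. A minimal sketch of the most common ones, run against a small throwaway document (the variable doc and the class names are just for illustration):

doc = BeautifulSoup('<div><p class="a">one</p><p class="b">two</p></div>',
                    'html.parser')

# find / find_all: first match vs. all matches of a tag
doc.find('p')               # <p class="a">one</p>
doc.find_all('p')           # [<p class="a">one</p>, <p class="b">two</p>]

# select_one / select: CSS selectors
doc.select_one('p.b').text  # 'two'

# get_text: all text in the subtree, concatenated
doc.get_text()              # 'onetwo'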

Parsing an HTML webpage

In [13]:
target_url = 'https://h1ros.github.io/'
response = requests.get(target_url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.text
Out[13]:
'\n\n\n\nHome | Step-by-step Data Science\n\n\n\n\n\n\n\n\n\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'UA-134273341-1\');\n\n\n\n\n\n\n\n\n\nSkip to main content\n\n\n\n\n\nToggle navigation\n\n\n\n\n\nStep-by-step Data Science\n\n\n\n\n\n\nCoding Problems\n\n\nMachine Learning\n\n\nAll \n\n\nAll Post\n\n\nCategories and Tags\n\n\nHistory\n\n\n\n\nRSS\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n Step by Step \n\n\n\n\n\n\n\n\nAbout me\n\nI am Hiro, who is passionate about data science and deep learning. I have a few years of industry and research experinence in machine learning. By explaining things to learn, I would like to accerelate my learning process, but step by step.  \n\t\t\n\n\n\n\nExpected Readers\n\nI mainly write about it for myself but hope it would be benefitial for those who learn machine learning and coding. I am very happy to work and think together if anyone has a question. Feel free to leave a comment or use chat located at the right bottom.  \n\t\t\n\n\n\n\n\n\n\n\n\n\n\nCoding Problem\n"Compare yourself with who you were yesterday"\n\nEvery Sturday I join LeetCode Weekly Contest and improve coding skill by solving coding problems. I know there are a lot of better coders in the world but I compare myself who I was yesterday to move forward.\n\n\n\n\nDynamic Programming\n\n\nBinary Tree\n\nRecursive\n      \t\nLinked List\n      \t\nHeap\n      \n\n\n\n\n\n\nMachine Learning\n"The best way to learn is to explain"\n\nEven if we can use them, we do not fully understand the things. I explain the things I used for my daily job as well as the ones that I would like to learn. \n\n\nPreprocessing\n\n\nVisualization\n\n\nText\n\nAudio\n      \t\nImages\n    \t\nDeep Learning\n\n      \n\n\n\n\n\n\n\n\n\n\n            Contents © 2019         h1ros - Powered by         Nikola\n\n\n\n\n  ((window.gitter = {}).chat = {}).options = {\n    room: \'h1ros-github-io/ama\'\n  };\n\n    moment.locale("en");\n    fancydates(1, "YYYY-MM-DD");\n    \n    baguetteBox.run(\'div#content\', {\n        ignoreClass: \'islink\',\n        captions: function(element) {\n            return element.getElementsByTagName(\'img\')[0].alt;\n    }});\n    window.twttr = (function(d, s, id) {\n  var js, fjs = d.getElementsByTagName(s)[0],\n    t = window.twttr || {};\n  if (d.getElementById(id)) return t;\n  js = d.createElement(s);\n  js.id = id;\n  js.src = "https://platform.twitter.com/widgets.js";\n  fjs.parentNode.insertBefore(js, fjs);\n  t._e = [];\n  t.ready = function(f) {\n    t._e.push(f);\n  };\n  return t;\n}(document, "script", "twitter-wjs"));\n\n\n'
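The raw .text dump above includes inline JavaScript and long runs of blank lines, because <script> contents count as text too. A minimal cleanup sketch: clean is a fresh parse (so the original soup is left untouched), script and style tags are removed, and get_text's separator and strip arguments tidy the whitespace.

clean = BeautifulSoup(response.text, 'html.parser')
for tag in clean(['script', 'style']):  # drop script/style tags so their
    tag.decompose()                     # code does not leak into the text
print(clean.get_text(separator='\n', strip=True)[:200])  # first 200 chars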

Collect all href attributes from <a> tags

In [18]:
all_urls = []
for a in soup.find_all('a'):
    # .get avoids a KeyError for <a> tags that have no href attribute
    all_urls.append(a.get('href'))

all_urls
Out[18]:
['#content',
 '.',
 'categories/coding',
 'categories/machine-learning',
 '#',
 'posts/',
 'categories/',
 'archive.html',
 'rss.xml',
 '.',
 '.',
 'categories/coding/',
 '.',
 'categories/coding/',
 'categories/dynamic-programming',
 'categories/binary-tree',
 'categories/machine-learning/',
 '.',
 'categories/machine-learning/',
 'categories/peprocessing',
 'categories/visualization',
 'categories/text',
 'mailto:data.h1ros@gmail.com',
 'https://getnikola.com']
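Most of these hrefs are relative. To follow them, urllib.parse.urljoin can resolve each one against the page URL; a minimal sketch, reusing target_url from above (absolute_urls is just an illustrative name):

from urllib.parse import urljoin

# resolve relative hrefs (e.g. 'categories/coding') against the base URL
absolute_urls = [urljoin(target_url, u) for u in all_urls]
absolute_urls[:3]
# ['https://h1ros.github.io/#content',
#  'https://h1ros.github.io/',
#  'https://h1ros.github.io/categories/coding']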
