How to Parse HTML in Node.js

In web development, parsing HTML is a common task, especially when we need to extract data or manipulate the DOM from a webpage. Mastering various ways to parse HTML in Node.js can significantly improve the efficiency of extracting and processing webpage data. This article will introduce how to parse HTML in Node.js.

Basic Concepts

HTML parsing refers to converting HTML text into operable data structures, usually the DOM (Document Object Model). The DOM is a tree-like structure that represents the structure and content of a webpage, allowing us to manipulate and modify the webpage using JavaScript.

Common HTML Parsing Methods

Here are some commonly used HTML parsing methods in Node.js:

Cheerio

Cheerio is a jQuery-like library that can use CSS selectors to parse HTML and manipulate the DOM on the server side. It is suitable for parsing static HTML pages.

jsdom

jsdom is a library that emulates a DOM environment in Node.js. It can parse and manipulate HTML while also supporting many features found in a browser environment such as event handling and asynchronous requests.

htmlparser2

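htmlparser2 is a fast, event-based HTML parser: it reads the markup as a stream and invokes callbacks as it encounters opening tags, text, and closing tags, and it can also build a DOM tree when needed. It is commonly used to process large HTML documents or streaming data.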

Practical Example: Parsing HTML with Cheerio

Here is a practical example of using Cheerio to parse HTML, including basic routing and request handling. Make sure Node.js and npm are installed in your development environment.

1. First, create a new folder and run the following command to initialize the project:

npm init -y

2. Install the required dependencies:

npm install express cheerio axios

3. Create a file named index.js and write the following code:

const express = require('express');
const axios = require('axios'); 
const cheerio = require('cheerio');

const app = express();
const PORT = 3000;

app.get('/', async (req, res) => {
  try {
    // Use Axios to make a GET request for the HTML content of a webpage  
    const response = await axios.get('https://example.com');
    
    const html = response.data;  
   
    const $ = cheerio.load(html);
    
    // Use Cheerio selectors to get title and extract text  
    const title = $('title').text();
           
    res.send(`Title: ${title}`);  
  } catch (error) {
    console.error(error);
    res.status(500).send('An error occurred');  
  }
});

app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

The key steps in the code are:

  • Make a GET request with axios.get() to fetch the HTML content of the webpage.
  • Call cheerio.load(html) to create the $ function used for selecting DOM elements.
  • Use a jQuery-like selector, $('title'), and .text() to get the text content of the <title> element.
  • Finally, send the extracted title to the client as the response.

In this example, we use Express to create a simple server. When the root route is requested, the server fetches the HTML content of a webpage with Axios, then parses it and extracts the title with Cheerio. Start the server with node index.js, then visit http://localhost:3000/ in a browser or API tool to see the response.

Tips, Tricks, and Considerations

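  • Cheerio, jsdom, and htmlparser2 each have a different API and feature set, so check the documentation of the library you choose before relying on a particular feature.
  • For complex dynamic pages whose content is rendered by client-side JavaScript, consider a tool such as Puppeteer, which drives a real headless browser. Cheerio and htmlparser2 do not execute scripts, so they only see the HTML that the server returns.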

Conclusion

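The Node.js ecosystem offers several libraries for HTML parsing, including Cheerio, jsdom, and htmlparser2. Cheerio suits quick extraction from static HTML, jsdom suits code that needs a browser-like DOM, and htmlparser2 suits large or streamed documents. Choosing the library that fits your needs makes it easier to manipulate and extract webpage content.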

Article Link: https://zguyun.com/blog/how-to-parse-html-in-nodejs/
ZGY: Share more interesting programming and AI knowledge.