How to convert HTML to plain text using JavaScript?
This article discusses how to convert HTML to formatted plain text using JavaScript, including methods such as using regular expressions, DOM parsing, and third-party libraries.
HTML and plain text are both ways of representing web content, but they differ significantly. HTML is a markup language that defines the structure and content of web pages, while plain text contains no formatting or markup at all. HTML makes pages attractive and interactive, but sometimes the markup gets in the way. In web crawling and data analysis, for example, HTML extracted from web pages usually has to be converted to plain text before it can be processed and analyzed. This article introduces how to do that conversion with JavaScript.
The principle of converting HTML to plain text
To understand how to convert HTML to plain text, you first need to understand how the two differ. An HTML document is made up of elements, written as tags that may carry attributes, wrapped around the text content. Plain text is only the text content, with no tags. The basic idea of the conversion is therefore to remove the HTML tags and keep only the text between them.
Using JavaScript to Convert HTML to Plain Text
When using JavaScript to convert HTML to plain text, there are many methods to choose from. Here are some common methods:
1. Use Regular Expressions
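A regular expression is a text-matching tool that can be used to find and delete HTML tags. This method is simple and direct, and it suits simple, trusted HTML. It does not decode character entities such as &amp;, and it leaves the contents of <script> and <style> elements in the output.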
Example code:
function htmlToPlainText(html) {
  // Delete everything from a "<" up to the next ">".
  return html.replace(/<[^>]*>/g, '');
}
const htmlString = '<p>This is <strong>bold</strong> and <em>italic</em>.</p>';
const plainText = htmlToPlainText(htmlString);
console.log(plainText); // This is bold and italic.
In this example, the htmlToPlainText function uses the regular expression /<[^>]*>/g to match and remove HTML tags, thereby converting HTML to plain text. The parts of the expression have the following meaning:
- < : Matches a left angle bracket, which marks the beginning of an HTML tag.
- [^>] : Matches any character except the right angle bracket. The ^ here means negation.
- * : Matches the preceding expression zero or more times, that is, any number of characters other than the right angle bracket.
- > : Matches a right angle bracket, which marks the end of the HTML tag.
- / : The slashes are the delimiters of the regular expression.
- g : The global flag, which matches every occurrence in the string, not just the first one.
Thus, the whole regular expression matches an HTML tag: a left angle bracket <, followed by any number of characters other than the right angle bracket, followed by a right angle bracket >.
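The pattern above only deletes tags, so the text inside <script> or <style> elements and entities such as &amp; pass through unchanged. The following is a minimal sketch of a slightly more defensive variant, not a complete solution. The function name stripHtml and the small entity table are assumptions made for illustration. It still breaks on input such as a > character inside an attribute value, which is one reason the DOM-based method is usually preferable.

```javascript
// Sketch only: stripHtml and ENTITIES are illustrative names, not a standard API.
// The table covers just a few common entities.
const ENTITIES = {
  '&nbsp;': ' ',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&amp;': '&',
};

function stripHtml(html) {
  return html
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Remove all remaining tags.
    .replace(/<[^>]*>/g, '')
    // Decode the known entities in a single pass, so "&amp;lt;" becomes "&lt;" and not "<".
    .replace(/&(?:nbsp|lt|gt|quot|#39|amp);/g, (match) => ENTITIES[match]);
}

console.log(stripHtml('<style>p{color:red}</style><p>Tom &amp; Jerry &lt;3</p>'));
// Tom & Jerry <3
```

Decoding happens after the tags are removed, so a decoded < or > is never mistaken for markup.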
2. Use DOM Parsing
The DOM (Document Object Model) is the standard interface through which JavaScript manipulates HTML documents. Instead of matching patterns, you can let the browser parse the HTML string into DOM nodes and then read their text content. This method is more flexible and is suitable for complex HTML structures.
Example code:
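function htmlToPlainText(html) {
  // Let the browser parse the markup inside a detached element.
  const temp = document.createElement('div');
  temp.innerHTML = html;
  // Prefer innerText; fall back to textContent if it is empty or missing.
  return temp.innerText || temp.textContent;
}
const htmlString = '<p>This is <strong>bold</strong> and <em>italic</em>.</p>';
const plainText = htmlToPlainText(htmlString);
console.log(plainText); // This is bold and italic.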
In this example, we first create a temporary <div> element and assign the HTML string to its innerHTML property, which makes the browser parse the markup. We then read the plain text content through the innerText property, falling back to textContent if innerText is empty or unavailable.
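Reading the text of a single element runs paragraphs together, and assigning untrusted markup to innerHTML can still trigger inline event handlers such as onerror on an image. As a hedged alternative, the sketch below parses the string with the browser's DOMParser, which builds a separate document, and then walks the tree to insert a line break after block-level elements. The names collectText, htmlToFormattedText and BLOCK_TAGS, and the choice of tags in that set, are assumptions made for illustration.

```javascript
// Sketch only: these names and the tag list are illustrative, not a standard API.
const BLOCK_TAGS = new Set(['P', 'DIV', 'BR', 'LI', 'H1', 'H2', 'H3', 'TR']);

function collectText(node) {
  // nodeType 3 is a text node: return its text unchanged.
  if (node.nodeType === 3) return node.nodeValue;
  // Ignore everything that is not an element (comments, for example).
  if (node.nodeType !== 1) return '';
  // Script and style contents are not meant to be read as text.
  if (node.nodeName === 'SCRIPT' || node.nodeName === 'STYLE') return '';

  let text = '';
  for (const child of Array.from(node.childNodes)) {
    text += collectText(child);
  }
  // End block-level elements with a line break so paragraphs stay apart.
  return BLOCK_TAGS.has(node.nodeName) ? text + '\n' : text;
}

function htmlToFormattedText(html) {
  // DOMParser is a browser API; it is not defined in plain Node.js.
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return collectText(doc.body).trim();
}

// Only run the demo where DOMParser exists (that is, in a browser).
if (typeof DOMParser !== 'undefined') {
  console.log(htmlToFormattedText('<p>First <strong>bold</strong></p><p>Second</p>'));
}
```

Because collectText only relies on nodeType, nodeName, nodeValue and childNodes, it can be applied to any DOM node, including one obtained from the current page.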
3. Use Third-Party Libraries
In addition to native JavaScript methods, third-party libraries can simplify the conversion from HTML to plain text. For example, the jQuery library makes it easier to create, select, and manipulate HTML elements.
Example code:
function htmlToPlainText(html) {
  // Build a detached <div>, fill it with the markup, then read its text.
  return $('<div>').html(html).text();
}
const htmlString = '<p>This is <strong>bold</strong> and <em>italic</em>.</p>';
const plainText = htmlToPlainText(htmlString);
console.log(plainText); // This is bold and italic.
In this example, we create a temporary <div> element with jQuery, use the html() method to insert the HTML string into it, and then use the text() method to get the plain text content. Remember to load jQuery in the page before this code runs.
Summary
This article showed three ways to convert HTML to plain text with JavaScript: regular expressions, DOM parsing, and third-party libraries such as jQuery. Regular expressions are the quickest to write but the least robust, while the DOM-based methods leave the parsing to the browser. Converting HTML to plain text can improve the readability and accessibility of content, and it makes the content easier to reuse in other application scenarios.