Plain Text vs innerText vs textContent

September 1st, 2010 by Mike Wilcox

innerText and textContent are properties that get or set the text of an element and all of its children. Internet Explorer implemented innerText in version 4.0, and it’s a useful, if misunderstood, feature. WebKit also has innerText, carefully copying from, and even improving upon, IE; it additionally has the standards-compliant textContent, which, as we shall see, is nowhere near as useful and is in fact quite different. Firefox has textContent but not innerText, and a common mistake is writing code that retrieves one or the other, assuming the result will be the same (it’s not). Opera has innerText, but it is little more than an alias of textContent, which to me is analogous to false advertising.

The Basics

These properties are most commonly used while working on a rich text editor, when you need to “get the plain text”, or for other functional reasons. innerText and textContent work well when retrieving the contents of a single node. And there is no issue with setting either property, since they effectively wipe out all the child nodes and replace them with plain text, as shown in these simple examples:


<div id="myDiv">Hello World</div>
var node = document.getElementById("myDiv");
alert(node.textContent || node.innerText); // Hello World
// Check that the property exists; testing its value would fail on an empty node
if(typeof node.innerText == "string"){
    node.innerText = "Goodbye cruel world";
}else{
    node.textContent = "Goodbye cruel world";
}

And using the above method can be an effective way of preventing script injections with user-entered text:


var userComment =
    '<a onclick="doSomethingBad();" href="#">See this interesting site</a>';
if(typeof node.innerText == "string"){
    node.innerText = userComment;
}else{
    node.textContent = userComment;
}
// result (shown on the page as literal text, not as a link):
// <a onclick="doSomethingBad();" href="#">See this interesting site</a>

By not using innerHTML or something similar, the text is shown literally and not parsed, so any sneak attack is ineffective, and in fact, quite visible.
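
Since the same if/else appears in both examples, it could be wrapped in a small helper. The following is only a sketch: setPlainText is a hypothetical name that does not come from any library, and the plain objects at the bottom stand in for DOM nodes so the snippet also runs outside a browser.

```javascript
// Hypothetical helper (not part of any library): write text into a node
// using whichever plain-text property the browser supports.
function setPlainText(node, text) {
    // Test that the property exists rather than testing its value:
    // an empty element has innerText === "", which is falsy.
    if (typeof node.innerText === "string") {
        node.innerText = text;
    } else {
        node.textContent = text;
    }
    return node;
}

// Stand-ins for DOM nodes (assumption: a real node exposes at least one
// of these two properties as a string).
var ieLikeNode = { innerText: "" };
var ffLikeNode = { textContent: "Hello World" };

setPlainText(ieLikeNode, '<a onclick="doSomethingBad();" href="#">link</a>');
setPlainText(ffLikeNode, "Goodbye cruel world");
```

In a browser the call would simply be setPlainText(document.getElementById("myDiv"), userComment).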

The Problem

The primary focus of this post is in getting innerText/textContent from multiple nodes. Take the following example:


<div id="example1">
	<p>para1</p><p>para2</p>
</div>

IE’s innerText predictably shows “para1” and “para2” with a line break in between. But Firefox’s textContent does not:


// IE:
para1
para2
//FF:
  							para1para2
 

Whoa! What gives? The W3C explanation of textContent is as follows:

On getting, no serialization is performed, the returned string does not contain any markup. No whitespace normalization is performed and the returned string does not contain the white spaces in element content.

Because “no whitespace normalization is performed”, what textContent is essentially doing is acting like a PRE element. The markup is stripped, but otherwise what we get is exactly what was in the HTML document — including tabs, spaces, lack of spaces, and line breaks. It’s getting the source code from the HTML! What good this is, I really don’t know.
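
To make that behavior concrete, here is a rough sketch of what textContent amounts to: the descendant text nodes glued together with nothing in between. The names rawText, el, and txt are made up for this illustration, and el and txt build plain-object stand-ins for DOM nodes so the snippet runs without a browser.

```javascript
// Hypothetical stand-ins for DOM nodes, so the sketch runs anywhere.
function el(tagName, childNodes) {
    return { nodeType: 1, tagName: tagName, childNodes: childNodes };
}
function txt(value) {
    return { nodeType: 3, nodeValue: value };
}

// Roughly what textContent does: concatenate every descendant text node,
// adding no separators and normalizing no whitespace.
function rawText(node) {
    if (node.nodeType === 3) {      // text node: returned untouched
        return node.nodeValue;
    }
    if (node.nodeType !== 1) {      // comments and the like contribute nothing
        return "";
    }
    var out = "";
    for (var i = 0; i < node.childNodes.length; i++) {
        out += rawText(node.childNodes[i]);
    }
    return out;
}

// The "example1" markup from above, including its source whitespace.
var example1 = el("DIV", [
    txt("\n\t"),
    el("P", [txt("para1")]),
    el("P", [txt("para2")]),
    txt("\n")
]);

console.log(JSON.stringify(rawText(example1))); // "\n\tpara1para2\n"
```

The line break and tab from the source survive, while the paragraph boundary between para1 and para2 leaves no trace.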

The Test

The previous example was quite simple. Here is some markup that touches on all areas affected by this issue:


<div id="example">
	<table>
		<tr>
			<td>TL</td>
			<td>TR</td>
		</tr>
		<tr>
			<td>BL</td>
			<td>BR</td>
		</tr>
	</table>
	<p>
		Start a paragraph, <span>then a span</span> end para.
	</p>
	<p>
		New Para then    4 spaces, then a line<br/>break.
	</p>
	<p style="display:none;">
		This should not show (display:none).
	</p>
	<!-- <p>This should not show (comment).</p>-->
	<p>
		This is text before<span style="display:block;">a span
                with display:block.</span>...and after.
	</p>
	<p id="preNode" style="white-space:pre;">
This			text
		is pre
	text.
	</p>
	<script>var a = "Mike"</script>
	<ul>
		<li>one bullet</li>
		<li>two bullet</li>
		<li>three bullet</li>
	</ul>
</div>

The following are three screen shots from the results of performing innerText with Safari and IE and textContent with Firefox. IE6 & IE7 were the same, and only slightly different from IE8. Chrome was the same as Safari. Opera was the same as Firefox. The major differences are shown below:

[Screenshot: innerText textContent comparison]

As you may have guessed, the results of grabbing the text with textContent and with innerText differ wildly. But worse, innerText varies between WebKit and IE, and there are even differences between IE7 and IE8. WebKit places tabs between the TDs whereas IE doesn’t even use a space. WebKit does not use the text marked display:none whereas IE does. WebKit is also more liberal with line breaks. And IE’s handling of PRE text is so brittle and buggy that it can’t be used, which seems kind of ironic in this situation.

Note that textContent returns exactly what is in the markup, so theoretically we have control over it. If the page were written with no tabs and no line breaks, we’d be golden!! But we’d also be in maintenance hell, so that’s not worth considering.

The Solution

The original idea I had was to “fix” textContent and make it look like the results of innerText. However, as the test showed, the differences between WebKit and IE made that a bad idea since they were not the same. So I opted for a solution that returned “plain text”, with the measurement of success being not so much matching innerText, but looking consistent across browsers.

What triggered this latest attempt was the realization that the results are keyed off of the node’s display style. Block styles should get a line break, and inline styles should get a space.

The code recursively loops through all of the child nodes, grabbing their text, noting what type of display is used, and adding line breaks for blocks and spaces for inline. This was pretty close, but of course, the browsers will fight you every step of the way. If display was not specifically set in the node’s cssText, IE7 would return block for everything, and WebKit would return an empty string for everything. Firefox and IE8 both properly returned the appropriate display type for the node used (a rare case of browser unity!). So my code exhausts all efforts to get the proper display type, and then resorts to mapping display types to tagNames, which should work in most cases, since if the style is not set, the type should be predictable. There are only three block elements in this case: LI, TR, and P. All others are either inline or are not relevant.
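
The tagName fallback might look something like the sketch below. It is a simplification and not the code from getPlainText(): it trusts only a display value set directly on the node and otherwise goes straight to the tag map, leaving out the computed-style lookups because those are the calls that misbehaved in IE7 and WebKit. The names resolveDisplay and BLOCK_TAGS are hypothetical.

```javascript
// Tag-to-display fallback. The three block tags are the ones named above;
// every other tag is treated as inline.
var BLOCK_TAGS = { LI: true, TR: true, P: true };

// Hypothetical helper: returns the display value set directly on the node
// if there is one, otherwise "block" or "inline" based on the tag name.
function resolveDisplay(node) {
    var explicit = node.style && node.style.display;
    if (explicit) {
        return explicit;            // e.g. "none", "block", "inline"
    }
    var tag = String(node.tagName || "").toUpperCase();
    return BLOCK_TAGS[tag] === true ? "block" : "inline";
}

// Plain objects standing in for DOM elements (assumption: real elements
// expose tagName and style.display the same way).
console.log(resolveDisplay({ tagName: "P", style: { display: "" } }));         // block
console.log(resolveDisplay({ tagName: "SPAN", style: { display: "block" } })); // block
console.log(resolveDisplay({ tagName: "TD", style: { display: "" } }));        // inline
```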

And that’s the heart of the code, though there is a bit more in the details. script elements and those with display:none are ignored. A lot of spaces and line breaks are added and doubled up, so when the text is done, I run some regular expressions on it to clean that up. After P elements there should be a double line break (based on WebKit’s innerText results and the fact that it makes the result cleaner). A “NEWLINE” placeholder string is used so that the double line break is not removed by the regular expressions. A quick search and replace was used for BR elements, since they are ignored by textContent. And finally, PRE text is normalized like the rest of the plain text. It made a certain amount of sense that you would want the formatting to match everything else; and as I noted earlier, I attempted to get IE to keep the PRE format, but ended up just chasing my tail while observing a new batch of IE bugginess.
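
Put together, the steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual getPlainText() source: plainTextSketch, collect, cleanUp, el, and txt are made-up names, BR is handled inside the walk instead of by a search and replace, and display is read only from an explicit style or the tag map. The el and txt helpers build plain-object stand-ins for DOM nodes so the snippet runs without a browser.

```javascript
var BLOCK_TAGS = { LI: true, TR: true, P: true };
var NEWLINE = "NEWLINE"; // placeholder; text that literally contains this word would be mangled

// Hypothetical stand-ins for DOM nodes.
function el(tagName, childNodes, display) {
    return { nodeType: 1, tagName: tagName, childNodes: childNodes,
             style: { display: display || "" } };
}
function txt(value) {
    return { nodeType: 3, nodeValue: value };
}

function displayOf(node) {
    var explicit = node.style && node.style.display;
    return explicit || (BLOCK_TAGS[node.tagName] === true ? "block" : "inline");
}

// Recursively gather text, padding blocks with line breaks and inlines with spaces.
function collect(node) {
    if (node.nodeType === 3) {
        return node.nodeValue.replace(/\s+/g, " "); // source whitespace (PRE included) becomes one space
    }
    if (node.nodeType !== 1 || node.tagName === "SCRIPT") {
        return "";                                  // comments and scripts are ignored
    }
    if (node.tagName === "BR") {
        return "\n";
    }
    var display = displayOf(node);
    if (display === "none") {
        return "";
    }
    var text = "";
    for (var i = 0; i < node.childNodes.length; i++) {
        text += collect(node.childNodes[i]);
    }
    if (display === "block") {
        // P gets the placeholder so its double break survives the cleanup
        return "\n" + text + (node.tagName === "P" ? NEWLINE : "\n");
    }
    return " " + text + " ";
}

// Regular-expression cleanup of the doubled-up spaces and line breaks.
function cleanUp(text) {
    return text
        .replace(/ +/g, " ")                  // collapse runs of spaces
        .replace(/\s*NEWLINE\s*/g, NEWLINE)   // drop whitespace around placeholders
        .replace(/ *\n */g, "\n")             // drop spaces around line breaks
        .replace(/\n+/g, "\n")                // collapse repeated line breaks
        .replace(/(?:NEWLINE)+/g, "\n\n")     // placeholders become double breaks
        .replace(/^\s+|\s+$/g, "");           // trim both ends
}

function plainTextSketch(node) {
    return cleanUp(collect(node));
}

var tree = el("DIV", [
    txt("\n\t"),
    el("P", [txt("para1")]),
    el("P", [txt("hidden")], "none"),
    el("SCRIPT", [txt("var a = 1")]),
    el("P", [txt("line"), el("BR", []), txt("break.")]),
    txt("\n")
]);
console.log(JSON.stringify(plainTextSketch(tree))); // "para1\n\nline\nbreak."
```

Normalizing whitespace inside each text node first means that every remaining line break is one the walk added on purpose, which keeps the cleanup expressions short.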

The final results are indeed consistent across browsers:

Conclusion

So as we have learned, textContent and innerText are not created equal and are in fact worlds apart. I even helped propose to the HTML5 WHATWG that innerText become a standard. In the meantime, hopefully my getPlainText() function will be of help. To see the test in action on your browser, visit the examples page here. If you find any bugs or have suggestions on how to improve it, please let me know in the comments or by dropping me an email.


5 Responses to “Plain Text vs innerText vs textContent”

  1. Derek says:

    Great article. Thanks.

  2. Mike Pirnat says:

    Thank you, thank you, thank you. I’m not used to swearing so loudly at Firefox–usually IE is the only browser with such special needs.

  3. Thorsten Petersen says:

    Thanks a lot. That is what I searched for.
    But I have a little problem. When I use tags , the plain text not contain what in
    the code tags is written, it returns the text in it plus
    arround it.

    How can I solve this problem?

  4. Pekka says:

    Thank you very much, saves me hours of work! The only problem with this code seems to be that it wont allow real double-linebreaks, I want the user to be able to add a linebreak like

    this, where there is an empty line between. Is it possible and if yes then please can you help me, how?