Regex replace text outside script tag

I have this HTML:



"This is simple html text <script language="javascript">simple simple text text</script> text"

I need to match only words that are outside the script tag. For example, if I want to match “simple” and “text”, I should get results only from “This is simple html text” and the final “text”, so the result would be 1 match for “simple” and 2 matches for “text”. Could anyone help me with this? I’m using PHP.

I found a similar answer for matching text outside a tag:

(text|simple)(?![^<]*>|[^<>]*</)

Regex replace text outside html tags
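To make the behaviour of that linked pattern concrete, here is a minimal sketch. The second input string is a made-up example, not something from the original post. The pattern happens to give the wanted result on the sample, but only because it rejects any word that is directly followed by a closing tag, whatever the tag is:

```php
<?php
// Pattern quoted above: reject a word if it sits inside a tag ("...>" ahead
// with no "<" in between) or if a closing tag "</" follows before any other tag.
$pattern = '~(text|simple)(?![^<]*>|[^<>]*</)~';

$sample = 'This is simple html text <script language="javascript">simple simple text text</script> text';
preg_match_all($pattern, $sample, $m);
$sampleMatches = $m[0];   // ['simple', 'text', 'text']

// Hypothetical input: the word is inside an ordinary <b> element, not a script.
$other = '<b>simple</b> text';
preg_match_all($pattern, $other, $m2);
$otherMatches = $m2[0];   // ['text'], the "simple" inside <b> is skipped too

var_export($sampleMatches);
var_export($otherMatches);
```

So the lookahead is not specific to script; it treats every element's trailing text the same way.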

But I couldn't get it to work for a specific tag (script):

(text|simple)(?!(^<script*>)|[^<>]*</)

PS: This question is not a duplicate (strip_tags, remove javascript), because I'm not trying to strip tags or select the content inside the script tag. I'm trying to replace content outside the script tag.

edited Aug 27 '17 at 3:06

asked Aug 26 '17 at 22:16

Paulo A. Costa

8415

Do you absolutely need matching, or will capturing groups do?

– Vivick
Aug 26 '17 at 22:23

When you want to parse html with confidence, use an html parser not regex. SO says this over and over and over. IIRC there is even a note that the SO software pops up that says "don't use regex to parse html".

– mickmackusa
Aug 27 '17 at 2:48

@mickmackusa, but parsers stop working when they hit malformed HTML. I think this question is not a duplicate, because I'm not trying to strip tags; I'm trying to replace content outside the script tag.

– Paulo A. Costa
Aug 27 '17 at 3:00

Retracted dupe link, it is merely related.

– mickmackusa
Aug 27 '17 at 3:56

php html regex preg-replace


4 Answers

My pattern will use (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be matched at every qualifying occurrence.

Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings = [
    'This has no replacements',
    'This simple text has no script tag',
    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',
    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',
    '<script language="javascript">simple simple text text</script> this text starts with a script tag'
];

// Skip whole script elements via (*SKIP)(*FAIL); replace "text" and "simple" everywhere else.
$strings = preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~', '***replaced***', $strings);

var_export($strings);

Output:

array (
  0 => 'This has no replacements',
  1 => 'This ***replaced*** ***replaced*** has no script tag',
  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',
  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',
  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',
)
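Since the question asks for match counts, the same (*SKIP)(*FAIL) idea can be combined with preg_match_all. This is only a sketch: the i and s flags, the \b word boundaries and the [^>]* in the opening tag are additions of mine, on the assumption that script blocks may span several lines and that whole words are wanted.

```php
<?php
// Assumed hardening (not in the answer's pattern): case-insensitive (i),
// dot matches newlines (s), and word boundaries around the target words.
$pattern = '~<script\b[^>]*>.*?</script>(*SKIP)(*FAIL)|\b(?:text|simple)\b~is';

$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

// $matches[0] holds every full match found outside script elements.
preg_match_all($pattern, $html, $matches);

// Lower-case first so "Text" and "text" are counted together.
$counts = array_count_values(array_map('strtolower', $matches[0]));

var_export($counts);   // simple => 1, text => 2
```

An opening tag whose attribute value contains a literal ">" would still confuse this pattern, which is one reason the comments recommend a parser for untrusted markup.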

edited Aug 29 '17 at 0:15

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356


If it's assured that script will be present then simply match with

(.*?)<script.*</script>(.*)

The text outside the tag will appear in submatch 1 and 2. If script is optional then do (.*?)(<script.*</script>)?(.*).
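A rough sketch of how these capture groups could drive a replacement in PHP follows. Note the assumptions: I capture the script element itself as group 2 so it can be put back, which moves the outside text to groups 1 and 3, and I add the s flag and anchors. Because .* between the tags is greedy, this only suits input with a single script element.

```php
<?php
$html  = 'This is simple html text <script language="javascript">simple simple text text</script> text';
$words = ['text', 'simple'];

// Group 1: text before the script, group 2: the script element, group 3: text after it.
if (preg_match('~^(.*?)(<script\b.*</script>)(.*)$~s', $html, $m)) {
    $result = str_replace($words, '***replaced***', $m[1])
            . $m[2]
            . str_replace($words, '***replaced***', $m[3]);
} else {
    // No script element at all: every occurrence is "outside".
    $result = str_replace($words, '***replaced***', $html);
}

echo $result;
```

With more than one script element, everything between the first opening tag and the last closing tag would be treated as script content, so the (*SKIP)(*FAIL) or callback approaches are safer there.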

answered Aug 26 '17 at 22:41

yacc

2,17231329


Here is another solution

([\w\s]*)(?:<script.*?/script>)(.*)$

and here is the demo on https://regex101.com/r/1Lthi8/1

edited Aug 26 '17 at 23:18

answered Aug 26 '17 at 22:49

JBone

3811416

I'm trying to replace a string outside the <script></script> tag.

– Paulo A. Costa
Aug 26 '17 at 23:05

Yes, this is captured in group 1; regex101 highlights "This is simple html text".

– JBone
Aug 26 '17 at 23:08

Match 2 is inside the tag, and the last word "text" is not being selected. Finally, this is trying to ignore all tags, not the specific tag "script".

– Paulo A. Costa
Aug 26 '17 at 23:15

ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions

– JBone
Aug 26 '17 at 23:16

Do you still have questions, or did this solution work?

– JBone
Aug 27 '17 at 12:21


Just an FYI: as far as tags go, it is impossible to ignore a single tag without parsing all tags.

You can SKIP/FAIL past html tags and invisible content.

This will find the words you're looking for.

https://regex101.com/r/7ZGlvW/1

Formatted

    <
    (?:
         (?:
              (?:
                                                 # Invisible content; end tag req'd
                   (                             # (1 start)
                        script
                     |  style
                     |  object
                     |  embed
                     |  applet
                     |  noframes
                     |  noscript
                     |  noembed
                   )                             # (1 end)
                   (?:
                        \s+
                        (?>
                             " [\S\s]*? "
                          |  ' [\S\s]*? '
                          |  (?:
                                  (?! /> )
                                  [^>]
                             )?
                        )+
                   )?
                   \s* >
              )

              [\S\s]*? </ \1 \s*
              (?= > )
         )

      |  (?: /? [\w:]+ \s* /? )
      |  (?:
              [\w:]+
              \s+
              (?:
                   " [\S\s]*? "
                |  ' [\S\s]*? '
                |  [^>]?
              )+
              \s* /?
         )
      |  \? [\S\s]*? \?
      |  (?:
              !
              (?:
                   (?: DOCTYPE [\S\s]*? )
                |  (?: \[CDATA\[ [\S\s]*? \]\] )
                |  (?: -- [\S\s]*? -- )
                |  (?: ATTLIST [\S\s]*? )
                |  (?: ENTITY [\S\s]*? )
                |  (?: ELEMENT [\S\s]*? )
              )
         )
    )
    >
    (*SKIP)
    (?!)
 |
    (?: text | simple )

Or, a much faster approach is to match both tags AND the text you're looking for.

Matching the tags moves past them.

If you're doing a replace, use a callback to determine what to replace.

Group 1 is a TAG or an Invisible Content run.

Group 3 is the words you're looking to replace.

So, in the callback, if group 1 matched, just return group 1.

If group 3 matched, replace with what you want to replace it with.

The regex

https://regex101.com/r/7ZGlvW/2

This regex is comparable to how SAX and DOM parsers parse tags.

I've posted this hundreds of times on SO.

Here is an example of how to remove all html tags:

https://regex101.com/r/oCVkZv/1
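The callback approach described above (match the things to skip and the target words in one pass, then decide in the callback) might look roughly like this in PHP. This is a deliberately simplified sketch, not the full regex from the links: it only recognises script elements, and its group numbering differs (group 1 is the script element, group 2 the word).

```php
<?php
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

$result = preg_replace_callback(
    // Alternative 1 consumes a whole script element; alternative 2 matches a target word.
    '~(<script\b[^>]*>.*?</script>)|\b(text|simple)\b~is',
    function (array $m): string {
        // Group 1 is non-empty only when a script element was matched:
        // hand it back unchanged so nothing inside it is touched.
        if ($m[1] !== '') {
            return $m[1];
        }
        // Otherwise a target word was matched outside any script element.
        return '***replaced***';
    },
    $html
);

echo $result;
```

Because the script element is consumed as a single match, the engine never attempts to match the words inside it, and no backtracking verbs are needed.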

edited Aug 27 '17 at 0:58

answered Aug 27 '17 at 0:26

sln

26.4k31636

This regex works fine but uses a lot of memory, causing these errors. Firefox: "The connection was reset". Chrome: "(net::ERR_CONNECTION_RESET): The connection was reset". IE: "Internet Explorer cannot display the webpage".

– Paulo A. Costa
Aug 28 '17 at 1:22

@PauloACosta - I see you've accepted a skip/fail answer as I originally posted. But, as I said it is impossible to ignore a single tag without parsing all tags. And using skip/fail with my regex will be slower. Where you get that MEMORY problem is not from the regex. Otherwise, for speed, I said not to use skip/fail and instead just match both tags and text you need using my later regex. You made the wrong choice in an answer. That's too bad...

– sln
Aug 28 '17 at 22:04

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f45900099%2fregex-replace-text-outside-script-tag%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

My pattern will use (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be match on every qualifying occurrence.

Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings=['This has no replacements',

    'This simple text has no script tag',

    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',

    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',

    '<script language="javascript">simple simple text text</script> this text starts with a script tag'

];



$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);



var_export($strings);

Output:

array (

  0 => 'This has no replacements',

  1 => 'This ***replaced*** ***replaced*** has no script tag',

  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',

  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',

  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',

)

edited Aug 29 '17 at 0:15

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

add a comment |

My pattern will use (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be match on every qualifying occurrence.

Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings=['This has no replacements',

    'This simple text has no script tag',

    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',

    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',

    '<script language="javascript">simple simple text text</script> this text starts with a script tag'

];



$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);



var_export($strings);

Output:

array (

  0 => 'This has no replacements',

  1 => 'This ***replaced*** ***replaced*** has no script tag',

  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',

  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',

  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',

)

edited Aug 29 '17 at 0:15

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

add a comment |

My pattern will use (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be match on every qualifying occurrence.

Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings=['This has no replacements',

    'This simple text has no script tag',

    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',

    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',

    '<script language="javascript">simple simple text text</script> this text starts with a script tag'

];



$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);



var_export($strings);

Output:

array (

  0 => 'This has no replacements',

  1 => 'This ***replaced*** ***replaced*** has no script tag',

  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',

  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',

  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',

)

edited Aug 29 '17 at 0:15

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

My pattern will use (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be match on every qualifying occurrence.

Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings=['This has no replacements',

    'This simple text has no script tag',

    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',

    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',

    '<script language="javascript">simple simple text text</script> this text starts with a script tag'

];



$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);



var_export($strings);

Output:

array (

  0 => 'This has no replacements',

  1 => 'This ***replaced*** ***replaced*** has no script tag',

  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',

  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',

  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',

)

edited Aug 29 '17 at 0:15

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

edited Aug 29 '17 at 0:15

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

answered Aug 27 '17 at 3:23

mickmackusa

22.8k103356

add a comment |

If it's assured that script will be present then simply match with

(.*?)<script.*</script>(.*)

The text outside the tag will appear in submatch 1 and 2. If script is optional then do (.*?)(<script.*</script>)?(.*).

answered Aug 26 '17 at 22:41

yacc

2,17231329

add a comment |

If it's assured that script will be present then simply match with

(.*?)<script.*</script>(.*)

The text outside the tag will appear in submatch 1 and 2. If script is optional then do (.*?)(<script.*</script>)?(.*).

answered Aug 26 '17 at 22:41

yacc

2,17231329

add a comment |

If it's assured that script will be present then simply match with

(.*?)<script.*</script>(.*)

The text outside the tag will appear in submatch 1 and 2. If script is optional then do (.*?)(<script.*</script>)?(.*).

answered Aug 26 '17 at 22:41

yacc

2,17231329

If it's assured that script will be present then simply match with

(.*?)<script.*</script>(.*)

The text outside the tag will appear in submatch 1 and 2. If script is optional then do (.*?)(<script.*</script>)?(.*).

answered Aug 26 '17 at 22:41

yacc

2,17231329

answered Aug 26 '17 at 22:41

yacc

2,17231329

answered Aug 26 '17 at 22:41

yacc

2,17231329

answered Aug 26 '17 at 22:41

yacc

2,17231329

add a comment |

Here is another solution

([ws]*)(?:<script.*?/script>)(.*)$

and here is the demo on https://regex101.com/r/1Lthi8/1

edited Aug 26 '17 at 23:18

answered Aug 26 '17 at 22:49

JBone

3811416

I´m trying to replace string outside the <script></script> tag.

– Paulo A. Costa
Aug 26 '17 at 23:05

yes, this is captured in group 1 as regex101 highlighted This is simple html text

– JBone
Aug 26 '17 at 23:08

Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".

– Paulo A. Costa
Aug 26 '17 at 23:15

ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions

– JBone
Aug 26 '17 at 23:16

you still have questions or did this solution work?

– JBone
Aug 27 '17 at 12:21

add a comment |

Here is another solution

([ws]*)(?:<script.*?/script>)(.*)$

and here is the demo on https://regex101.com/r/1Lthi8/1

edited Aug 26 '17 at 23:18

answered Aug 26 '17 at 22:49

JBone

3811416

I´m trying to replace string outside the <script></script> tag.

– Paulo A. Costa
Aug 26 '17 at 23:05

yes, this is captured in group 1 as regex101 highlighted This is simple html text

– JBone
Aug 26 '17 at 23:08

Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".

– Paulo A. Costa
Aug 26 '17 at 23:15

ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions

– JBone
Aug 26 '17 at 23:16

you still have questions or did this solution work?

– JBone
Aug 27 '17 at 12:21

add a comment |

Here is another solution

([ws]*)(?:<script.*?/script>)(.*)$

and here is the demo on https://regex101.com/r/1Lthi8/1

edited Aug 26 '17 at 23:18

answered Aug 26 '17 at 22:49

JBone

3811416

Here is another solution

([ws]*)(?:<script.*?/script>)(.*)$

and here is the demo on https://regex101.com/r/1Lthi8/1

edited Aug 26 '17 at 23:18

answered Aug 26 '17 at 22:49

JBone

3811416

edited Aug 26 '17 at 23:18

answered Aug 26 '17 at 22:49

JBone

3811416

answered Aug 26 '17 at 22:49

JBone

3811416

answered Aug 26 '17 at 22:49

JBone

3811416

I´m trying to replace string outside the <script></script> tag.

– Paulo A. Costa
Aug 26 '17 at 23:05

yes, this is captured in group 1 as regex101 highlighted This is simple html text

– JBone
Aug 26 '17 at 23:08

Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".

– Paulo A. Costa
Aug 26 '17 at 23:15

ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions

– JBone
Aug 26 '17 at 23:16

you still have questions or did this solution work?

– JBone
Aug 27 '17 at 12:21

add a comment |

I´m trying to replace string outside the <script></script> tag.

– Paulo A. Costa
Aug 26 '17 at 23:05

yes, this is captured in group 1 as regex101 highlighted This is simple html text

– JBone
Aug 26 '17 at 23:08

Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".

– Paulo A. Costa
Aug 26 '17 at 23:15

ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions

– JBone
Aug 26 '17 at 23:16

you still have questions or did this solution work?

– JBone
Aug 27 '17 at 12:21

I´m trying to replace string outside the <script></script> tag.

– Paulo A. Costa
Aug 26 '17 at 23:05

yes, this is captured in group 1 as regex101 highlighted This is simple html text

– JBone
Aug 26 '17 at 23:08

Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".

– Paulo A. Costa
Aug 26 '17 at 23:15

ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions

– JBone
Aug 26 '17 at 23:16

you still have questions or did this solution work?

– JBone
Aug 27 '17 at 12:21

add a comment |

Just an fyi, as far as tags go, it is impossible to ignore a single tag

without parsing all tags.

You can SKIP/FAIL past html tags and invisible content.

This will find the words you're looking for.

https://regex101.com/r/7ZGlvW/1

Formated

    <

    (?:

         (?:

              (?:

                                                 # Invisible content; end tag req'd

                   (                             # (1 start)

                        script

                     |  style

                     |  object

                     |  embed

                     |  applet

                     |  noframes

                     |  noscript

                     |  noembed 

                   )                             # (1 end)

                   (?:

                        s+ 

                        (?>

                             " [Ss]*? "

                          |  ' [Ss]*? '

                          |  (?:

                                  (?! /> )

                                  [^>] 

                             )?

                        )+

                   )?

                   s* >

              )



              [Ss]*? </ 1 s* 

              (?= > )

         )



      |  (?: /? [w:]+ s* /? )

      |  (?:

              [w:]+ 

              s+ 

              (?:

                   " [Ss]*? " 

                |  ' [Ss]*? ' 

                |  [^>]? 

              )+

              s* /?

         )

      |  ? [Ss]*? ?

      |  (?:

              !

              (?:

                   (?: DOCTYPE [Ss]*? )

                |  (?: [CDATA[ [Ss]*? ]] )

                |  (?: -- [Ss]*? -- )

                |  (?: ATTLIST [Ss]*? )

                |  (?: ENTITY [Ss]*? )

                |  (?: ELEMENT [Ss]*? )

              )

         )

    )

    >

    (*SKIP)

    (?!)

 |  

    (?: text | simple )

Or, a much faster approach is to match both tags AND the text you're

looking for.

Matching the tags moves past them.

If you're doing a replace, use a callback to determine what to replace.

Group 1 is a TAG or an Invisible Content run.

Group 3 is the words you're looking to replace.

So, in the callback, if group 1 matched, just return group 1.

If group 3 matched, replace with what you want to replace it with.

The regex

https://regex101.com/r/7ZGlvW/2

This regex is comparable to how SAX and DOM parsers parse tags.

I've posted this hundreds of times on SO.

Here is an example of how to remove all html tags:

https://regex101.com/r/oCVkZv/1

edited Aug 27 '17 at 0:58

answered Aug 27 '17 at 0:26

sln

26.4k31636

This regEx works fine, but use a lot of memory, causing the error: Firefox: The connection was reset Chrome: (net::ERR_CONNECTION_RESET): The connection was reset. IE: Internet Explorer cannot display the webpage

– Paulo A. Costa
Aug 28 '17 at 1:22

@PauloACosta - I see you've accepted a skip/fail answer as I originally posted. But, as I said it is impossible to ignore a single tag without parsing all tags. And using skip/fail with my regex will be slower. Where you get that MEMORY problem is not from the regex. Otherwise, for speed, I said not to use skip/fail and instead just match both tags and text you need using my later regex. You made the wrong choice in an answer. That's too bad...

– sln
Aug 28 '17 at 22:04

add a comment |

Just an fyi, as far as tags go, it is impossible to ignore a single tag

without parsing all tags.

You can SKIP/FAIL past html tags and invisible content.

This will find the words you're looking for.

https://regex101.com/r/7ZGlvW/1

Formated

    <

    (?:

         (?:

              (?:

                                                 # Invisible content; end tag req'd

                   (                             # (1 start)

                        script

                     |  style

                     |  object

                     |  embed

                     |  applet

                     |  noframes

                     |  noscript

                     |  noembed 

                   )                             # (1 end)

                   (?:

                        s+ 

                        (?>

                             " [Ss]*? "

                          |  ' [Ss]*? '

                          |  (?:

                                  (?! /> )

                                  [^>] 

                             )?

                        )+

                   )?

                   s* >

              )



              [Ss]*? </ 1 s* 

              (?= > )

         )



      |  (?: /? [w:]+ s* /? )

      |  (?:

              [w:]+ 

              s+ 

              (?:

                   " [Ss]*? " 

                |  ' [Ss]*? ' 

                |  [^>]? 

              )+

              s* /?

         )

      |  ? [Ss]*? ?

      |  (?:

              !

              (?:

                   (?: DOCTYPE [Ss]*? )

                |  (?: [CDATA[ [Ss]*? ]] )

                |  (?: -- [Ss]*? -- )

                |  (?: ATTLIST [Ss]*? )

                |  (?: ENTITY [Ss]*? )

                |  (?: ELEMENT [Ss]*? )

              )

         )

    )

    >

    (*SKIP)

    (?!)

 |  

    (?: text | simple )

Or, a much faster approach is to match both tags AND the text you're

looking for.

Matching the tags moves past them.

If you're doing a replace, use a callback to determine what to replace.

Group 1 is a TAG or an Invisible Content run.

Group 3 is the words you're looking to replace.

So, in the callback, if group 1 matched, just return group 1.

If group 3 matched, replace with what you want to replace it with.

The regex

https://regex101.com/r/7ZGlvW/2

This regex is comparable to how SAX and DOM parsers parse tags.

I've posted this hundreds of times on SO.

Here is an example of how to remove all html tags:

https://regex101.com/r/oCVkZv/1

edited Aug 27 '17 at 0:58

answered Aug 27 '17 at 0:26

sln

26.4k31636

This regEx works fine, but use a lot of memory, causing the error: Firefox: The connection was reset Chrome: (net::ERR_CONNECTION_RESET): The connection was reset. IE: Internet Explorer cannot display the webpage

– Paulo A. Costa
Aug 28 '17 at 1:22

@PauloACosta - I see you've accepted a skip/fail answer as I originally posted. But, as I said it is impossible to ignore a single tag without parsing all tags. And using skip/fail with my regex will be slower. Where you get that MEMORY problem is not from the regex. Otherwise, for speed, I said not to use skip/fail and instead just match both tags and text you need using my later regex. You made the wrong choice in an answer. That's too bad...

– sln
Aug 28 '17 at 22:04


