Regex replace text outside script tag
I have this HTML:
"This is simple html text <script language="javascript">simple simple text text</script> text"
I need to match only words that are outside script tag. I mean if I want to match “simple” and “text” I should get the results only from “This is simple html text” and the last part “text” — the result will be “simple” 1 match, “text” 2 matches. Could anyone help me with this? I’m using PHP.
I found a similar answer for matching text outside a tag:
(text|simple)(?![^<]*>|[^<>]*</)
Regex replace text outside html tags
But I couldn't get it to work for a specific tag (script):
(text|simple)(?!(^<script*>)|[^<>]*</)
PS: This question is not a duplicate (strip_tags, remove javascript), because I'm not trying to strip tags or select the content inside the script tag. I'm trying to replace content outside the script tag.
php html regex preg-replace
Do you absolutely need matching, or will capturing groups do?
– Vivick
Aug 26 '17 at 22:23
When you want to parse html with confidence, use an html parser not regex. SO says this over and over and over. IIRC there is even a note that the SO software pops up that says "don't use regex to parse html".
– mickmackusa
Aug 27 '17 at 2:48
@mickmackusa, but parsers stop working when they parse malformed HTML. I think this question is not a duplicate, because I'm not trying to strip tags; I'm trying to replace content outside the script tag.
– Paulo A. Costa
Aug 27 '17 at 3:00
Retracted dupe link, it is merely related.
– mickmackusa
Aug 27 '17 at 3:56
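The parser route suggested in the comments can be sketched as follows. This is a minimal illustration, not code from the thread: it assumes PHP's bundled DOM extension, the function name countWordsOutsideScript is hypothetical, and the \b word boundaries are an added assumption about what counts as a "word". Parser warnings are silenced with libxml_use_internal_errors so that sloppy markup does not interrupt the run.

```php
<?php
// Count target words only in text nodes that have no <script> ancestor.
// countWordsOutsideScript is a hypothetical helper name.
function countWordsOutsideScript(string $html, array $words): array
{
    $counts = array_fill_keys($words, 0);

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every text node that is not inside a <script> element.
    foreach ($xpath->query('//text()[not(ancestor::script)]') as $node) {
        foreach ($words as $word) {
            // The \b boundaries are an assumption; drop them to match substrings.
            $pattern = '~\b' . preg_quote($word, '~') . '\b~';
            $counts[$word] += preg_match_all($pattern, $node->nodeValue);
        }
    }
    return $counts;
}

$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';
print_r(countWordsOutsideScript($html, ['simple', 'text']));
```

The same loop could assign a replaced string back to $node->nodeValue instead of counting, at the cost of having to serialize the document again afterwards.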
asked Aug 26 '17 at 22:16 by Paulo A. Costa, edited Aug 27 '17 at 3:06
4 Answers
My pattern uses (*SKIP)(*FAIL) to disqualify matched script tags and their contents. text and simple will be matched on every qualifying occurrence.
Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~
Code:
$strings=['This has no replacements',
'This simple text has no script tag',
'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',
'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',
'<script language="javascript">simple simple text text</script> this text starts with a script tag'
];
$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);
var_export($strings);
Output:
array (
0 => 'This has no replacements',
1 => 'This ***replaced*** ***replaced*** has no script tag',
2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',
3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',
4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',
)
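For the counting case described in the question ("simple" 1 match, "text" 2 matches), the same skip/fail idea can be paired with preg_match_all. This is a sketch under assumptions: the \b after script and the i and s flags are additions that are not in the pattern above (they allow upper-case tags and script bodies that span several lines).

```php
<?php
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

// Script blocks are matched first and then discarded by (*SKIP)(*FAIL),
// so only occurrences outside them are reported as matches.
// The i and s flags are assumptions added for this sketch.
$pattern = '~<script\b.*?</script>(*SKIP)(*FAIL)|text|simple~is';

preg_match_all($pattern, $html, $matches);

// Tally how often each word was matched, ignoring case.
$counts = array_count_values(array_map('strtolower', $matches[0]));
print_r($counts);
```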
If it's assured that script will be present, then simply match with
(.*?)<script.*</script>(.*)
The text outside the tag will appear in submatches 1 and 2. If script is optional, then use (.*?)(<script.*</script>)?(.*).
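A variation on this capture idea that also copes with several script blocks is to split the string on the blocks and rewrite only the pieces between them. This is a sketch and not the answerer's method: the function name replaceOutsideScript is hypothetical, and the lazy .*? and the i and s flags are assumptions.

```php
<?php
// replaceOutsideScript is a hypothetical helper name.
function replaceOutsideScript(string $html, string $wordPattern, string $replacement): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured script blocks land at odd
    // indexes and the text around them at even indexes.
    $parts = preg_split('~(<script\b.*?</script>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Only text outside script blocks is rewritten.
            $parts[$i] = preg_replace($wordPattern, $replacement, $part);
        }
    }
    return implode('', $parts);
}

echo replaceOutsideScript(
    'This is simple html text <script language="javascript">simple simple text text</script> text',
    '~text|simple~',
    '***replaced***'
);
```

Empty pieces are kept on purpose (no PREG_SPLIT_NO_EMPTY), because dropping them would break the odd/even bookkeeping when a string starts or ends with a script block.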
Here is another solution
([\w\s]*)(?:<script.*?/script>)(.*)$
and here is the demo on https://regex101.com/r/1Lthi8/1
I'm trying to replace strings outside the <script></script> tag.
– Paulo A. Costa
Aug 26 '17 at 23:05
yes, this is captured in group 1, as regex101 highlighted: This is simple html text
– JBone
Aug 26 '17 at 23:08
Match 2 is inside the tag, and the last word "text" is not being selected. Finally, this is trying to ignore all tags, not the specific tag "script".
– Paulo A. Costa
Aug 26 '17 at 23:15
ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions
– JBone
Aug 26 '17 at 23:16
you still have questions or did this solution work?
– JBone
Aug 27 '17 at 12:21
Just an fyi, as far as tags go, it is impossible to ignore a single tag
without parsing all tags.
You can SKIP/FAIL past html tags and invisible content.
This will find the words you're looking for.
'~<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>(*SKIP)(?!)|(?:text|simple)~'
https://regex101.com/r/7ZGlvW/1
Formatted
<
(?:
   (?:
      (?:
         # Invisible content; end tag req'd
         (                    # (1 start)
            script
          | style
          | object
          | embed
          | applet
          | noframes
          | noscript
          | noembed
         )                    # (1 end)
         (?:
            \s+
            (?>
               " [\S\s]*? "
             | ' [\S\s]*? '
             | (?:
                  (?! /> )
                  [^>]
               )?
            )+
         )?
         \s* >
      )
      [\S\s]*? </ \1 \s*
      (?= > )
   )
 | (?: /? [\w:]+ \s* /? )
 | (?:
      [\w:]+
      \s+
      (?:
         " [\S\s]*? "
       | ' [\S\s]*? '
       | [^>]?
      )+
      \s* /?
   )
 | \? [\S\s]*? \?
 | (?:
      !
      (?:
         (?: DOCTYPE [\S\s]*? )
       | (?: \[CDATA\[ [\S\s]*? \]\] )
       | (?: -- [\S\s]*? -- )
       | (?: ATTLIST [\S\s]*? )
       | (?: ENTITY [\S\s]*? )
       | (?: ELEMENT [\S\s]*? )
      )
   )
)
>
(*SKIP)
(?!)
|
(?: text | simple )
Or, a much faster approach is to match both tags AND the text you're
looking for.
Matching the tags moves past them.
If you're doing a replace, use a callback to determine what to replace.
Group 1 is a TAG or an Invisible Content run.
Group 3 is the words you're looking to replace.
So, in the callback, if group 1 matched, just return group 1.
If group 3 matched, replace with what you want to replace it with.
The regex
'~(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\2\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)|(text|simple)~'
https://regex101.com/r/7ZGlvW/2
This regex is comparable to how SAX and DOM parsers parse tags.
I've posted this hundreds of times on SO.
Here is an example of how to remove all html tags:
https://regex101.com/r/oCVkZv/1
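The callback approach described in this answer might look like the sketch below. To keep it short, the tag alternative is reduced to script blocks only, which is a simplification and not the full pattern from the answer; as a result the words sit in group 2 here rather than group 3. The i and s flags are assumptions as well.

```php
<?php
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

// Group 1: a whole script block (simplified stand-in for the full tag pattern).
// Group 2: the words to replace.
$pattern = '~(<script\b.*?</script>)|(text|simple)~is';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // A non-empty group 1 means a script block matched: return it unchanged.
        if ($m[1] !== '') {
            return $m[1];
        }
        // Otherwise group 2 matched one of the target words.
        return '***replaced***';
    },
    $html
);

echo $result;
```

Because the script block is consumed as a match of its own, the engine never tries the word alternative inside it, which is the point the answer makes about matching the tags to move past them.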
This regex works fine but uses a lot of memory, causing these errors: Firefox: The connection was reset. Chrome: (net::ERR_CONNECTION_RESET): The connection was reset. IE: Internet Explorer cannot display the webpage.
– Paulo A. Costa
Aug 28 '17 at 1:22
@PauloACosta - I see you've accepted a skip/fail answer as I originally posted. But, as I said, it is impossible to ignore a single tag without parsing all tags. And using skip/fail with my regex will be slower. Where you get that MEMORY problem is not from the regex. Otherwise, for speed, I said not to use skip/fail and instead just match both tags and text you need using my later regex. You made the wrong choice in an answer. That's too bad...
– sln
Aug 28 '17 at 22:04
answered Aug 27 '17 at 3:23 by mickmackusa, edited Aug 29 '17 at 0:15
answered Aug 26 '17 at 22:41 by yacc
answered Aug 26 '17 at 22:49 by JBone, edited Aug 26 '17 at 23:18
I´m trying to replace string outside the <script></script> tag.
– Paulo A. Costa
Aug 26 '17 at 23:05
yes, this is captured in group 1 as regex101 highlightedThis is simple html text
– JBone
Aug 26 '17 at 23:08
Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".
– Paulo A. Costa
Aug 26 '17 at 23:15
ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions
– JBone
Aug 26 '17 at 23:16
you still have questions or did this solution work?
– JBone
Aug 27 '17 at 12:21
add a comment |
I´m trying to replace string outside the <script></script> tag.
– Paulo A. Costa
Aug 26 '17 at 23:05
yes, this is captured in group 1 as regex101 highlightedThis is simple html text
– JBone
Aug 26 '17 at 23:08
Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".
– Paulo A. Costa
Aug 26 '17 at 23:15
ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions
– JBone
Aug 26 '17 at 23:16
you still have questions or did this solution work?
– JBone
Aug 27 '17 at 12:21
I´m trying to replace string outside the <script></script> tag.
– Paulo A. Costa
Aug 26 '17 at 23:05
I´m trying to replace string outside the <script></script> tag.
– Paulo A. Costa
Aug 26 '17 at 23:05
yes, this is captured in group 1 as regex101 highlighted
This is simple html text
– JBone
Aug 26 '17 at 23:08
yes, this is captured in group 1 as regex101 highlighted
This is simple html text
– JBone
Aug 26 '17 at 23:08
Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".
– Paulo A. Costa
Aug 26 '17 at 23:15
Match 2 is inside the tag and the last word "text" is not being selected. and finally, this is trying to ignore all tags, not the specifc tag "script".
– Paulo A. Costa
Aug 26 '17 at 23:15
ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions
– JBone
Aug 26 '17 at 23:16
ha .. I see the problem ... I missed seeing that second text. I updated my answer and the regex demo. Let me know if you still have issues/questions
– JBone
Aug 26 '17 at 23:16
you still have questions or did this solution work?
– JBone
Aug 27 '17 at 12:21
you still have questions or did this solution work?
– JBone
Aug 27 '17 at 12:21
add a comment |
Just an fyi, as far as tags go, it is impossible to ignore a single tag
without parsing all tags.
You can SKIP/FAIL past html tags and invisible content.
This will find the words you're looking for.
'~<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:s+(?>"[Ss]*?"|'[Ss]*?'|(?:(?!/>)[^>])?)+)?s*>)[Ss]*?</1s*(?=>))|(?:/?[w:]+s*/?)|(?:[w:]+s+(?:"[Ss]*?"|'[Ss]*?'|[^>]?)+s*/?)|?[Ss]*??|(?:!(?:(?:DOCTYPE[Ss]*?)|(?:[CDATA[[Ss]*?]])|(?:--[Ss]*?--)|(?:ATTLIST[Ss]*?)|(?:ENTITY[Ss]*?)|(?:ELEMENT[Ss]*?))))>(*SKIP)(?!)|(?:text|simple)~'
https://regex101.com/r/7ZGlvW/1
Formated
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (1 start)
script
| style
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (1 end)
(?:
s+
(?>
" [Ss]*? "
| ' [Ss]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
s* >
)
[Ss]*? </ 1 s*
(?= > )
)
| (?: /? [w:]+ s* /? )
| (?:
[w:]+
s+
(?:
" [Ss]*? "
| ' [Ss]*? '
| [^>]?
)+
s* /?
)
| ? [Ss]*? ?
| (?:
!
(?:
(?: DOCTYPE [Ss]*? )
| (?: [CDATA[ [Ss]*? ]] )
| (?: -- [Ss]*? -- )
| (?: ATTLIST [Ss]*? )
| (?: ENTITY [Ss]*? )
| (?: ELEMENT [Ss]*? )
)
)
)
>
(*SKIP)
(?!)
|
(?: text | simple )
Or, a much faster approach is to match both tags AND the text you're
looking for.
Matching the tags moves past them.
If you're doing a replace, use a callback to determine what to replace.
Group 1 is a TAG or an Invisible Content run.
Group 3 is the words you're looking to replace.
So, in the callback, if group 1 matched, just return group 1.
If group 3 matched, replace with what you want to replace it with.
The regex
'~(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:s+(?>"[Ss]*?"|'[Ss]*?'|(?:(?!/>)[^>])?)+)?s*>)[Ss]*?</2s*(?=>))|(?:/?[w:]+s*/?)|(?:[w:]+s+(?:"[Ss]*?"|'[Ss]*?'|[^>]?)+s*/?)|?[Ss]*??|(?:!(?:(?:DOCTYPE[Ss]*?)|(?:[CDATA[[Ss]*?]])|(?:--[Ss]*?--)|(?:ATTLIST[Ss]*?)|(?:ENTITY[Ss]*?)|(?:ELEMENT[Ss]*?))))>)|(text|simple)~'
https://regex101.com/r/7ZGlvW/2
This regex is comparable to how SAX and DOM parsers parse tags.
I've posted this hundreds of times on SO.
Here is an example of how to remove all html tags:
https://regex101.com/r/oCVkZv/1
This regEx works fine, but use a lot of memory, causing the error: Firefox: The connection was reset Chrome: (net::ERR_CONNECTION_RESET): The connection was reset. IE: Internet Explorer cannot display the webpage
– Paulo A. Costa
Aug 28 '17 at 1:22
@PauloACosta - I see you've accepted a skip/fail answer as I originally posted. But, as I saidit is impossible to ignore a single tag without parsing all tags
. And using skip/fail with my regex will be slower. Where you get that MEMORY problem is not from the regex. Otherwise, for speed, I said not to use skip/fail and instead just match both tags and text you need using my later regex. You made the wrong choice in an answer. That's too bad...
– sln
Aug 28 '17 at 22:04
add a comment |
Just an fyi, as far as tags go, it is impossible to ignore a single tag
without parsing all tags.
You can SKIP/FAIL past html tags and invisible content.
This will find the words you're looking for.
'~<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:s+(?>"[Ss]*?"|'[Ss]*?'|(?:(?!/>)[^>])?)+)?s*>)[Ss]*?</1s*(?=>))|(?:/?[w:]+s*/?)|(?:[w:]+s+(?:"[Ss]*?"|'[Ss]*?'|[^>]?)+s*/?)|?[Ss]*??|(?:!(?:(?:DOCTYPE[Ss]*?)|(?:[CDATA[[Ss]*?]])|(?:--[Ss]*?--)|(?:ATTLIST[Ss]*?)|(?:ENTITY[Ss]*?)|(?:ELEMENT[Ss]*?))))>(*SKIP)(?!)|(?:text|simple)~'
https://regex101.com/r/7ZGlvW/1
Formated
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (1 start)
script
| style
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (1 end)
(?:
s+
(?>
" [Ss]*? "
| ' [Ss]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
s* >
)
[Ss]*? </ 1 s*
(?= > )
)
| (?: /? [w:]+ s* /? )
| (?:
[w:]+
s+
(?:
" [Ss]*? "
| ' [Ss]*? '
| [^>]?
)+
s* /?
)
| ? [Ss]*? ?
| (?:
!
(?:
(?: DOCTYPE [Ss]*? )
| (?: [CDATA[ [Ss]*? ]] )
| (?: -- [Ss]*? -- )
| (?: ATTLIST [Ss]*? )
| (?: ENTITY [Ss]*? )
| (?: ELEMENT [Ss]*? )
)
)
)
>
(*SKIP)
(?!)
|
(?: text | simple )
Or, a much faster approach is to match both tags AND the text you're
looking for.
Matching the tags moves past them.
If you're doing a replace, use a callback to determine what to replace.
Group 1 is a TAG or an Invisible Content run.
Group 3 is the words you're looking to replace.
So, in the callback, if group 1 matched, just return group 1.
If group 3 matched, replace with what you want to replace it with.
The regex
'~(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:s+(?>"[Ss]*?"|'[Ss]*?'|(?:(?!/>)[^>])?)+)?s*>)[Ss]*?</2s*(?=>))|(?:/?[w:]+s*/?)|(?:[w:]+s+(?:"[Ss]*?"|'[Ss]*?'|[^>]?)+s*/?)|?[Ss]*??|(?:!(?:(?:DOCTYPE[Ss]*?)|(?:[CDATA[[Ss]*?]])|(?:--[Ss]*?--)|(?:ATTLIST[Ss]*?)|(?:ENTITY[Ss]*?)|(?:ELEMENT[Ss]*?))))>)|(text|simple)~'
https://regex101.com/r/7ZGlvW/2
This regex is comparable to how SAX and DOM parsers parse tags.
I've posted this hundreds of times on SO.
Here is an example of how to remove all html tags:
https://regex101.com/r/oCVkZv/1
This regEx works fine, but use a lot of memory, causing the error: Firefox: The connection was reset Chrome: (net::ERR_CONNECTION_RESET): The connection was reset. IE: Internet Explorer cannot display the webpage
– Paulo A. Costa
Aug 28 '17 at 1:22
@PauloACosta - I see you've accepted a skip/fail answer as I originally posted. But, as I saidit is impossible to ignore a single tag without parsing all tags
. And using skip/fail with my regex will be slower. Where you get that MEMORY problem is not from the regex. Otherwise, for speed, I said not to use skip/fail and instead just match both tags and text you need using my later regex. You made the wrong choice in an answer. That's too bad...
– sln
Aug 28 '17 at 22:04
add a comment |
Just an fyi, as far as tags go, it is impossible to ignore a single tag
without parsing all tags.
You can SKIP/FAIL past html tags and invisible content.
This will find the words you're looking for.
'~<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:s+(?>"[Ss]*?"|'[Ss]*?'|(?:(?!/>)[^>])?)+)?s*>)[Ss]*?</1s*(?=>))|(?:/?[w:]+s*/?)|(?:[w:]+s+(?:"[Ss]*?"|'[Ss]*?'|[^>]?)+s*/?)|?[Ss]*??|(?:!(?:(?:DOCTYPE[Ss]*?)|(?:[CDATA[[Ss]*?]])|(?:--[Ss]*?--)|(?:ATTLIST[Ss]*?)|(?:ENTITY[Ss]*?)|(?:ELEMENT[Ss]*?))))>(*SKIP)(?!)|(?:text|simple)~'
https://regex101.com/r/7ZGlvW/1
Formated
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (1 start)
script
| style
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (1 end)
(?:
s+
(?>
" [Ss]*? "
| ' [Ss]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
s* >
)
[Ss]*? </ 1 s*
(?= > )
)
| (?: /? [w:]+ s* /? )
| (?:
[w:]+
s+
(?:
" [Ss]*? "
| ' [Ss]*? '
| [^>]?
)+
s* /?
)
| ? [Ss]*? ?
| (?:
!
(?:
(?: DOCTYPE [Ss]*? )
| (?: [CDATA[ [Ss]*? ]] )
| (?: -- [Ss]*? -- )
| (?: ATTLIST [Ss]*? )
| (?: ENTITY [Ss]*? )
| (?: ELEMENT [Ss]*? )
)
)
)
>
(*SKIP)
(?!)
|
(?: text | simple )
Or, a much faster approach is to match both tags AND the text you're
looking for.
Matching the tags moves past them.
If you're doing a replace, use a callback to determine what to replace.
Group 1 is a TAG or an Invisible Content run.
Group 3 is the words you're looking to replace.
So, in the callback, if group 1 matched, just return group 1.
If group 3 matched, replace with what you want to replace it with.
The regex
'~(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\2\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)|(text|simple)~'
https://regex101.com/r/7ZGlvW/2
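The callback described above could look roughly like this in PHP. This is a sketch under assumptions: the variable names and the `strtoupper()` replacement are made up for illustration, and the input is the sample string from the question. Group 1 is the tag or invisible-content run, group 2 is the script/style/... name used by the backreference, and group 3 is the word.

```php
<?php
// Tag sub-pattern from the answer, wrapped in capture group 1.
// A nowdoc avoids having to escape quotes and backslashes for PHP.
$tagOrHidden = <<<'RE'
(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\2\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)
RE;

// Group 3 holds the words to replace (hypothetical list).
$pattern = '~' . $tagOrHidden . '|(text|simple)~';

$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // PHP omits trailing unmatched groups from $m, so test group 3 defensively.
        if (isset($m[3]) && $m[3] !== '') {
            return strtoupper($m[3]); // illustrative replacement only
        }
        // Group 1 matched: a tag or invisible content, returned unchanged.
        return $m[1];
    },
    $html
);

if ($result === null) {
    // preg_replace_callback() returns null when PCRE reports an error.
    throw new RuntimeException('PCRE error code: ' . preg_last_error());
}
echo $result, PHP_EOL;
```

Because the tags are consumed as ordinary matches, the engine moves past them without the extra skip/fail step, and the callback decides per match whether anything changes.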
This regex is comparable to how SAX and DOM parsers parse tags.
I've posted this hundreds of times on SO.
Here is an example of how to remove all html tags:
https://regex101.com/r/oCVkZv/1
edited Aug 27 '17 at 0:58
answered Aug 27 '17 at 0:26
sln
Do you absolutely need matching, or will capturing groups do?
– Vivick
Aug 26 '17 at 22:23
When you want to parse html with confidence, use an html parser not regex. SO says this over and over and over. IIRC there is even a note that the SO software pops up that says "don't use regex to parse html".
– mickmackusa
Aug 27 '17 at 2:48
@mickmackusa, but parsers stop working when they have to parse malformed HTML. I think this question is not a duplicate, because I'm not trying to strip tags; I'm trying to replace content outside the "script" tag.
– Paulo A. Costa
Aug 27 '17 at 3:00
Retracted dupe link, it is merely related.
– mickmackusa
Aug 27 '17 at 3:56