Do text and binary mode regex search always return the same result?

Python's doc says:

"Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes)."

But I was wondering whether searching with str and bytes always gives the same result. That is, does this function return True for every valid pattern and string?

    #!/usr/bin/env python3

    import re

    def test(pattern, string):
        # Search once in text mode and once on the UTF-8 encoded bytes.
        m = re.search(pattern, string)
        mb = re.search(pattern.encode(), string.encode())
        # If either search fails, the modes agree only when both fail.
        if m is None or mb is None:
            return m is None and mb is None
        i, j = m.span(0)
        ib, jb = mb.span(0)
        # Compare the matched text (encoded) with the matched bytes.
        return string[i:j].encode() == string.encode()[ib:jb]









Tags: python

asked Jan 18 at 17:43, edited Jan 18 at 18:01, by Cyker
























2 Answers














Answer: no.

Example: test('[–]', '–')

Note that this is an en dash (U+2013), not a hyphen. Any non-ASCII character should behave the same way, since it encodes to more than one byte in UTF-8.
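To see why this counterexample fails, it may help to look at what the encoded pattern actually contains. The following is a minimal sketch, assuming UTF-8 (the default for str.encode()); the variable names are only for illustration.

```python
import re

# '\u2013' is the en dash from the example, written as an escape.
pattern, string = '[\u2013]', '\u2013'

# Text mode: the class holds one character and the match covers it.
m = re.search(pattern, string)
text_span = m.span()

# Bytes mode: the en dash encodes to three bytes, so the class now holds
# three separate byte values and the match covers only the first byte.
encoded_pattern = pattern.encode()
mb = re.search(encoded_pattern, string.encode())
bytes_span = mb.span()
matched_bytes = mb.group()

print(text_span, encoded_pattern, bytes_span, matched_bytes)
```

Both spans are (0, 1), but one counts characters and the other counts bytes, so the matched text no longer lines up.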






• Did that throw an exception? The initial code did not consider the case of no match, for brevity. Now it is updated. – Cyker, Jan 18 at 17:54

• Nope, with neither the original code nor the updated one… I'm using Python 3.7, if that matters. – Sam Mason, Jan 18 at 17:54

• Saw your updates. It doesn't seem to be a Python version problem. More like a valid counterexample... – Cyker, Jan 18 at 17:55



















The main difference is in the character classes.

For example, U+00FF is "ÿ", which is a letter but not an ASCII character. So \w (which matches "word characters") behaves differently:

    re.search(r'\w', '\xFF')            # match
    re.search(rb'\w', b'\xFF')          # no match
    re.search(rb'\w', '\xFF'.encode())  # still no match

(Other non-ASCII Unicode letters would work too.)



If you look at https://docs.python.org/3/library/re.html, you can see the three classes this applies to:

\d

  For Unicode (str) patterns:
    Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.

  For 8-bit (bytes) patterns:
    Matches any decimal digit; this is equivalent to [0-9].

\s

  For Unicode (str) patterns:
    Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

  For 8-bit (bytes) patterns:
    Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\w

  For Unicode (str) patterns:
    Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

  For 8-bit (bytes) patterns:
    Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
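The same kind of mismatch can be shown for the digit and whitespace classes. This is only a sketch; the two sample characters are my own picks for illustration, not taken from the documentation.

```python
import re

# Hypothetical sample characters chosen for illustration.
arabic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE (category Nd)
nbsp = '\u00a0'           # NO-BREAK SPACE

# str patterns use the Unicode definitions of the classes...
digit_text = re.search(r'\d', arabic_three)
space_text = re.search(r'\s', nbsp)

# ...while bytes patterns only know the ASCII definitions, and the
# UTF-8 encodings of both samples contain no ASCII bytes at all.
digit_bytes = re.search(rb'\d', arabic_three.encode())
space_bytes = re.search(rb'\s', nbsp.encode())

print(bool(digit_text), bool(digit_bytes))
print(bool(space_text), bool(space_bytes))
```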





So if you set the ASCII flag (re.ASCII) on the str pattern, these three classes should behave mostly the same in both modes.
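A rough way to check this is to compare whether each class matches at all in both modes, with and without re.ASCII. This sketch assumes a small hand-picked sample set and a hypothetical helper named agrees; it compares only match presence, not spans.

```python
import re

# Hand-picked samples: three non-ASCII characters and three ASCII ones.
samples = ['\xff', '\u0663', '\u00a0', 'a', '7', ' ']
classes = [r'\w', r'\d', r'\s']

def agrees(cls, ch, flags=0):
    """Return True if text mode and bytes mode both match or both fail."""
    text_hit = re.search(cls, ch, flags) is not None
    bytes_hit = re.search(cls.encode(), ch.encode()) is not None
    return text_hit == bytes_hit

# Without the flag, at least one class/sample pair disagrees.
default_agree = all(agrees(c, s) for c in classes for s in samples)
# With re.ASCII on the str pattern, these samples all agree.
ascii_agree = all(agrees(c, s, re.ASCII) for c in classes for s in samples)

print(default_agree, ascii_agree)
```

Agreement on these samples says nothing about other constructs such as . or literal non-ASCII characters, which is why "mostly" is the right word.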



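For your exact function, an example would be: test(r'\w|.', '\xFF'), which returns False. In text mode \w matches "ÿ"; in bytes mode \w fails and . matches only the first byte of its UTF-8 encoding.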
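The two searches can also be traced by hand. This is a sketch assuming UTF-8 encoding; it also tries re.ASCII on the str pattern, which does not rescue this particular example because . still consumes a whole character in text mode.

```python
import re

pattern, string = r'\w|.', '\xff'   # '\xff' is U+00FF

# Text mode: \w matches the whole character.
m = re.search(pattern, string)
text_match = m.group().encode()

# Bytes mode: \w fails on the non-ASCII bytes, so . takes a single byte.
mb = re.search(pattern.encode(), string.encode())
bytes_match = mb.group()

# Text mode with re.ASCII: \w now fails too, but . still matches the
# whole character, so the comparison with bytes mode still fails.
m_ascii = re.search(pattern, string, re.ASCII)
ascii_text_match = m_ascii.group().encode()

print(text_match, bytes_match, ascii_text_match)
```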






First answer: answered Jan 18 at 17:51, edited Jan 18 at 17:54, by Sam Mason













Second answer: answered Jan 18 at 18:04 by Artyer





























