Faster implementation of pandas apply function



























I have a pandas DataFrame in which I would like to check, row by row, whether the string in one column is contained in the string in another column.



Suppose:



import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})


I would like to check if df.B[0] is in df.A[0], df.B[1] is in df.A[1] etc.



Current approach



I have the following apply function implementation



df.apply(lambda x: x[1] in x[0], axis=1)


The result is a Series of [True, False, True], which is fine, but my DataFrame has millions of rows and this takes quite long.

Is there a better (i.e. faster) implementation?



Unsuccessful approach



I tried the pandas.Series.str.contains approach, but it can only take a string for the pattern.



df['A'].str.contains(df['B'], regex=False)









      python string pandas apply














      edited Jan 18 at 23:50 by coldspeed

      asked Dec 25 '17 at 17:55 by dimitris_ps
























          4 Answers
































          Use np.vectorize. It bypasses the overhead of apply, so it should be noticeably faster (see the timings below).



          v = np.vectorize(lambda x, y: y in x)

          v(df.A, df.B)
          array([ True, False, True], dtype=bool)




          Here's a timings comparison -



          df = pd.concat([df] * 10000)

          %timeit df.apply(lambda x: x[1] in x[0], axis=1)
          1 loop, best of 3: 1.32 s per loop

          %timeit v(df.A, df.B)
          100 loops, best of 3: 5.55 ms per loop

          # Psidom's answer
          %timeit [b in a for a, b in zip(df.A, df.B)]
          100 loops, best of 3: 3.34 ms per loop


          Both are pretty competitive options!



          Edit: adding timings for Wen's and MaxU's answers:



          # Wen's answer
          %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
          10 loops, best of 3: 49.1 ms per loop

          # MaxU's answer
          %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
          10 loops, best of 3: 87.8 ms per loop
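The %timeit figures above come from IPython and will differ from machine to machine. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The names `small`, `big`, `with_zip` and `with_vectorize` are made up for this illustration, and the sketch only compares the two fastest approaches from this thread.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': ['some text here', 'another text', 'and this'],
    'B': ['some', 'somethin', 'this'],
})
# Same enlargement as in the answers: 3 rows repeated 10000 times.
big = pd.concat([small] * 10000, ignore_index=True)


def with_zip(frame):
    # Plain Python loop over the paired column values.
    return [b in a for a, b in zip(frame['A'], frame['B'])]


# otypes=[bool] fixes the output dtype instead of letting NumPy infer it.
v = np.vectorize(lambda x, y: y in x, otypes=[bool])


def with_vectorize(frame):
    return v(frame['A'].to_numpy(), frame['B'].to_numpy())


for name, fn in [('zip', with_zip), ('np.vectorize', with_vectorize)]:
    # Best of 3 repeats, 10 calls per repeat, reported per call.
    best = min(timeit.repeat(lambda: fn(big), number=10, repeat=3)) / 10
    print(f'{name}: {best * 1000:.2f} ms per call')
```

Taking the minimum over repeats is one common way to reduce noise; absolute numbers should not be compared with the ones quoted in the answers.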































          • This is great, thnx

            – dimitris_ps
            Dec 25 '17 at 18:07











          • @dimitris_ps You're welcome. You actually get a few speed improvements if you 1) pass a user defined function instead of lambda, and 2) you pass df.*.values instead of df.* to v.

            – coldspeed
            Dec 25 '17 at 18:12
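The two suggestions in the comment above could look roughly like this. This is only a sketch: the function name `contains` is made up here, `to_numpy()` is used as the current spelling of `.values`, and whether it is actually faster on your data is something to measure rather than assume.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['some text here', 'another text', 'and this'],
    'B': ['some', 'somethin', 'this'],
})


def contains(haystack, needle):
    # Plain Python substring test on one pair of strings.
    return needle in haystack


# 1) a named function instead of a lambda;
#    otypes=[bool] declares the output dtype up front.
v = np.vectorize(contains, otypes=[bool])

# 2) raw NumPy arrays instead of the Series objects.
result = v(df['A'].to_numpy(), df['B'].to_numpy())
print(result)
```

The result is a NumPy boolean array, so it can be used directly as a mask, e.g. `df[result]`.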











          • Hi, can you test my speed :-)

            – W-B
            Dec 25 '17 at 18:22






          • @Wen Done! I don't know what it's doing, but I like it!

            – coldspeed
            Dec 25 '17 at 18:25






          • This is a small trick by np.nan infection :-) stackoverflow.com/questions/46944650/…

            – W-B
            Dec 25 '17 at 18:27

































          Try zip; it's significantly faster than apply in this case:



          df = pd.concat([df] * 10000)
          df.head()
          #                 A         B
          # 0  some text here      some
          # 1    another text  somethin
          # 2        and this      this
          # 0  some text here      some
          # 1    another text  somethin

          %timeit df.apply(lambda x: x[1] in x[0], axis=1)
          # 1 loop, best of 3: 697 ms per loop

          %timeit [b in a for a, b in zip(df.A, df.B)]
          # 100 loops, best of 3: 3.53 ms per loop

          # @coldspeed's np.vectorize solution
          %timeit v(df.A, df.B)
          # 100 loops, best of 3: 4.18 ms per loop
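The list comprehension returns a plain Python list. If you want a Series aligned with the original frame, as apply gives you, something like the following should work. The `safe_contains` helper and the column with a missing value are assumptions added for illustration; the question does not say whether the data contains NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['some text here', 'another text', 'and this', np.nan],
    'B': ['some', 'somethin', 'this', 'this'],
})


def safe_contains(haystack, needle):
    # `needle in haystack` raises TypeError when haystack is NaN (a float),
    # so treat any non-string pair as "not contained".
    if not isinstance(haystack, str) or not isinstance(needle, str):
        return False
    return needle in haystack


# Reuse the frame's index so the mask lines up with df for filtering.
mask = pd.Series(
    [safe_contains(a, b) for a, b in zip(df['A'], df['B'])],
    index=df.index,
    name='B_in_A',
)
print(df[mask])
```

If the columns are guaranteed to hold only strings, the bare `b in a` from the answer is enough and avoids the extra function call per row.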





























          • This is great, thnx

            – dimitris_ps
            Dec 25 '17 at 18:07











          • I wish i could accept both answers, i will accept cᴏʟᴅsᴘᴇᴇᴅ just because he/she has lower rep. Thanks again!

            – dimitris_ps
            Dec 25 '17 at 18:19

































          UPDATE: we can also try to use numba:



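          import numpy as np
          from numba import jit

          # a and b are object arrays of Python strings; depending on the numba
          # version this may need @jit(forceobj=True) to run in object mode.
          @jit
          def check_b_in_a(a, b):
              result = np.zeros(len(a), dtype=bool)
              for i in range(len(a)):
                  # Substring test for the i-th pair of strings.
                  if b[i] in a[i]:
                      result[i] = True
              return result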

          In [100]: check_b_in_a(df.A.values, df.B.values)
          Out[100]: array([ True, False, True], dtype=bool)


          Yet another vectorized solution:



          In [50]: df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
          Out[50]:
          0 True
          1 False
          2 True
          dtype: bool
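One thing to be aware of with the split-based version: it compares B against whole whitespace-separated words of A, so it is not the same test as the substring check `b in a` from the question. The small frame `df2` below is made up to show a case where the two disagree.

```python
import pandas as pd

df2 = pd.DataFrame({
    'A': ['something else', 'and this'],
    'B': ['some', 'this'],
})

# Substring semantics, as in the question: 'some' is inside 'something'.
substring = pd.Series(
    [b in a for a, b in zip(df2['A'], df2['B'])],
    index=df2.index,
)

# Whole-word semantics: 'some' equals neither 'something' nor 'else'.
whole_word = df2['A'].str.split(expand=True).eq(df2['B'], axis=0).any(axis=1)

print(pd.DataFrame({'substring': substring, 'whole_word': whole_word}))
```

Which behaviour is wanted depends on the data; for the example in the question both give [True, False, True].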


          NOTE: it's much slower compared to Psidom's and COLDSPEED's solutions:



          In [51]: df = pd.concat([df] * 10000)

          # Psidom
          In [52]: %timeit [b in a for a, b in zip(df.A, df.B)]
          7.45 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

          # cᴏʟᴅsᴘᴇᴇᴅ
          In [53]: %timeit v(df.A, df.B)
          15.4 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

          # MaxU (1)
          In [54]: %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
          185 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

          # MaxU (2)
          In [103]: %timeit check_b_in_a(df.A.values, df.B.values)
          22.7 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

          # Wen
          In [104]: %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
          134 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)































          • Look ma, no loops! I like this one too.

            – coldspeed
            Dec 25 '17 at 18:26











          • @cᴏʟᴅsᴘᴇᴇᴅ, well, it's the slowest one ;-)

            – MaxU
            Dec 25 '17 at 18:28








          • Actually mine is slower, by a decade or two. Thnx

            – dimitris_ps
            Dec 25 '17 at 18:30

































          Using replace with regex=True and NaN "infection":



          df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
          Out[84]:
          0 True
          1 False
          2 True
          Name: A, dtype: bool


          To fix your str.contains attempt, join the values of B into a single pattern:



          df['A'].str.contains('|'.join(df.B.tolist()))
          Out[91]:
          0 True
          1 False
          2 True
          Name: A, dtype: bool
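A caveat worth checking on your own data: the joined '|' pattern tests every row of A against all values of B, not only the value in the same row, and the replace version with a dict of patterns likely behaves the same way. On the example from the question the results happen to coincide, but the made-up frame `df3` below shows a case where they differ. The sketch also uses re.escape, on the assumption that values in B may contain regex metacharacters.

```python
import re

import pandas as pd

df3 = pd.DataFrame({
    'A': ['some text here', 'this one'],
    'B': ['this', 'some'],
})

# Row-wise test: is B[i] inside A[i]?
row_wise = [b in a for a, b in zip(df3['A'], df3['B'])]

# Joined pattern: is ANY value of B inside A[i]?
pattern = '|'.join(re.escape(s) for s in df3['B'])
any_b = df3['A'].str.contains(pattern)

print(row_wise)
print(any_b.tolist())
```

Here neither row contains its own B value, yet both rows match the joined pattern because each contains the other row's B value.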





            np.vectorize answer: answered Dec 25 '17 at 18:00 by coldspeed, edited Dec 25 '17 at 18:24













            zip answer: answered Dec 25 '17 at 18:00 by Psidom













            • This is great, thnx

              – dimitris_ps
              Dec 25 '17 at 18:07











            • I wish i could accept both answers, i will accept cᴏʟᴅsᴘᴇᴇᴅ just because he/she has lower rep. Thanks again!

              – dimitris_ps
              Dec 25 '17 at 18:19



















            • This is great, thnx

              – dimitris_ps
              Dec 25 '17 at 18:07











            • I wish i could accept both answers, i will accept cᴏʟᴅsᴘᴇᴇᴅ just because he/she has lower rep. Thanks again!

              – dimitris_ps
              Dec 25 '17 at 18:19

















            This is great, thnx

            – dimitris_ps
            Dec 25 '17 at 18:07





            This is great, thnx

            – dimitris_ps
            Dec 25 '17 at 18:07













            I wish i could accept both answers, i will accept cᴏʟᴅsᴘᴇᴇᴅ just because he/she has lower rep. Thanks again!

            – dimitris_ps
            Dec 25 '17 at 18:19





            I wish i could accept both answers, i will accept cᴏʟᴅsᴘᴇᴇᴅ just because he/she has lower rep. Thanks again!

            – dimitris_ps
            Dec 25 '17 at 18:19











            3














            UPDATE: we can also try to use numba:



            from numba import jit

            @jit
            def check_b_in_a(a,b):
            result = np.zeros(len(a)).astype('bool')
            for i in range(len(a)):
            t = b[i] in a[i]
            if t:
            result[i] = t
            return result

            In [100]: check_b_in_a(df.A.values, df.B.values)
            Out[100]: array([ True, False, True], dtype=bool)


            yet another vectorized solution:



            In [50]: df['A'].str.split(expand=True).eq(df['B'], axis=0).any(1)
            Out[50]:
            0 True
            1 False
            2 True
            dtype: bool


            NOTE: it's much slower compared to Psidom's and COLDSPEED's solutions:



            In [51]: df = pd.concat([df] * 10000)

            # Psidom
            In [52]: %timeit [b in a for a, b in zip(df.A, df.B)]
            7.45 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

            # cᴏʟᴅsᴘᴇᴇᴅ
            In [53]: %timeit v(df.A, df.B)
            15.4 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

            # MaxU (1)
            In [54]: %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(1)
            185 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

            # MaxU (2)
            In [103]: %timeit check_b_in_a(df.A.values, df.B.values)
            22.7 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

            # Wen
            In [104]: %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
            134 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)





            share|improve this answer


























            • Look ma, no loops! I like this one too.

              – coldspeed
              Dec 25 '17 at 18:26











            • @cᴏʟᴅsᴘᴇᴇᴅ, well, it's the slowest one ;-)

              – MaxU
              Dec 25 '17 at 18:28








            • 2





              Actually mine is slower, by a decade or two. Thnx

              – dimitris_ps
              Dec 25 '17 at 18:30
















            3














            UPDATE: we can also try to use numba:



            from numba import jit

            @jit
            def check_b_in_a(a,b):
            result = np.zeros(len(a)).astype('bool')
            for i in range(len(a)):
            t = b[i] in a[i]
            if t:
            result[i] = t
            return result

            In [100]: check_b_in_a(df.A.values, df.B.values)
            Out[100]: array([ True, False, True], dtype=bool)


            yet another vectorized solution:



            In [50]: df['A'].str.split(expand=True).eq(df['B'], axis=0).any(1)
            Out[50]:
            0 True
            1 False
            2 True
            dtype: bool


            NOTE: it's much slower compared to Psidom's and COLDSPEED's solutions:



            In [51]: df = pd.concat([df] * 10000)

            # Psidom
            In [52]: %timeit [b in a for a, b in zip(df.A, df.B)]
            7.45 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

            # cᴏʟᴅsᴘᴇᴇᴅ
            In [53]: %timeit v(df.A, df.B)
            15.4 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

            # MaxU (1)
            In [54]: %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(1)
            185 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

            # MaxU (2)
            In [103]: %timeit check_b_in_a(df.A.values, df.B.values)
            22.7 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

            # Wen
            In [104]: %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
            134 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)





            share|improve this answer


























            • Look ma, no loops! I like this one too.

              – coldspeed
              Dec 25 '17 at 18:26











            • @cᴏʟᴅsᴘᴇᴇᴅ, well, it's the slowest one ;-)

              – MaxU
              Dec 25 '17 at 18:28








            • Actually mine is slower, by a decade or two. Thnx

              – dimitris_ps
              Dec 25 '17 at 18:30






















            edited Dec 25 '17 at 19:59

























            answered Dec 25 '17 at 18:22









            MaxU



























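            Using replace with regex=True, mapping every value of B to NaN and then checking for nulls (note that each row of A is matched against all values of B, not only the one in the same row):



            df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
            Out[84]:
            0     True
            1    False
            2     True
            Name: A, dtype: bool


            To fix your str.contains attempt, join the values of B into a single regex alternation (this likewise tests each row against every value of B):



            df['A'].str.contains('|'.join(df.B.tolist()))
            Out[91]:
            0     True
            1    False
            2     True
            Name: A, dtype: bool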
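A joined pattern such as the one above answers a slightly different question: it checks whether a row of A contains any value of B, not the value in its own row. The question's sample data happens to hide this. The sketch below uses slightly altered data (the second row is a made-up example) to show the difference, and wraps each value in `re.escape` on the assumption that B holds literal text rather than regular expressions.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "A": ["some text here", "this one", "and this"],
    "B": ["some", "text", "this"],
})

# Joined pattern: True if a row of A contains ANY value of B.
# re.escape keeps characters such as "." or "(" in B from acting as regex syntax.
pattern = "|".join(re.escape(b) for b in df["B"])
any_b = df["A"].str.contains(pattern)

# Row-wise check: True only if a row of A contains ITS OWN value of B.
own_b = pd.Series([b in a for a, b in zip(df["A"], df["B"])], index=df.index)
```

Here the second row, "this one", does not contain its own B value "text", but it does contain "this" from the third row, so the joined pattern reports a match while the row-wise check does not.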













                edited Dec 25 '17 at 21:06

























                answered Dec 25 '17 at 18:22









                W-B





























