Faster implementation of pandas apply function

I have a pandas DataFrame in which I would like to check, row by row, whether the string in one column is contained in the string in another column.

Suppose:

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})

I would like to check if df.B[0] is in df.A[0], df.B[1] is in df.A[1] etc.

Current approach

I currently use the following apply implementation:

df.apply(lambda x: x[1] in x[0], axis=1)

The result is a Series of [True, False, True], which is correct, but for a DataFrame of my size (millions of rows) it takes quite a long time.

Is there a better (i.e. faster) implementation?

Unsuccessful approach

I tried the pandas.Series.str.contains approach, but it accepts only a single string as the pattern, not a Series:

df['A'].str.contains(df['B'], regex=False)

asked Dec 25 '17 at 17:55 by dimitris_ps
edited Jan 18 at 23:50 by coldspeed

python string pandas apply


4 Answers

Use np.vectorize. It avoids the per-row overhead of apply, so it should be noticeably faster (internally it is still a Python-level loop, not compiled vectorization).

v = np.vectorize(lambda x, y: y in x)

v(df.A, df.B)
array([ True, False,  True], dtype=bool)
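A self-contained version of the same idea may be useful, since the snippet above assumes that np, pd and df already exist. The sketch below uses a named function instead of a lambda and passes NumPy arrays instead of Series; the names contains and contains_vec are made up for illustration, and otypes=[bool] is an optional extra that is not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})


def contains(haystack, needle):
    """Return True if `needle` is a substring of `haystack` (hypothetical helper)."""
    return needle in haystack


# otypes=[bool] fixes the output dtype up front, so NumPy does not have to
# infer it from a trial call on the first pair of elements.
contains_vec = np.vectorize(contains, otypes=[bool])

# Plain NumPy arrays are passed in instead of the Series themselves.
result = contains_vec(df['A'].to_numpy(), df['B'].to_numpy())
print(result)
```

The result is a NumPy boolean array with one entry per row; wrap it in pd.Series(result, index=df.index) if the original index is needed.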

Here's a timing comparison:

df = pd.concat([df] * 10000)

%timeit df.apply(lambda x: x[1] in x[0], axis=1)
1 loop, best of 3: 1.32 s per loop

%timeit v(df.A, df.B)
100 loops, best of 3: 5.55 ms per loop

# Psidom's answer
%timeit [b in a for a, b in zip(df.A, df.B)]
100 loops, best of 3: 3.34 ms per loop

Both are pretty competitive options!
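The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The names small, big, candidates and results are illustrative, the frame is scaled down to 3,000 rows to keep the run short, and the apply call uses column labels rather than positions; absolute numbers will differ from those quoted in the answer.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                      'B': ['some', 'somethin', 'this']})
# 1,000 copies (3,000 rows) instead of the 10,000 copies used in the answer.
big = pd.concat([small] * 1000, ignore_index=True)

v = np.vectorize(lambda x, y: y in x)

candidates = {
    'apply': lambda: big.apply(lambda row: row['B'] in row['A'], axis=1),
    'vectorize': lambda: v(big['A'], big['B']),
    'zip': lambda: [b in a for a, b in zip(big['A'], big['B'])],
}

# The three approaches must agree before their timings are worth comparing.
results = {name: list(fn()) for name, fn in candidates.items()}
assert results['apply'] == results['vectorize'] == results['zip']

for name, fn in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported per call.
    seconds = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f'{name:>9}: {seconds * 1e3:.2f} ms per call')
```

Checking that the outputs match before timing guards against comparing a fast approach that silently computes something different.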

Edit: adding timings for Wen's and MaxU's answers:

# Wen's answer
%timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
10 loops, best of 3: 49.1 ms per loop

# MaxU's answer
%timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
10 loops, best of 3: 87.8 ms per loop

answered Dec 25 '17 at 18:00 by coldspeed
edited Dec 25 '17 at 18:24

This is great, thnx – dimitris_ps Dec 25 '17 at 18:07

@dimitris_ps You're welcome. You actually get a few speed improvements if you 1) pass a user-defined function instead of a lambda, and 2) pass df.*.values instead of df.* to v. – coldspeed Dec 25 '17 at 18:12

Hi, can you test my speed :-) – W-B Dec 25 '17 at 18:22

@Wen Done! I don't know what it's doing, but I like it! – coldspeed Dec 25 '17 at 18:25

This is a small trick by np.nan infection :-) stackoverflow.com/questions/46944650/… – W-B Dec 25 '17 at 18:27

Try zip; it's significantly faster than apply in this case:

df = pd.concat([df] * 10000)
df.head()
#                A         B
#0  some text here      some
#1    another text  somethin
#2        and this      this
#0  some text here      some
#1    another text  somethin

%timeit df.apply(lambda x: x[1] in x[0], axis=1)
# 1 loop, best of 3: 697 ms per loop

%timeit [b in a for a, b in zip(df.A, df.B)]
# 100 loops, best of 3: 3.53 ms per loop

# @coldspeed's np.vectorize solution
%timeit v(df.A, df.B)
# 100 loops, best of 3: 4.18 ms per loop
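The list comprehension returns a plain Python list, so a natural next step is to store it as a column or use it as a mask. The sketch below shows one way to do that; the column name B_in_A and the variable matches are assumptions for illustration, not something from the answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})
df = pd.concat([df] * 2)  # index is now 0, 1, 2, 0, 1, 2 (duplicated labels)

# Assigning a plain list is positional, so the duplicated index labels
# produced by pd.concat do not cause any alignment problems.
df['B_in_A'] = [b in a for a, b in zip(df['A'], df['B'])]

# Use the flags as a positional boolean mask to keep only matching rows.
matches = df.loc[df['B_in_A'].to_numpy()]
print(matches)
```

Because the list is assigned by position, this works even when the index contains duplicates, as it does after the pd.concat used for the timings.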

answered Dec 25 '17 at 18:00 by Psidom

This is great, thnx – dimitris_ps Dec 25 '17 at 18:07

I wish I could accept both answers; I will accept cᴏʟᴅsᴘᴇᴇᴅ's just because he/she has lower rep. Thanks again! – dimitris_ps Dec 25 '17 at 18:19

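UPDATE: we can also try to use Numba:

import numpy as np
from numba import jit

# Assumption: recent Numba releases compile @jit in nopython mode by default,
# which does not support object-dtype string arrays; forceobj=True requests
# the object-mode behaviour this answer relied on.
@jit(forceobj=True)
def check_b_in_a(a, b):
    # One boolean flag per row, initialised to False.
    result = np.zeros(len(a), dtype=bool)
    for i in range(len(a)):
        # Substring test on the i-th pair of strings.
        result[i] = b[i] in a[i]
    return result

In [100]: check_b_in_a(df.A.values, df.B.values)
Out[100]: array([ True, False,  True], dtype=bool)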

Yet another vectorized solution:

In [50]: df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
Out[50]:
0     True
1    False
2     True
dtype: bool
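One thing worth checking before adopting the split-based expression: it compares B against whole whitespace-separated tokens of A, whereas the question's `in` test matches any substring. The two agree on the sample data but not in general. The small frame below is made up to show the difference.

```python
import pandas as pd

df = pd.DataFrame({'A': ['awesome text', 'some text'],
                   'B': ['some', 'some']})

# Substring test, as in the question: 'some' occurs inside 'awesome'.
substring = pd.Series([b in a for a, b in zip(df['A'], df['B'])],
                      index=df.index)

# Token test: split A on whitespace and compare each token with B.
whole_word = df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)

print(substring.tolist())   # [True, True]
print(whole_word.tolist())  # [False, True]
```

For the same reason, a value of B that itself contains a space can never match under the split-based test.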

NOTE: it's much slower than Psidom's and coldspeed's solutions:

In [51]: df = pd.concat([df] * 10000)

# Psidom
In [52]: %timeit [b in a for a, b in zip(df.A, df.B)]
7.45 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# cᴏʟᴅsᴘᴇᴇᴅ
In [53]: %timeit v(df.A, df.B)
15.4 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# MaxU (1)
In [54]: %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
185 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# MaxU (2)
In [103]: %timeit check_b_in_a(df.A.values, df.B.values)
22.7 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Wen
In [104]: %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
134 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

answered Dec 25 '17 at 18:22 by MaxU
edited Dec 25 '17 at 19:59

Look ma, no loops! I like this one too. – coldspeed Dec 25 '17 at 18:26

@cᴏʟᴅsᴘᴇᴇᴅ, well, it's the slowest one ;-) – MaxU Dec 25 '17 at 18:28

Actually mine is slower, by a decade or two. Thnx – dimitris_ps Dec 25 '17 at 18:30

Using replace and NaN infection:

df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
Out[84]:
0     True
1    False
2     True
Name: A, dtype: bool

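To fix your str.contains attempt (note that this tests each row of A against every value in B, not only the B value in the same row):

df['A'].str.contains('|'.join(df.B.tolist()))
Out[91]:
0     True
1    False
2     True
Name: A, dtype: bool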
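The joined pattern matches a row of A if it contains any value from column B, which is a different test from the row-by-row check in the question. The two happen to agree on the sample data. The frame below is a made-up counterexample, and re.escape is an addition not in the answer, used so that values of B are treated as literal text rather than regex syntax.

```python
import re

import pandas as pd

df = pd.DataFrame({'A': ['some text', 'this one'],
                   'B': ['this', 'some']})

# Row-wise test from the question: is B[i] a substring of A[i]?
row_wise = [b in a for a, b in zip(df['A'], df['B'])]

# Joined-pattern test: does A[i] contain ANY of the values in column B?
pattern = '|'.join(re.escape(b) for b in df['B'])
any_b = df['A'].str.contains(pattern)

print(row_wise)        # [False, False]
print(any_b.tolist())  # [True, True]
```

If the row-by-row semantics are required, the zip or np.vectorize approaches from the other answers are the ones that implement them.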

answered Dec 25 '17 at 18:22 by W-B
edited Dec 25 '17 at 21:06
