Faster implementation of pandas apply function
I have a pandas DataFrame in which I would like to check whether the value in one column is contained in another.
Suppose:
    import pandas as pd
    df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                       'B': ['some', 'somethin', 'this']})
I would like to check if df.B[0] is in df.A[0], df.B[1] is in df.A[1], etc.
Current approach
I have the following apply implementation:
    df.apply(lambda x: x[1] in x[0], axis=1)
The result is a Series of [True, False, True], which is fine, but for my DataFrame's size (millions of rows) it takes quite long.
Is there a better (i.e. faster) implementation?
Unsuccessful approach
I tried the pandas.Series.str.contains approach, but it can only take a single string as the pattern:
    df['A'].str.contains(df['B'], regex=False)
python string pandas apply
edited Jan 18 at 23:50 by coldspeed
asked Dec 25 '17 at 17:55 by dimitris_ps
4 Answers
Use np.vectorize - it bypasses the apply overhead, so it should be a bit faster.
    v = np.vectorize(lambda x, y: y in x)
    v(df.A, df.B)
    array([ True, False,  True], dtype=bool)
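For anyone who wants to run this outside the thread, here is a minimal self-contained sketch of the same idea. The named function `contains`, the wrapper name `v_contains`, the `otypes=[bool]` argument and the use of `.to_numpy()` are my own choices for illustration, not part of the answer. Note that np.vectorize is a convenience wrapper around a Python-level loop, so the gain comes from avoiding apply's per-row overhead rather than from true vectorization.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})


def contains(haystack, needle):
    """Return True if needle is a substring of haystack."""
    return needle in haystack


# otypes=[bool] declares the output dtype up front instead of letting
# NumPy infer it from a trial call on the first element.
v_contains = np.vectorize(contains, otypes=[bool])

# Pass plain NumPy arrays rather than Series (illustrative choice).
result = v_contains(df['A'].to_numpy(), df['B'].to_numpy())
print(result)
```

If the result is needed as a column, `df['B_in_A'] = result` should work, since the array has one entry per row in the original order.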
Here's a timings comparison:
    df = pd.concat([df] * 10000)
    %timeit df.apply(lambda x: x[1] in x[0], axis=1)
    1 loop, best of 3: 1.32 s per loop
    %timeit v(df.A, df.B)
    100 loops, best of 3: 5.55 ms per loop
    # Psidom's answer
    %timeit [b in a for a, b in zip(df.A, df.B)]
    100 loops, best of 3: 3.34 ms per loop
Both are pretty competitive options!
Edit, adding timings for Wen's and Max's answers:
    # Wen's answer
    %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
    10 loops, best of 3: 49.1 ms per loop
    # MaxU's answer
    %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
    10 loops, best of 3: 87.8 ms per loop
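The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be put together with the standard timeit module. The sketch below is only an assumption about how one might set that up: the names `base`, `big`, `candidates` and `timings` are made up for illustration, and the frame is kept small (3,000 rows) so it runs quickly. Absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                     'B': ['some', 'somethin', 'this']})
big = pd.concat([base] * 1000, ignore_index=True)  # 3,000 rows; scale up as needed

v = np.vectorize(lambda x, y: y in x)

candidates = {
    'apply': lambda: big.apply(lambda row: row['B'] in row['A'], axis=1).tolist(),
    'np.vectorize': lambda: v(big['A'].to_numpy(), big['B'].to_numpy()).tolist(),
    'zip': lambda: [b in a for a, b in zip(big['A'], big['B'])],
}

# Check that all candidates agree before comparing their speed.
expected = candidates['zip']()
for name, fn in candidates.items():
    assert fn() == expected, name

# Best of three single runs per candidate, in seconds.
timings = {name: min(timeit.repeat(fn, number=1, repeat=3))
           for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f'{name:>12}: {seconds * 1e3:.2f} ms')
```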
edited Dec 25 '17 at 18:24
answered Dec 25 '17 at 18:00 by coldspeed
This is great, thnx – dimitris_ps Dec 25 '17 at 18:07
@dimitris_ps You're welcome. You actually get a few speed improvements if you 1) pass a user-defined function instead of a lambda, and 2) pass df.*.values instead of df.* to v. – coldspeed Dec 25 '17 at 18:12
Hi, can you test my speed :-) – W-B Dec 25 '17 at 18:22
@Wen Done! I don't know what it's doing, but I like it! – coldspeed Dec 25 '17 at 18:25
This is a small trick by np.nan infection :-) stackoverflow.com/questions/46944650/… – W-B Dec 25 '17 at 18:27
Try zip; it's significantly faster than apply in this case:
    df = pd.concat([df] * 10000)
    df.head()
    #                 A         B
    # 0  some text here      some
    # 1    another text  somethin
    # 2        and this      this
    # 0  some text here      some
    # 1    another text  somethin
    %timeit df.apply(lambda x: x[1] in x[0], axis=1)
    # 1 loop, best of 3: 697 ms per loop
    %timeit [b in a for a, b in zip(df.A, df.B)]
    # 100 loops, best of 3: 3.53 ms per loop
    # @coldspeed's np.vectorize solution
    %timeit v(df.A, df.B)
    # 100 loops, best of 3: 4.18 ms per loop
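As a small sketch of how the zip approach might be used in practice, the snippet below stores the result as a new column and guards against missing values, which would otherwise raise a TypeError inside `in`. The helper `b_in_a` and the column name `B_in_A` are hypothetical, and treating missing cells as "no match" is an assumption about the desired behaviour.

```python
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', None],
                   'B': ['some', 'somethin', 'this']})


def b_in_a(a, b):
    # Hypothetical helper: non-string cells (e.g. missing values) count as "no match".
    return isinstance(a, str) and isinstance(b, str) and b in a


# zip pairs the two columns by position; assigning a list fills the
# new column row by row in that same order.
df['B_in_A'] = [b_in_a(a, b) for a, b in zip(df['A'], df['B'])]
print(df['B_in_A'].tolist())
```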
answered Dec 25 '17 at 18:00 by Psidom
This is great, thnx – dimitris_ps Dec 25 '17 at 18:07
I wish I could accept both answers; I will accept cᴏʟᴅsᴘᴇᴇᴅ's just because he/she has lower rep. Thanks again! – dimitris_ps Dec 25 '17 at 18:19
UPDATE: we can also try numba:
    import numpy as np
    from numba import jit

    # Object mode is requested explicitly: the inputs are object arrays of Python strings.
    @jit(forceobj=True)
    def check_b_in_a(a, b):
        result = np.zeros(len(a), dtype=bool)
        for i in range(len(a)):
            result[i] = b[i] in a[i]
        return result

    In [100]: check_b_in_a(df.A.values, df.B.values)
    Out[100]: array([ True, False,  True], dtype=bool)
Yet another vectorized solution:
    In [50]: df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
    Out[50]:
    0     True
    1    False
    2     True
    dtype: bool
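One thing worth checking before using the split-based expression: it compares B against whole whitespace-separated tokens of A, so it is not the same test as the substring check `b in a` from the question. The two agree on the sample data but can diverge. The sketch below uses made-up data to illustrate this, and the variable names `substring` and `whole_word` are mine.

```python
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text'],
                   'B': ['me te', 'text']})

# Substring test, as in the question: 'me te' occurs inside 'some text here'.
substring = pd.Series([b in a for a, b in zip(df['A'], df['B'])], index=df.index)

# Token test: B must equal one of the whitespace-separated words of A.
whole_word = df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)

print(substring.tolist())
print(whole_word.tolist())
```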
NOTE: the split-based solution is much slower than Psidom's and coldspeed's solutions:
    In [51]: df = pd.concat([df] * 10000)
    # Psidom
    In [52]: %timeit [b in a for a, b in zip(df.A, df.B)]
    7.45 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # cᴏʟᴅsᴘᴇᴇᴅ
    In [53]: %timeit v(df.A, df.B)
    15.4 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # MaxU (1)
    In [54]: %timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(axis=1)
    185 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    # MaxU (2)
    In [103]: %timeit check_b_in_a(df.A.values, df.B.values)
    22.7 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    # Wen
    In [104]: %timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
    134 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
edited Dec 25 '17 at 19:59
answered Dec 25 '17 at 18:22 by MaxU
Look ma, no loops! I like this one too. – coldspeed Dec 25 '17 at 18:26
@cᴏʟᴅsᴘᴇᴇᴅ, well, it's the slowest one ;-) – MaxU Dec 25 '17 at 18:28
Actually mine is slower, by a decade or two. Thnx – dimitris_ps Dec 25 '17 at 18:30
Using replace and nan infection:
    df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
    Out[84]:
    0     True
    1    False
    2     True
    Name: A, dtype: bool
To fix your code:
    df['A'].str.contains('|'.join(df.B.tolist()))
    Out[91]:
    0     True
    1    False
    2     True
    Name: A, dtype: bool
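A caveat that may matter on real data: joining all of B into one pattern asks whether each A contains any value of B, which is not the same as the row-by-row check in the question. The sketch below shows the difference on made-up data where a row of A contains a B value from a different row. The names `row_wise`, `pattern` and `any_match` are illustrative, and `re.escape` is added on the assumption that B holds literal text rather than regular expressions.

```python
import re

import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'some other'],
                   'B': ['some', 'text']})

# Row-wise: is B[i] a substring of A[i]?
row_wise = pd.Series([b in a for a, b in zip(df['A'], df['B'])], index=df.index)

# Joined pattern: does A[i] contain ANY value from column B?
pattern = '|'.join(re.escape(b) for b in df['B'])
any_match = df['A'].str.contains(pattern)

print(row_wise.tolist())
print(any_match.tolist())
```

Here the second row matches under the joined pattern only because 'some' comes from the first row of B.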
edited Dec 25 '17 at 21:06
answered Dec 25 '17 at 18:22 by W-B