Beautiful Soup - Get arguments attributes which contains strings
Suppose we have HTML like below:
<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>
I want to extract the values from the title attribute if it contains Sports, so that in the end we have a variable sports:
sports = ['Football', 'Badminton', 'Ski Jump']
This is what I use:
sports = soup.find_all('span', {'title': 'Sports'})
but I get nothing back.
python html beautifulsoup
asked Nov 23 '18 at 5:45
JON PANTAU
3 Answers
You can use re.compile with BeautifulSoup to find all span tags if the first part of the title attribute is "Sports":
import re
from bs4 import BeautifulSoup as soup

content = """
<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>
"""
d = soup(content, 'html.parser')
# Keep only spans whose title starts with "Sports" followed by whitespace.
results = [i.text for i in d.find_all('span', {'title': re.compile(r'^Sports\s')})]
Output:
['Football', 'Tennis', 'Ski Jump']
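Note that the question asks for the names stored in the title attribute (the second span's text is Tennis while its title says Badminton), so reading .text does not quite give the requested list. A minimal sketch of one way to get it, assuming every wanted title has the form "Sports <name>"; the extra "News Weather" span is a made-up row added only to show that non-matching titles are skipped:

```python
import re
from bs4 import BeautifulSoup

html = """
<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>
<span title="News Weather">Weather</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only spans whose title starts with "Sports" followed by whitespace.
prefix = re.compile(r"^Sports\s+")
spans = soup.find_all("span", title=prefix)

# Strip the matched prefix from each title to keep only the sport name.
sports = [prefix.sub("", span["title"]) for span in spans]
print(sports)
```

With the sample markup this yields the list from the question rather than the span texts.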
You are getting nothing because no span has a title that is exactly Sports; a plain string filter is compared against the whole attribute value and does not work like a wildcard. If you want to get the attribute value of title, you can use get(attr_name) on each tag object that you get using find_all.
from bs4 import BeautifulSoup
html = '''<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>'''
soup = BeautifulSoup(html,"lxml")
title = [s.get('title') for s in soup.find_all('span')]
title
>> ['Sports Football', 'Sports Badminton', 'Sports Ski Jump']
In addition to that, if you only require the text of each element, just use the .text attribute on each tag object from find_all.
sports = [s.text for s in soup.find_all('span')]
sports
>>['Football', 'Tennis', 'Ski Jump']
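To make the exact-match behaviour concrete, here is a small sketch that compares the question's string filter with a function filter. The helper name has_sports is hypothetical, and "html.parser" is used instead of "lxml" only so that no extra package is needed:

```python
from bs4 import BeautifulSoup

html = '''<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>'''

soup = BeautifulSoup(html, "html.parser")

# A plain string is compared against the whole title value, so nothing matches.
exact = soup.find_all("span", {"title": "Sports"})
print(len(exact))


def has_sports(value):
    """Return True when the title contains the standalone word Sports."""
    # Beautiful Soup passes None when the tag has no title attribute.
    return value is not None and "Sports" in value.split()


# A function filter is called with each tag's title value.
matches = soup.find_all("span", title=has_sports)
titles = [span.get("title") for span in matches]
print(titles)
```

A function filter like this avoids regular expressions when a simple word check is enough.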
title = [s.get('title') for s in soup.find_all('span') if re.findall('(?<![a-zA-Z])Sports(?![a-zA-Z])', s.get('title'))]
What about adding regular expressions? I think it should be able to extract those containing Sports.
– Cua
Nov 23 '18 at 7:01
I think regex is overkill just to check if the word Sports exists. By the way, at that point it is up to the OP's intention which elements he wants to extract.
– BernardL
Nov 23 '18 at 14:11
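As a middle ground between the regex and a manual check, CSS attribute selectors can express "title contains the word Sports" directly. This is only a sketch: it assumes a Beautiful Soup version whose select() supports these selectors, and the "eSports League" span is an invented row that shows the difference between word matching (~=) and substring matching (*=):

```python
from bs4 import BeautifulSoup

html = '''<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>
<span title="eSports League">League</span>'''

soup = BeautifulSoup(html, "html.parser")

# [title~="Sports"] matches when Sports is one of the whitespace-separated words.
word_matches = soup.select('span[title~="Sports"]')
word_titles = [span["title"] for span in word_matches]
print(word_titles)

# [title*="Sports"] matches any title that contains the substring Sports.
substring_matches = soup.select('span[title*="Sports"]')
print(len(substring_matches))
```

The word form behaves much like the lookaround regex in the comment above, while the substring form also picks up titles such as "eSports League".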
Maybe the example you gave was just made up off the top of your head but the contents of your spans match what you are looking for exactly - so in that example you could work around by going:
sports = soup.find_all('span', {'title': 'Sports'}).contents
and that will give you the string versions of what you're looking for.
That will fail, soup.find_all returns a list, not a tag object.
– BernardL
Nov 23 '18 at 6:06
A list comprehension like [i.text for i in sports] might give a proper solution if your find_all works.
– SmashGuy
Nov 23 '18 at 6:08
Yeah, sorry, I was leaving the reader to turn the list into the list he wants at the top. I should've been more clear.
– Matthew Sciamanna
Nov 23 '18 at 23:43
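The point raised in these comments can be checked with a short sketch. It assumes the same sample markup and simply calls find_all('span') so that the result is not empty:

```python
from bs4 import BeautifulSoup

html = '''<span title="Sports Football">Football</span>
<span title="Sports Badminton">Tennis</span>
<span title="Sports Ski Jump">Ski Jump</span>'''

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span")

# find_all returns a list-like ResultSet, which has no .contents attribute.
try:
    spans.contents
    failed = False
except AttributeError:
    failed = True
print(failed)

# Read .text from each Tag in the list instead.
texts = [span.text for span in spans]
print(texts)
```

Accessing .contents or .text per element, as in the list comprehension, is what makes the approach work.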
answered Nov 23 '18 at 16:06 by Ajax1234 (the re.compile answer)
answered Nov 23 '18 at 6:03, edited Nov 23 '18 at 6:08 by BernardL (the get('title') and .text answer)
answered Nov 23 '18 at 6:03 by Matthew Sciamanna, edited Nov 23 '18 at 6:55 by SmashGuy (the .contents answer)