How to find combination of tag and text in BeautifulSoup











I've scraped HTML from a website and need to extract a particular tag from it. The problem is that the markup is formatted in a confusing way, and I cannot get the entire tag. Let me illustrate:



data = """
<div class="Answer">
1. BOUNDARIES - EPB &amp; APL&nbsp;<i>(inferior)</i>, EPL&nbsp;<i>(superior).&nbsp;</i><div>2. FLOOR (proximal to distal) - radial styloid =&gt; scaphoid =&gt; trapezium =&gt; 1st MC base.&nbsp;<br /><div>3. CONTENTS - cutaneous branches of radial nerve&nbsp;<i>(on the roof),</i>&nbsp;cephalic vein&nbsp;<i>(begins here),</i>&nbsp;&nbsp;radial artery&nbsp;<i>(on the floor).</i></div></div><div><br /></div><div><img src="paste-27a44c801f0776d91f5f6a16a963bff67f0e8ef3.jpg" /><br /></div><div><b>Image:&nbsp;</b>Case courtesy of Dr Sachintha Hapugoda, &lt;a href="https://radiopaedia.org/"&gt;Radiopaedia.org&lt;/a&gt;. From the case &lt;a href="https://radiopaedia.org/cases/52525"&gt;rID: 52525&lt;/a&gt; [Accessed 15 Nov. 2018].</div>
</div>
"""


From the above, I wish to obtain only this:



<div><b>Image:&nbsp;</b>Case courtesy of Dr Sachintha Hapugoda, &lt;a href="https://radiopaedia.org/"&gt;Radiopaedia.org&lt;/a&gt;. From the case &lt;a href="https://radiopaedia.org/cases/52525"&gt;rID: 52525&lt;/a&gt; [Accessed 15 Nov. 2018].</div>


I wrote the following code:



from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
image_link = soup.find('div').find('b').next.next
print(image_link)


But it only gets me the text:



Case courtesy of Dr Sachintha Hapugoda, <a href="https://radiopaedia.org/">Radiopaedia.org</a>. From the case <a href="https://radiopaedia.org/cases/52525">rID: 52525</a> [Accessed 15 Nov. 2018].


How do I get the entire tag?










Tags: python, beautifulsoup

asked Nov 21 at 18:32 by Code Monkey
























1 Answer (accepted)










Try navigating from the img tag instead of the b tag. The element after img is the <br />, and the element after that is the div you want:

image_link = soup.find('div').find('img').next.next

Output:

<div><b>Image: </b>Case courtesy of Dr Sachintha Hapugoda, &lt;a href="https://radiopaedia.org/"&gt;Radiopaedia.org&lt;/a&gt;. From the case &lt;a href="https://radiopaedia.org/cases/52525"&gt;rID: 52525&lt;/a&gt; [Accessed 15 Nov. 2018].</div>
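
The .next.next chain depends on the exact order of elements in the markup, so it may stop working if the page layout changes. A possibly sturdier sketch, assuming the target is always the div that directly contains a <b> label starting with "Image:", is to locate that label and take its parent. The data string below is a shortened stand-in for the HTML in the question, and the names label and image_div are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (assumption: same structure).
data = """
<div class="Answer">
1. BOUNDARIES - EPB &amp; APL&nbsp;<i>(inferior)</i>
<div><img src="paste.jpg" /><br /></div>
<div><b>Image:&nbsp;</b>Case courtesy of Dr Sachintha Hapugoda, &lt;a href="https://radiopaedia.org/"&gt;Radiopaedia.org&lt;/a&gt;.</div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Find the first <b> whose text starts with "Image:" (None if there is no such tag).
label = next(
    (b for b in soup.find_all("b") if b.get_text().startswith("Image:")),
    None,
)

# The tag we want is the element that directly contains that label.
image_div = label.parent if label is not None else None

print(image_div)
```

Printing image_div shows the whole <div>...</div> element, while image_div.get_text() would give only the text inside it.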





answered Nov 21 at 18:40 by l'L'l





























