Extracting data from XML tree into pandas/csv with Python
I have an issue with some XML files.
I cannot say a lot about data, because it is for work and I don't want to be in trouble! From a huge XML file, 123091 lines of code, I only need data from 7 tags(if that makes sense).
I am trying to extract that specific data, but I am having a bit of a situation when trying to store into pandas or csv.
I have found a method to take some information out, like:
for info in root.iter('ArtistName'):
print(info.text)
The code above will give me the artists in the data from that XML tag. Here is a little part of my Jupyter Notebook, with the output of the above lines of code:
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Cream
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Fleetwood Mac
Fleetwood Mac
Linkin Park
Lynyrd Skynyrd
Fleetwood Mac
Eric Clapton
The Black Keys
Tegan And Sara
And then, I have run into the problem, because in the below code, I cannot or better said, I don't know how to loop over each tag from XML to extract the data. Below is an attempt:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("filename.xml")
root = tree.getroot()
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for i in root.iter(tree):
df_xml = df_xml.append(pd.Series(index=dfcols), ignore_index=True)
df_xml.head()
The result of the above code is:
IRC IRC2 Artist Song Units PPD TerritoryCode
Which is the header of the file that I want to create.
I cannot find a way to bring the information I need into these columns.
I have also tried this:
def getValOfNode(node):
return node.text if node is not None else None
def main():
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for node in tree:
IRC = node.find('IRC')
IRC2 = node.find('ICPN')
Artist = node.find('rtistName')
Song = node.find('Title')
Units = node.find('ConsumerSales')
PPD = node.find('Amount')
TerritoryCode = node.find('TerritoryCode')
df_xml = df_xml.append(
pd.Series([getValOfNode(IRC), getValOfNode(IRC2), getValOfNode(Artist), getValOfNode(Song), getValOfNode(Units), getValOfNode(PPD), getValOfNode(TerritoryCode)], index=dfcols), ignore_index=True)
print(df_xml)
main()
And I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-1f816143f9e4> in <module>()
23
24
---> 25 main()
<ipython-input-5-1f816143f9e4> in main()
8 df_xml = pd.DataFrame(columns = dfcols)
9
---> 10 for node in tree:
11 IRC = node.find('IRC')
12 IRC2 = node.find('ICPN')
TypeError: 'ElementTree' object is not iterable
There is also an issue with the territory code, when I run:
for info in root.iter('TerritoryCode'):
print(info.text)
it prints the territories but, in order, because they are duplicates(I don't know how to explain), I really need all of them and not just one of each. If that makes sense.
This is what I get:
AE
AR
AT
AU
AW
BE
BG
BO
BR
BY
CA
CH
CL
CN
CO
CR
CY
CZ
DE
DK
DO
DZ
EC
EE
EG
ES
FI
FR
GB
GL
GR
GT
HK
HN
This is what I need:
AD
AD
AE
AE
AE
AE
AE
AE,
and so forth.
Can anyone help me with this?
Much appreciated.
Have a great day :)
python xml pandas export-to-csv
|
show 2 more comments
I have an issue with some XML files.
I cannot say a lot about data, because it is for work and I don't want to be in trouble! From a huge XML file, 123091 lines of code, I only need data from 7 tags(if that makes sense).
I am trying to extract that specific data, but I am having a bit of a situation when trying to store into pandas or csv.
I have found a method to take some information out, like:
for info in root.iter('ArtistName'):
print(info.text)
The code above will give me the artists in the data from that XML tag. Here is a little part of my Jupyter Notebook, with the output of the above lines of code:
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Cream
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Fleetwood Mac
Fleetwood Mac
Linkin Park
Lynyrd Skynyrd
Fleetwood Mac
Eric Clapton
The Black Keys
Tegan And Sara
And then, I have run into the problem, because in the below code, I cannot or better said, I don't know how to loop over each tag from XML to extract the data. Below is an attempt:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("filename.xml")
root = tree.getroot()
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for i in root.iter(tree):
df_xml = df_xml.append(pd.Series(index=dfcols), ignore_index=True)
df_xml.head()
The result of the above code is:
IRC IRC2 Artist Song Units PPD TerritoryCode
Which is the header of the file that I want to create.
I cannot find a way to bring the information I need into these columns.
I have also tried this:
def getValOfNode(node):
return node.text if node is not None else None
def main():
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for node in tree:
IRC = node.find('IRC')
IRC2 = node.find('ICPN')
Artist = node.find('rtistName')
Song = node.find('Title')
Units = node.find('ConsumerSales')
PPD = node.find('Amount')
TerritoryCode = node.find('TerritoryCode')
df_xml = df_xml.append(
pd.Series([getValOfNode(IRC), getValOfNode(IRC2), getValOfNode(Artist), getValOfNode(Song), getValOfNode(Units), getValOfNode(PPD), getValOfNode(TerritoryCode)], index=dfcols), ignore_index=True)
print(df_xml)
main()
And I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-1f816143f9e4> in <module>()
23
24
---> 25 main()
<ipython-input-5-1f816143f9e4> in main()
8 df_xml = pd.DataFrame(columns = dfcols)
9
---> 10 for node in tree:
11 IRC = node.find('IRC')
12 IRC2 = node.find('ICPN')
TypeError: 'ElementTree' object is not iterable
There is also an issue with the territory code, when I run:
for info in root.iter('TerritoryCode'):
print(info.text)
it prints the territories but, in order, because they are duplicates(I don't know how to explain), I really need all of them and not just one of each. If that makes sense.
This is what I get:
AE
AR
AT
AU
AW
BE
BG
BO
BR
BY
CA
CH
CL
CN
CO
CR
CY
CZ
DE
DK
DO
DZ
EC
EE
EG
ES
FI
FR
GB
GL
GR
GT
HK
HN
This is what I need:
AD
AD
AE
AE
AE
AE
AE
AE,
and so forth.
Can anyone help me with this?
Much appreciated.
Have a great day :)
python xml pandas export-to-csv
Can you provide at least a dummy example of XML as we need to see structure to help? By the way: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 22 at 18:47
Hi @Parfait, I will provide this on Monday as I am back at work. Thank you :)
– I Russu
Nov 25 at 1:16
@Parfait , I have edited the post with XML code (it is just one part of the file), the entire file looks like this. I said what I need - in the code, hope it helps! :) Thank you
– I Russu
Nov 26 at 8:46
Thanks for being so nice, and here is a link gist.github.com/ruioana/99b666f5192f47d0f016bb9154ca536b , I hope it makes sense now.
– I Russu
Nov 27 at 15:26
Thanks for post. However, your XML is not valid as the namespace prefix in root dsr is not defined.
– Parfait
Nov 27 at 15:30
|
show 2 more comments
I have an issue with some XML files.
I cannot say a lot about data, because it is for work and I don't want to be in trouble! From a huge XML file, 123091 lines of code, I only need data from 7 tags(if that makes sense).
I am trying to extract that specific data, but I am having a bit of a situation when trying to store into pandas or csv.
I have found a method to take some information out, like:
for info in root.iter('ArtistName'):
print(info.text)
The code above will give me the artists in the data from that XML tag. Here is a little part of my Jupyter Notebook, with the output of the above lines of code:
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Cream
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Fleetwood Mac
Fleetwood Mac
Linkin Park
Lynyrd Skynyrd
Fleetwood Mac
Eric Clapton
The Black Keys
Tegan And Sara
And then, I have run into the problem, because in the below code, I cannot or better said, I don't know how to loop over each tag from XML to extract the data. Below is an attempt:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("filename.xml")
root = tree.getroot()
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for i in root.iter(tree):
df_xml = df_xml.append(pd.Series(index=dfcols), ignore_index=True)
df_xml.head()
The result of the above code is:
IRC IRC2 Artist Song Units PPD TerritoryCode
Which is the header of the file that I want to create.
I cannot find a way to bring the information I need into these columns.
I have also tried this:
def getValOfNode(node):
return node.text if node is not None else None
def main():
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for node in tree:
IRC = node.find('IRC')
IRC2 = node.find('ICPN')
Artist = node.find('rtistName')
Song = node.find('Title')
Units = node.find('ConsumerSales')
PPD = node.find('Amount')
TerritoryCode = node.find('TerritoryCode')
df_xml = df_xml.append(
pd.Series([getValOfNode(IRC), getValOfNode(IRC2), getValOfNode(Artist), getValOfNode(Song), getValOfNode(Units), getValOfNode(PPD), getValOfNode(TerritoryCode)], index=dfcols), ignore_index=True)
print(df_xml)
main()
And I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-1f816143f9e4> in <module>()
23
24
---> 25 main()
<ipython-input-5-1f816143f9e4> in main()
8 df_xml = pd.DataFrame(columns = dfcols)
9
---> 10 for node in tree:
11 IRC = node.find('IRC')
12 IRC2 = node.find('ICPN')
TypeError: 'ElementTree' object is not iterable
There is also an issue with the territory code, when I run:
for info in root.iter('TerritoryCode'):
print(info.text)
it prints the territories but, in order, because they are duplicates(I don't know how to explain), I really need all of them and not just one of each. If that makes sense.
This is what I get:
AE
AR
AT
AU
AW
BE
BG
BO
BR
BY
CA
CH
CL
CN
CO
CR
CY
CZ
DE
DK
DO
DZ
EC
EE
EG
ES
FI
FR
GB
GL
GR
GT
HK
HN
This is what I need:
AD
AD
AE
AE
AE
AE
AE
AE,
and so forth.
Can anyone help me with this?
Much appreciated.
Have a great day :)
python xml pandas export-to-csv
I have an issue with some XML files.
I cannot say a lot about data, because it is for work and I don't want to be in trouble! From a huge XML file, 123091 lines of code, I only need data from 7 tags(if that makes sense).
I am trying to extract that specific data, but I am having a bit of a situation when trying to store into pandas or csv.
I have found a method to take some information out, like:
for info in root.iter('ArtistName'):
print(info.text)
The code above will give me the artists in the data from that XML tag. Here is a little part of my Jupyter Notebook, with the output of the above lines of code:
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Cream
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Fleetwood Mac
Fleetwood Mac
Linkin Park
Lynyrd Skynyrd
Fleetwood Mac
Eric Clapton
The Black Keys
Tegan And Sara
And then, I have run into the problem, because in the below code, I cannot or better said, I don't know how to loop over each tag from XML to extract the data. Below is an attempt:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("filename.xml")
root = tree.getroot()
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for i in root.iter(tree):
df_xml = df_xml.append(pd.Series(index=dfcols), ignore_index=True)
df_xml.head()
The result of the above code is:
IRC IRC2 Artist Song Units PPD TerritoryCode
Which is the header of the file that I want to create.
I cannot find a way to bring the information I need into these columns.
I have also tried this:
def getValOfNode(node):
return node.text if node is not None else None
def main():
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)
for node in tree:
IRC = node.find('IRC')
IRC2 = node.find('ICPN')
Artist = node.find('rtistName')
Song = node.find('Title')
Units = node.find('ConsumerSales')
PPD = node.find('Amount')
TerritoryCode = node.find('TerritoryCode')
df_xml = df_xml.append(
pd.Series([getValOfNode(IRC), getValOfNode(IRC2), getValOfNode(Artist), getValOfNode(Song), getValOfNode(Units), getValOfNode(PPD), getValOfNode(TerritoryCode)], index=dfcols), ignore_index=True)
print(df_xml)
main()
And I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-1f816143f9e4> in <module>()
23
24
---> 25 main()
<ipython-input-5-1f816143f9e4> in main()
8 df_xml = pd.DataFrame(columns = dfcols)
9
---> 10 for node in tree:
11 IRC = node.find('IRC')
12 IRC2 = node.find('ICPN')
TypeError: 'ElementTree' object is not iterable
There is also an issue with the territory code, when I run:
for info in root.iter('TerritoryCode'):
print(info.text)
it prints the territories but, in order, because they are duplicates(I don't know how to explain), I really need all of them and not just one of each. If that makes sense.
This is what I get:
AE
AR
AT
AU
AW
BE
BG
BO
BR
BY
CA
CH
CL
CN
CO
CR
CY
CZ
DE
DK
DO
DZ
EC
EE
EG
ES
FI
FR
GB
GL
GR
GT
HK
HN
This is what I need:
AD
AD
AE
AE
AE
AE
AE
AE,
and so forth.
Can anyone help me with this?
Much appreciated.
Have a great day :)
python xml pandas export-to-csv
python xml pandas export-to-csv
edited Nov 28 at 15:44
asked Nov 22 at 9:40
I Russu
516
516
Can you provide at least a dummy example of XML as we need to see structure to help? By the way: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 22 at 18:47
Hi @Parfait, I will provide this on Monday as I am back at work. Thank you :)
– I Russu
Nov 25 at 1:16
@Parfait , I have edited the post with XML code (it is just one part of the file), the entire file looks like this. I said what I need - in the code, hope it helps! :) Thank you
– I Russu
Nov 26 at 8:46
Thanks for being so nice, and here is a link gist.github.com/ruioana/99b666f5192f47d0f016bb9154ca536b , I hope it makes sense now.
– I Russu
Nov 27 at 15:26
Thanks for post. However, your XML is not valid as the namespace prefix in root dsr is not defined.
– Parfait
Nov 27 at 15:30
|
show 2 more comments
Can you provide at least a dummy example of XML as we need to see structure to help? By the way: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 22 at 18:47
Hi @Parfait, I will provide this on Monday as I am back at work. Thank you :)
– I Russu
Nov 25 at 1:16
@Parfait , I have edited the post with XML code (it is just one part of the file), the entire file looks like this. I said what I need - in the code, hope it helps! :) Thank you
– I Russu
Nov 26 at 8:46
Thanks for being so nice, and here is a link gist.github.com/ruioana/99b666f5192f47d0f016bb9154ca536b , I hope it makes sense now.
– I Russu
Nov 27 at 15:26
Thanks for post. However, your XML is not valid as the namespace prefix in root dsr is not defined.
– Parfait
Nov 27 at 15:30
Can you provide at least a dummy example of XML as we need to see structure to help? By the way: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 22 at 18:47
Can you provide at least a dummy example of XML as we need to see structure to help? By the way: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 22 at 18:47
Hi @Parfait, I will provide this on Monday as I am back at work. Thank you :)
– I Russu
Nov 25 at 1:16
Hi @Parfait, I will provide this on Monday as I am back at work. Thank you :)
– I Russu
Nov 25 at 1:16
@Parfait , I have edited the post with XML code (it is just one part of the file), the entire file looks like this. I said what I need - in the code, hope it helps! :) Thank you
– I Russu
Nov 26 at 8:46
@Parfait , I have edited the post with XML code (it is just one part of the file), the entire file looks like this. I said what I need - in the code, hope it helps! :) Thank you
– I Russu
Nov 26 at 8:46
Thanks for being so nice, and here is a link gist.github.com/ruioana/99b666f5192f47d0f016bb9154ca536b , I hope it makes sense now.
– I Russu
Nov 27 at 15:26
Thanks for being so nice, and here is a link gist.github.com/ruioana/99b666f5192f47d0f016bb9154ca536b , I hope it makes sense now.
– I Russu
Nov 27 at 15:26
Thanks for post. However, your XML is not valid as the namespace prefix in root dsr is not defined.
– Parfait
Nov 27 at 15:30
Thanks for post. However, your XML is not valid as the namespace prefix in root dsr is not defined.
– Parfait
Nov 27 at 15:30
|
show 2 more comments
1 Answer
1
active
oldest
votes
As mentioned, your needed nodes are at different levels of the XML and hence path expressions will be different for each data item. Additionally you need to traverse between two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.
Therefore, consider parsing in nested for
loops. And rather than growing a data frame inside a loop, build a list of dictionaries that you can pass into pandas' DataFrame()
constructor outside of the loop. With this approach, you migrate dictionary keys as columns and elements as data.
Below uses chained find()
calls, long relative, or short absolute paths to navigate down the nested levels and retrieve corresponding element text values. Notice all parsing are relative to looped nodes with parent terr
and child rls
objects.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
data =
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):
for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
inner = {}
# DESCENDANTS
inner['IRC'] = rls.find('./ReleaseId/ISRC').text
inner['IRC2'] = rls.find('./ReleaseId/ICPN').text
# CHILDREN
inner['Artist'] = rls.find('WMGArtistName').text
inner['Song'] = rls.find('WMGTitle').text
# DESCENDANTS
inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text
inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text
# PARENT
inner['TerritoryCode'] = terr.find('./TerritoryCode').text
data.append(inner)
df = pd.DataFrame(data)
You can shorten the find()
chains and long relative paths with absolute paths using .//
:
inner['IRC'] = rls.find('.//ISRC').text
inner['IRC2'] = rls.find('.//ICPN').text
inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas likepython
xml
else a less impatient user can downvote and close you out.
– Parfait
Nov 28 at 19:52
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
|
show 2 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53427905%2fextracting-data-from-xml-tree-into-pandas-csv-with-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
As mentioned, your needed nodes are at different levels of the XML and hence path expressions will be different for each data item. Additionally you need to traverse between two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.
Therefore, consider parsing in nested for
loops. And rather than growing a data frame inside a loop, build a list of dictionaries that you can pass into pandas' DataFrame()
constructor outside of the loop. With this approach, you migrate dictionary keys as columns and elements as data.
Below uses chained find()
calls, long relative, or short absolute paths to navigate down the nested levels and retrieve corresponding element text values. Notice all parsing are relative to looped nodes with parent terr
and child rls
objects.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
data =
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):
for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
inner = {}
# DESCENDANTS
inner['IRC'] = rls.find('./ReleaseId/ISRC').text
inner['IRC2'] = rls.find('./ReleaseId/ICPN').text
# CHILDREN
inner['Artist'] = rls.find('WMGArtistName').text
inner['Song'] = rls.find('WMGTitle').text
# DESCENDANTS
inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text
inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text
# PARENT
inner['TerritoryCode'] = terr.find('./TerritoryCode').text
data.append(inner)
df = pd.DataFrame(data)
You can shorten the find()
chains and long relative paths with absolute paths using .//
:
inner['IRC'] = rls.find('.//ISRC').text
inner['IRC2'] = rls.find('.//ICPN').text
inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas likepython
xml
else a less impatient user can downvote and close you out.
– Parfait
Nov 28 at 19:52
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
|
show 2 more comments
As mentioned, your needed nodes are at different levels of the XML and hence path expressions will be different for each data item. Additionally you need to traverse between two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.
Therefore, consider parsing in nested for
loops. And rather than growing a data frame inside a loop, build a list of dictionaries that you can pass into pandas' DataFrame()
constructor outside of the loop. With this approach, you migrate dictionary keys as columns and elements as data.
Below uses chained find()
calls, long relative, or short absolute paths to navigate down the nested levels and retrieve corresponding element text values. Notice all parsing are relative to looped nodes with parent terr
and child rls
objects.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
data =
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):
for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
inner = {}
# DESCENDANTS
inner['IRC'] = rls.find('./ReleaseId/ISRC').text
inner['IRC2'] = rls.find('./ReleaseId/ICPN').text
# CHILDREN
inner['Artist'] = rls.find('WMGArtistName').text
inner['Song'] = rls.find('WMGTitle').text
# DESCENDANTS
inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text
inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text
# PARENT
inner['TerritoryCode'] = terr.find('./TerritoryCode').text
data.append(inner)
df = pd.DataFrame(data)
You can shorten the find()
chains and long relative paths with absolute paths using .//
:
inner['IRC'] = rls.find('.//ISRC').text
inner['IRC2'] = rls.find('.//ICPN').text
inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas likepython
xml
else a less impatient user can downvote and close you out.
– Parfait
Nov 28 at 19:52
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
|
show 2 more comments
As mentioned, your needed nodes are at different levels of the XML and hence path expressions will be different for each data item. Additionally you need to traverse between two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.
Therefore, consider parsing in nested for
loops. And rather than growing a data frame inside a loop, build a list of dictionaries that you can pass into pandas' DataFrame()
constructor outside of the loop. With this approach, you migrate dictionary keys as columns and elements as data.
Below uses chained find()
calls, long relative, or short absolute paths to navigate down the nested levels and retrieve corresponding element text values. Notice all parsing are relative to looped nodes with parent terr
and child rls
objects.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
data =
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):
for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
inner = {}
# DESCENDANTS
inner['IRC'] = rls.find('./ReleaseId/ISRC').text
inner['IRC2'] = rls.find('./ReleaseId/ICPN').text
# CHILDREN
inner['Artist'] = rls.find('WMGArtistName').text
inner['Song'] = rls.find('WMGTitle').text
# DESCENDANTS
inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text
inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text
# PARENT
inner['TerritoryCode'] = terr.find('./TerritoryCode').text
data.append(inner)
df = pd.DataFrame(data)
You can shorten the find()
chains and long relative paths with absolute paths using .//
:
inner['IRC'] = rls.find('.//ISRC').text
inner['IRC2'] = rls.find('.//ICPN').text
inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
As mentioned, your needed nodes are at different levels of the XML and hence path expressions will be different for each data item. Additionally you need to traverse between two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.
Therefore, consider parsing in nested for
loops. And rather than growing a data frame inside a loop, build a list of dictionaries that you can pass into pandas' DataFrame()
constructor outside of the loop. With this approach, you migrate dictionary keys as columns and elements as data.
Below uses chained find()
calls, long relative, or short absolute paths to navigate down the nested levels and retrieve corresponding element text values. Notice all parsing are relative to looped nodes with parent terr
and child rls
objects.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
data =
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):
for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
inner = {}
# DESCENDANTS
inner['IRC'] = rls.find('./ReleaseId/ISRC').text
inner['IRC2'] = rls.find('./ReleaseId/ICPN').text
# CHILDREN
inner['Artist'] = rls.find('WMGArtistName').text
inner['Song'] = rls.find('WMGTitle').text
# DESCENDANTS
inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text
inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text
# PARENT
inner['TerritoryCode'] = terr.find('./TerritoryCode').text
data.append(inner)
df = pd.DataFrame(data)
You can shorten the find()
chains and long relative paths with absolute paths using .//
:
inner['IRC'] = rls.find('.//ISRC').text
inner['IRC2'] = rls.find('.//ICPN').text
inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
edited Nov 29 at 15:48
answered Nov 27 at 21:44
Parfait
49.2k84168
49.2k84168
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas likepython
xml
else a less impatient user can downvote and close you out.
– Parfait
Nov 28 at 19:52
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
|
show 2 more comments
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas likepython
xml
else a less impatient user can downvote and close you out.
– Parfait
Nov 28 at 19:52
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I would like to thank you for helping me and not closing down this question from the first look. You have been the most helpful person on StackOverflow! :) Can I cast a vote for the most helpful user?Hihi, thank you so much!
– I Russu
Nov 28 at 8:39
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
I have an issue with the TerritoryCode - it prints the same one, and I needed each one in the file but, it works perfectly otherwise. Many thanks :)
– I Russu
Nov 28 at 12:06
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas like
python
xml
else a less impatient user can downvote and close you out.– Parfait
Nov 28 at 19:52
Haha...thanks! Next time before posting, always look at other highly upvoted posts especially in tag areas like
python
xml
else a less impatient user can downvote and close you out.– Parfait
Nov 28 at 19:52
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
Your Github gist XML abbreviates middle sections. Once again, which node repeats? I only see one non-missing TerritoryCode. I assumed it was an ancestor item that is static (non-repeating) in entire XML document.
– Parfait
Nov 28 at 19:54
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
I have edited the GitHub gist - with the whole data for 1st territory, it goes like that till the end. I do hope it makes sense more now. Yes, I have looked everywhere on StackOverflow, for the same issue, I have found few, but it did not worked, that was one of the reason I have posted the question. Thank you :)
– I Russu
Nov 29 at 9:14
|
show 2 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53427905%2fextracting-data-from-xml-tree-into-pandas-csv-with-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Can you provide at least a dummy example of XML as we need to see structure to help? By the way: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 22 at 18:47
Hi @Parfait, I will provide this on Monday as I am back at work. Thank you :)
– I Russu
Nov 25 at 1:16
@Parfait , I have edited the post with XML code (it is just one part of the file), the entire file looks like this. I said what I need - in the code, hope it helps! :) Thank you
– I Russu
Nov 26 at 8:46
Thanks for being so nice, and here is a link gist.github.com/ruioana/99b666f5192f47d0f016bb9154ca536b , I hope it makes sense now.
– I Russu
Nov 27 at 15:26
Thanks for post. However, your XML is not valid as the namespace prefix in root dsr is not defined.
– Parfait
Nov 27 at 15:30