document_convert throwing error (The content of elements must consist of well-formed character data or markup.)

Question by GlassCaseOfEmotion (3) | Mar 17, 2017 at 06:54 PM | Tags: watson, sdk, document-conversion

I'm using the example document_conversion_v1.py to convert a PDF to normalized HTML. It should be super simple, but it's throwing a 400 error code.

I tested the PDF and it works via the Watson demo, so I know it's valid.

watson_developer_cloud.watson_developer_cloud_service.WatsonException: Error: The content of elements must consist of well-formed character data or markup., Code: 400

I also tested a valid .DOC file, and it threw a 400 error code as well.

watson_developer_cloud.watson_developer_cloud_service.WatsonException: Error: Element type "p" must be followed by either attribute specifications, ">" or "/>"., Code: 400

Any help would be greatly appreciated. Here is the salient code snippet.

 with open(file_to_conv, 'r') as document:
     config = {'conversion_target': DocumentConversionV1.NORMALIZED_HTML}
     print(document_conversion.convert_document(
         document=document, config=config, media_type='text/html').content)

3 answers

Accepted answer

Answer by Antoni Prevosti-Vives (592) | Mar 20, 2017 at 10:08 AM

If you're converting a PDF, the media_type should be application/pdf instead of text/html:

 document_conversion.convert_document( document=document, 
                                       config=config, 
                                       media_type='application/pdf')


As you can find in the API reference, the supported media types are:

  • text/html

  • text/xhtml+xml

  • application/pdf

  • application/msword

  • application/vnd.openxmlformats-officedocument.wordprocessingml.document

You can also choose not to set the media_type and let the service guess it for you, but if you know it, it is safer to set it.
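
To illustrate the point about setting media_type when you know it, here is a minimal sketch that picks one of the media types listed above from the file extension and leaves media_type out when the extension is not recognized. The helper names `guess_media_type` and `convert_kwargs` are hypothetical, and the extension-to-type pairing is my own assumption, not something the SDK provides.

```python
import os

# Media types from the list above, keyed by file extension.
# The pairing of extensions to types is an assumption made for illustration.
MEDIA_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".xhtml": "text/xhtml+xml",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document",
}


def guess_media_type(path):
    """Return the media type for a path, or None if the extension is unknown."""
    ext = os.path.splitext(path)[1].lower()
    return MEDIA_TYPES.get(ext)


def convert_kwargs(path, config):
    """Build keyword arguments for convert_document (hypothetical helper).

    media_type is only included when the extension is recognized, so the
    service is left to guess the type in every other case.
    """
    kwargs = {"config": config}
    media_type = guess_media_type(path)
    if media_type is not None:
        kwargs["media_type"] = media_type
    return kwargs
```

With this in place, the call from the question would become something like `document_conversion.convert_document(document=document, **convert_kwargs(file_to_conv, config))`. On Python 3 you would probably also want to open the file with 'rb' rather than 'r', since PDF and Word files are binary; that part is my assumption and is not taken from the API reference.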


Answer by @chughts (8573) | Mar 20, 2017 at 08:02 AM

Your reference to the "example document_conversion_v1.py" threw me, but it looks like you meant the pip-installable watson-developer-cloud module, which includes document_conversion_v1.py.

So it is not an example; it's a library that you install and invoke. How to call it is documented in the API reference at https://www.ibm.com/watson/developercloud/document-conversion/api/v1/?python#convert-document

 with open('sample-docx.docx', 'r') as document:
     response = document_conversion.convert_document(document=document, config=config)
     print(json.dumps(response, indent=2))
     # ... process the response however you want

which matches your code. (In future, please format your code as code; it makes it easier to read, and hence to decipher and debug. If you can't be bothered to format your code, most of us will not be bothered to spend the time and effort it takes to read it.)

 with open(file_to_conv, 'r') as document: 
     config = {'conversion_target': DocumentConversionV1.NORMALIZED_HTML} 
     print(document_conversion.convert_document( document=document, config=config, 
                               media_type='text/html').content)

I think you have spotted a bug in the Python SDK, as neither cURL nor the Node.js SDK expects media_type as input. Try running your code without media_type specified. The conversion_target config option should suffice.


Answer by GlassCaseOfEmotion (3) | Mar 20, 2017 at 10:52 PM

@chughts and @KYC0_Antoni_Prevosti-Vives, thanks for your answers. Both removing media_type and specifying media_type='application/pdf' worked.
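
Since the 400 errors came from sending a PDF and a .DOC file while declaring text/html, a quick look at the first bytes of the file can catch that kind of mismatch before the request is sent. This is only a sketch, independent of the SDK: the function names are hypothetical, and the signatures are the commonly documented leading bytes for PDF, the legacy Office (OLE2) container, and ZIP archives. OLE2 and ZIP are shared with other formats, so a match only suggests .doc or .docx rather than proving it.

```python
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office container, e.g. .doc
ZIP_MAGIC = b"PK\x03\x04"                          # .docx files are ZIP archives

DOCX_TYPE = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document")


def sniff_media_type(head):
    """Guess a media type from the leading bytes of a file, or return None."""
    if head.startswith(PDF_MAGIC):
        return "application/pdf"
    if head.startswith(OLE2_MAGIC):
        return "application/msword"
    if head.startswith(ZIP_MAGIC):
        return DOCX_TYPE
    return None


def declared_type_is_plausible(head, declared):
    """Return False only when the bytes clearly point to a different type."""
    sniffed = sniff_media_type(head)
    return sniffed is None or sniffed == declared
```

For example, reading the first few bytes with `open(file_to_conv, 'rb').read(8)` and passing them in together with 'text/html' would return False for a real PDF, which is the situation in the original question.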
