I'm using the example document_conversion_v1.py to convert a PDF to normalized HTML. It should be super simple, but it's throwing a 400 error code.
I tested the PDF and it works via the Watson demo, so I know it's valid.
watson_developer_cloud.watson_developer_cloud_service.WatsonException: Error: The content of elements must consist of well-formed character data or markup., Code: 400
I also tested a valid .DOC file, and it threw a 400 error code as well.
watson_developer_cloud.watson_developer_cloud_service.WatsonException: Error: Element type "p" must be followed by either attribute specifications, ">" or "/>"., Code: 400
Any help would be much appreciated. Here is the salient code snippet.
with open(file_to_conv, 'r') as document: config = {'conversion_target': DocumentConversionV1.NORMALIZED_HTML} print(document_conversion.convert_document( document=document, config=config, media_type='text/html').content)
Answer by Antoni Prevosti-Vives (592) | Mar 20, 2017 at 10:08 AM
If you're converting a PDF, the media_type should be application/pdf instead of text/html:
document_conversion.convert_document(document=document,
                                     config=config,
                                     media_type='application/pdf')
As you can find in the API reference, the supported media types are:
text/html
text/xhtml+xml
application/pdf
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
You can also choose not to set the media_type and let the service guess it for you, but if you know it, it is safer to set it.
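If you convert several kinds of file, one way to avoid hard-coding the wrong type is to pick it from the file extension. The following is only a minimal sketch: the helper name media_type_for and the extension-to-type pairing are assumptions made for illustration, not part of the SDK, and only the media types listed above are used.

```python
import os

# Media types from the list above, keyed by common file extension.
# The pairing of extensions to types is an assumption for illustration.
MEDIA_TYPES = {
    '.html': 'text/html',
    '.htm': 'text/html',
    '.xhtml': 'text/xhtml+xml',
    '.pdf': 'application/pdf',
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument'
             '.wordprocessingml.document',
}


def media_type_for(path):
    """Return the media type for a path, or None if the extension is unknown."""
    ext = os.path.splitext(path)[1].lower()
    return MEDIA_TYPES.get(ext)
```

The result could then be passed as media_type=media_type_for(file_to_conv) in the convert_document call. A None result is meant to correspond to leaving media_type unset, so that the service guesses the type.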
Answer by @chughts (8573) | Mar 20, 2017 at 08:02 AM
Your term "example document_conversion_v1.py" threw me, but it looks like you meant the pip-installable watson-developer-cloud module, which includes document_conversion_v1.py.
So it is not an example; it's a library that you should install and invoke. How you call it is documented with the API at https://www.ibm.com/watson/developercloud/document-conversion/api/v1/?python#convert-document
with open('sample-docx.docx', 'r') as document:
    response = document_conversion.convert_document(document=document, config=config)
    print(json.dumps(response, indent=2))
    ... # however you want to process the response.
That matches your code. (In future, please format your code as code, as it makes it easier to read and hence to decipher and debug. If you can't be bothered to format your code, then most of us will not be bothered to spend the time and effort it takes to read it.)
with open(file_to_conv, 'r') as document:
    config = {'conversion_target': DocumentConversionV1.NORMALIZED_HTML}
    print(document_conversion.convert_document(
        document=document, config=config, media_type='text/html').content)
I think you have spotted a bug in the Python SDK, as neither cURL nor the Node.js SDK expects media_type as input. Try running your code without media_type specified. The conversion_target config option should suffice.