Can I Read Microsoft Word Documents on My Amazon Fire HD 8
This post will talk about how to read Word Documents with Python. We're going to cover three different packages – docx2txt, docx, and my personal favorite: docx2python.
The docx2txt package
Let's talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word Documents. The example below reads in a Word Document containing the Zen of Python. As you can see, once we've imported docx2txt, all we need is one line of code to read in the text from the Word Document. We can read in the document using a function in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string.
import docx2txt
# read in word file
result = docx2txt.process("zen_of_python.docx")
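To make the one-liner less of a black box, here is a minimal standard-library sketch of the general idea behind text extraction: a .docx file is a ZIP archive whose main text lives in word/document.xml. This is not docx2txt's actual implementation, and it ignores tables, hyperlinks, headers, and footers. The helper names extract_paragraphs and make_demo_docx are made up for illustration, and the demo archive is a stripped-down stand-in rather than a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_bytes):
    """Return the text of each <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> elements
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_demo_docx(lines):
    """Build a tiny in-memory archive that mimics a .docx (hypothetical helper)."""
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


demo = make_demo_docx(
    ["Beautiful is better than ugly.", "Explicit is better than implicit."]
)
text = "\n\n".join(extract_paragraphs(demo))
print(text)
```

For a real file you would pass the bytes read from disk to extract_paragraphs instead of the demo archive. In practice docx2txt.process handles far more cases, so treat this only as an illustration of the file format.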
What if the file has images? In that case we just need a minor tweak to our code. When we run the process function, we can pass an extra parameter that specifies the name of an output directory. Running docx2txt.process will extract any images in the Word Document and save them into this specified folder. The text from the file will still be extracted and stored in the result variable.
import docx2txt
# read in the text and save any images to the given folder
result = docx2txt.process("zen_of_python_with_image.docx", "C:/path/to/store/files")
docx2txt will also scrape any text from tables. Again, this will be returned in a single string with any other text found in the document, which means this text can be more difficult to parse. Later in this post we'll talk about docx2python, which allows you to scrape tables in a more structured format.
The docx package
The source code behind docx2txt is derived from code in the docx package (distributed as python-docx), which can also be used to scrape Word Documents. docx is a powerful library for manipulating and creating Word Documents, but can also (with some restrictions) read in text from Word files.
In the example below, we open a connection to our sample Word file using docx.Document. Here we just input the name of the file we want to connect to. Then, we can scrape the text from each paragraph in the file using a list comprehension in conjunction with doc.paragraphs. This will include scraping separate lines defined in the Word Document for listed items. Unlike docx2txt, docx cannot scrape images from Word Documents. Also, reading doc.paragraphs this way will not scrape out hyperlinks or text in tables defined in the Word Document.
import docx
# open connection to Word Document
doc = docx.Document("zen_of_python.docx")
# read in each paragraph in file
result = [p.text for p in doc.paragraphs]
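Table text is missing from doc.paragraphs, but python-docx does expose tables separately through doc.tables, where each table has rows, each row has cells, and each cell has a text attribute. The sketch below wraps that in a small helper. The name table_to_rows is made up for illustration, and the stand-in objects exist only so the snippet runs without a Word file on hand.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Return a table's cell text as a list of rows (each row a list of strings)."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def make_fake_table(rows):
    """Hypothetical stand-in that mimics the .rows / .cells / .text attributes."""
    return SimpleNamespace(
        rows=[
            SimpleNamespace(cells=[SimpleNamespace(text=value) for value in row])
            for row in rows
        ]
    )


fake_table = make_fake_table([["Name", "Value"], ["a", "1"]])
rows = table_to_rows(fake_table)
print(rows)

# With a real document (requires python-docx), usage would look like:
#   import docx
#   doc = docx.Document("zen_of_python.docx")
#   tables = [table_to_rows(t) for t in doc.tables]
```

This only covers plain cell text. Merged cells and nested tables may need extra handling.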
The docx2python package
docx2python is another package we can use to scrape Word Documents. It has some additional features beyond docx2txt and docx. For instance, it is able to return the text scraped from a document in a more structured format. Let's test out our Word Document with docx2python. We're going to add a simple table to the document so that we can extract that as well.
docx2python contains a function with the same name. If we call this function with the document's name as input, we get back an object with several attributes.
from docx2python import docx2python
# extract docx content
doc_result = docx2python('zen_of_python.docx')
Each attribute provides either text or data from the file. For instance, consider that our file has three main components – the text containing the Zen of Python, a table, and an image. If we call doc_result.body, each of these components will be returned as separate items in a list.
# get separate components of the document
doc_result.body
# get the text from Zen of Python
doc_result.body[0]
# get the image
doc_result.body[1]
# get the table text
doc_result.body[2]
Scraping a Word document table with docx2python
The table text result is returned as a nested list. Each row (including the header) gets returned as a separate sub-list. The 0th element of the list refers to the header – or 0th row of the table. The next element refers to the next row in the table and so on. In turn, each value in a row is returned as an individual sub-list within that row's corresponding list.
We can convert this result into a tabular format using pandas. The data frame is still a little messy – each cell in the data frame is a list containing a single value. This value also has quite a few "\t"'s (which stand for tab characters).
import pandas as pd
# build a data frame from every row after the header
pd.DataFrame(doc_result.body[1][1:])
Here, we use the applymap method to apply the lambda function below to every cell in the data frame. This function gets the individual value within the list in each cell and strips the leading and trailing "\t" characters. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which is called the same way.)
import pandas as pd
# build the data frame from the table rows (skipping the header), then clean each cell
df = pd.DataFrame(doc_result.body[1][1:]).applymap(lambda val: val[0].strip("\t"))
df
Next, let's change the column headers to what we see in the Word file (which was also returned to us in doc_result.body).
df.columns = [val[0].strip("\t") for val in doc_result.body[1][0]]
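If you would rather not pull in pandas, the same cleanup can be sketched in plain Python. The example below assumes a nested list shaped as described above (rows, then cells, then a one-item list holding the cell text). The sample data and the helper names clean_cell and table_to_records are made up for illustration and are not part of docx2python.

```python
def clean_cell(cell):
    """Take the first item of a cell's list and strip surrounding tabs."""
    return cell[0].strip("\t") if cell else ""


def table_to_records(table):
    """Turn [header_row, data_row, ...] into a list of dicts keyed by header."""
    header = [clean_cell(cell) for cell in table[0]]
    return [
        dict(zip(header, (clean_cell(cell) for cell in row)))
        for row in table[1:]
    ]


# Hypothetical data shaped like the nested table list described above
sample_table = [
    [["\t\tPackage"], ["\t\tReads images"]],
    [["\t\tdocx2txt"], ["\t\tyes"]],
    [["\t\tdocx"], ["\t\tno"]],
]
records = table_to_records(sample_table)
print(records)
```

A list of dicts like this can also be handed straight to pd.DataFrame if you decide you want a data frame after all.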
Extracting images
We can extract the Word file's images using the images attribute of our doc_result object. doc_result.images is a dictionary where the keys are the names of the image files (not automatically written to disk) and the corresponding values are the image files in binary format.
type(doc_result.images) # dict
doc_result.images.keys() # dict_keys(['image1.png'])
We can write the binary-formatted image out to a physical file like this:
# write each image to disk under its original file name
for key, val in doc_result.images.items():
    with open(key, "wb") as f:
        f.write(val)
Above we're just looping through the keys (image file names) and values (binary images) in the dictionary and writing each out to file. In this case, we only have one image in the document, so we just get one written out.
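A small variation on the loop above writes the images into a chosen folder instead of the current working directory. This is only a sketch: save_images is a hypothetical helper, and the sample dictionary is a stand-in for doc_result.images so the snippet runs without a Word file.

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir):
    """Write each {file name: bytes} entry into out_dir and return the paths written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        # Keep only the file name in case the key carries a directory part
        target = out_path / Path(name).name
        target.write_bytes(data)
        written.append(target)
    return written


# Hypothetical stand-in for doc_result.images (just the 8-byte PNG signature)
sample_images = {"image1.png": b"\x89PNG\r\n\x1a\n"}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_images(sample_images, tmp)
    saved_names = [p.name for p in paths]
    saved_sizes = [p.stat().st_size for p in paths]

print(saved_names)
```

With a real document you would call save_images(doc_result.images, "some/output/folder") instead of using the temporary directory.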
Other attributes
The docx2python result has several other attributes we can use to extract text or information from the file. For example, if we want to just get all of the file's text in a single string (similar to docx2txt) we can run doc_result.text.
# get all text in a single string
doc_result.text
In addition to text, we can also get metadata about the file using the properties attribute. This returns information such as the creator of the document, the created / last modified dates, and number of revisions.
doc_result.properties
If the document you're scraping has headers and footers, you can also scrape those out like this (note the singular version of "header" and "footer"):
# get the headers
doc_result.header
# get the footers
doc_result.footer
Footnotes can also be extracted like this:
doc_result.footnotes
Getting HTML returned with docx2python
We can also ask docx2python to return HTML-style output that supports a few types of tags, including font (size and color), italics, bold, and underlined text. We just need to specify the parameter "html = True". In the example below we see The Zen of Python in bold and underlined print. Corresponding to this, we can see the HTML version of this in the second snapshot below. The HTML feature does not currently support table-related tags, so I would recommend using the method we went through above if you're looking to scrape tables from Word documents.
doc_html_result = docx2python('zen_of_python.docx', html = True)
Hope you enjoyed this post! Please check out my other Python posts.
Source: http://theautomatic.net/2019/10/14/how-to-read-word-documents-with-python/