12

I need help replacing a string in a Word document while keeping the formatting of the entire document.

I'm using python-docx. After reading the documentation, I see that it works on entire paragraphs, so I lose formatting such as words in bold or italics. The text to replace is itself in bold, and I would like to keep it that way. I'm using this code:

from docx import Document
def replace_string2(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'Text to find and replace' in p.text:
            print('SEARCH FOUND!!')
            text = p.text.replace('Text to find and replace', 'new text')
            style = p.style
            p.text = text
            p.style = style
    # doc.save(filename)
    doc.save('test.docx')
    return 1

So if I run it on something like the following, the paragraph containing the string to be replaced loses its formatting:

This is paragraph 1, and this is a text in bold.

This is paragraph 2, and I will replace old text

The current result is:

This is paragraph 1, and this is a text in bold.

This is paragraph 2, and I will replace new text

Alo
  • 1
    You could try using indices. Ex `for p in range(len(doc.paragraphs)): . . .`, and then set the paragraph back by `doc.paragraphs[p] = text`, assuming the doc.paragraphs returns a list like the documentation says. – cwahls Jan 14 '16 at 05:23
  • I think this gives the same output I'm currently getting. The file keeps its formatting, except the paragraph that contains the string to be replaced. Please correct me if I'm wrong. – Alo Jan 14 '16 at 05:43
  • 1
    [Text formatting is not saved when using assignment.](http://python-docx.readthedocs.org/en/latest/api/text.html#docx.text.paragraph.Paragraph.text) "Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed." – cwahls Jan 14 '16 at 06:18

4 Answers

19

I posted this question (even though I saw a few identical ones on here) because none of those, to my knowledge, solved the issue. One used the oodocx library, which I tried, but it did not work. So I found a workaround.

The code is very similar, but the logic is: once I find the paragraph that contains the string I wish to replace, I add another loop over its runs. (This only works when the whole string lies within a single run, which requires, at minimum, that all of it has the same formatting.)

from docx import Document

def replace_string(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'old text' in p.text:
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if 'old text' in inline[i].text:
                    text = inline[i].text.replace('old text', 'new text')
                    inline[i].text = text
            print(p.text)

    doc.save('dest1.docx')
    return 1
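
To see where the single-run condition bites, here is a minimal sketch that runs without a .docx file. It uses `SimpleNamespace` objects as stand-ins for python-docx runs (an assumption made only so the example is self-contained; real runs expose the same writable `.text` attribute), and `replace_in_single_runs` is a hypothetical helper name, not part of python-docx.

```python
from types import SimpleNamespace

def replace_in_single_runs(runs, old, new):
    """Replace `old` with `new` only where it sits wholly inside one run.

    Returns the number of runs that were changed.
    """
    changed = 0
    for run in runs:
        if old in run.text:
            run.text = run.text.replace(old, new)
            changed += 1
    return changed

# Stand-ins for paragraph.runs: the same paragraph text, split two ways.
whole = [SimpleNamespace(text="I will replace "), SimpleNamespace(text="old text")]
split = [SimpleNamespace(text="I will replace old"), SimpleNamespace(text=" text")]

hits_whole = replace_in_single_runs(whole, "old text", "new text")
hits_split = replace_in_single_runs(split, "old text", "new text")

print(hits_whole, [r.text for r in whole])  # 1 ['I will replace ', 'new text']
print(hits_split, [r.text for r in split])  # 0 ['I will replace old', ' text']
```

In the second case the paragraph text still contains 'old text', but no individual run does, so nothing is replaced.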
Alo
  • 4
    I don't think this can possibly work in all cases. The inline list refers to the internal items of the paragraph, and the paragraph text is composed of them, so you won't always find the full text to replace in a single item. – Tolo Palmer Dec 15 '17 at 18:00
  • The thing is that if you find the word you want to replace in a line, you will still replace the whole line by doing this instead of just the one word you want to replace. It's still better than replacing the whole paragraph though. – Francisco Peters May 28 '19 at 14:25
7

This is what works for me to retain the text style when replacing text.

Based on Alo's answer and the fact that the search text can be split over several runs, here's what worked for me to replace placeholder text in a template docx file. It checks all the document paragraphs and any table cell contents for the placeholders.

Once the search text is found in a paragraph, it loops through its runs to identify which runs contain parts of the search text, then inserts the replacement text in the first of those runs and blanks out the remaining search text characters in the others.
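
The "first run gets the replacement, later runs are blanked" strategy can also be sketched with character offsets instead of a state machine. This is a simplified illustration, not the code from this answer: `replace_across_runs` is a hypothetical name, and the `SimpleNamespace` objects stand in for `paragraph.runs` so the sketch runs on its own.

```python
from types import SimpleNamespace

def replace_across_runs(runs, search, replacement):
    """Replace each occurrence of `search` in the joined text of `runs`.

    The replacement is written into the first run the match touches, so it
    takes that run's formatting; matched characters in later runs are cut.
    Returns the number of replacements made.
    """
    if not search:
        return 0
    count = 0
    start = "".join(r.text for r in runs).find(search)
    while start != -1:
        end = start + len(search)
        offset = 0      # paragraph position of the current run's first char
        first = True
        for run in runs:
            run_start, run_end = offset, offset + len(run.text)
            offset = run_end    # computed from the run's text before editing
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue        # this run does not overlap the match
            insert = replacement if first else ""
            first = False
            run.text = run.text[:lo - run_start] + insert + run.text[hi - run_start:]
        count += 1
        # Resume after the inserted text so a replacement that contains
        # the search string cannot be matched again.
        joined = "".join(r.text for r in runs)
        start = joined.find(search, start + len(replacement))
    return count

# A placeholder split over three runs.
runs = [SimpleNamespace(text="Dear ${Na"),
        SimpleNamespace(text="me"),
        SimpleNamespace(text="}, welcome")]
n = replace_across_runs(runs, "${Name}", "Alice")
print(n, [r.text for r in runs])  # 1 ['Dear Alice', '', ', welcome']
```

The emptied middle run stays in place, so the formatting of the surrounding text is untouched.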

I hope this helps someone. Here's the gist if anyone wants to improve it

Edit: I have subsequently discovered python-docx-template, which allows Jinja2-style templating within a docx template. Here's a link to the documentation

def docx_replace(doc, data):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        for key, val in data.items():
            key_name = '${{{}}}'.format(key) # I'm using placeholders in the form ${PlaceholderName}
            if key_name in p.text:
                inline = p.runs
                # Replace strings and retain the same style.
                # The text to be replaced can be split over several runs so
                # search through, identify which runs need to have text replaced
                # then replace the text in those identified
                started = False
                key_index = 0
                # found_runs is a list of (inline index, index of match, length of match)
                found_runs = list()
                found_all = False
                replace_done = False
                for i in range(len(inline)):

                    # case 1: found in single run so short circuit the replace
                    if key_name in inline[i].text and not started:
                        found_runs.append((i, inline[i].text.find(key_name), len(key_name)))
                        text = inline[i].text.replace(key_name, str(val))
                        inline[i].text = text
                        replace_done = True
                        found_all = True
                        break

                    if key_name[key_index] not in inline[i].text and not started:
                        # keep looking ...
                        continue

                    # case 2: search for partial text, find first run
                    if key_name[key_index] in inline[i].text and inline[i].text[-1] in key_name and not started:
                        # check sequence
                        start_index = inline[i].text.find(key_name[key_index])
                        check_length = len(inline[i].text)
                        for text_index in range(start_index, check_length):
                            if inline[i].text[text_index] != key_name[key_index]:
                                # no match so must be false positive
                                break
                        if key_index == 0:
                            started = True
                        chars_found = check_length - start_index
                        key_index += chars_found
                        found_runs.append((i, start_index, chars_found))
                        if key_index != len(key_name):
                            continue
                        else:
                            # found all chars in key_name
                            found_all = True
                            break

                    # case 2: search for partial text, find subsequent run
                    if key_name[key_index] in inline[i].text and started and not found_all:
                        # check sequence
                        chars_found = 0
                        check_length = len(inline[i].text)
                        for text_index in range(0, check_length):
                            if inline[i].text[text_index] == key_name[key_index]:
                                key_index += 1
                                chars_found += 1
                            else:
                                break
                        # no match so must be end
                        found_runs.append((i, 0, chars_found))
                        if key_index == len(key_name):
                            found_all = True
                            break

                if found_all and not replace_done:
                    for i, item in enumerate(found_runs):
                        index, start, length = item
                        if i == 0:
                            text = inline[index].text.replace(inline[index].text[start:start + length], str(val))
                            inline[index].text = text
                        else:
                            text = inline[index].text.replace(inline[index].text[start:start + length], '')
                            inline[index].text = text
                # print(p.text)

# usage
import docx

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more"))
doc.save('path/to/destination.docx')
adejones
4
from docx import Document

document = Document('old.docx')

dic = {'name':'ahmed','me':'zain'}
for p in document.paragraphs:
    inline = p.runs
    for i in range(len(inline)):
        text = inline[i].text
        # Only a run whose entire text equals a key is replaced
        if text in dic:
            inline[i].text = dic[text]

document.save('new.docx')
zain
  • 2
    Welcome Zain. Could you, please, add some context to your answer, explaining why the code you posted fixes the issue on the original question? – gmauch Aug 05 '19 at 19:58
2

According to the architecture of the DOCX document:

  1. Text: doc>Paragraph>run
  2. Text table: doc>table>row>col>cell>Paragraph>run
  3. Header: doc>sections>header>Paragraph>run
  4. Header table: doc>sections>header>table>row>col>cell>Paragraph>run

The footer is the same as the header. We could traverse the paragraphs directly to find and replace our keywords, but assigning to a paragraph's text resets its formatting, so we have to find and replace within the runs instead. However, a keyword may extend across run boundaries, in which case a run-by-run replacement fails to find it.

Therefore, I propose the following idea: first, taking the paragraph as the unit, record the position of every character in the paragraph in a list; then record, for each of those characters, its run and its position within that run; finally, find the keywords in the paragraph text and use that correspondence to delete and replace them character by character.
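
As a small illustration of that correspondence, the sketch below builds the character map and looks up which (run, char) positions a keyword covers. The names `build_char_index` and `locate` are hypothetical, and `SimpleNamespace` objects stand in for python-docx runs so that it runs without a document; it only locates the keyword and does not perform the replacement.

```python
from types import SimpleNamespace

def build_char_index(runs):
    """Return one (run_index, char_index) pair per character, in paragraph order."""
    return [(y, z) for y, run in enumerate(runs) for z in range(len(run.text))]

def locate(runs, key):
    """For each occurrence of `key`, return the (run, char) positions it covers."""
    text = "".join(run.text for run in runs)
    index = build_char_index(runs)
    hits = []
    start = text.find(key)
    while start != -1:
        # Slicing the map with paragraph offsets yields run-level coordinates.
        hits.append(index[start:start + len(key)])
        start = text.find(key, start + 1)
    return hits

# "MG-" straddles the boundary between the two runs.
runs = [SimpleNamespace(text="Ref MG"), SimpleNamespace(text="-2017")]
positions = locate(runs, "MG-")
print(positions)  # [[(0, 4), (0, 5), (1, 0)]]
```

With these coordinates in hand, characters can be removed from the back of the list to the front, so that earlier indexes stay valid while later ones are deleted.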

'''
-*- coding: utf-8 -*-
@Time    : 2021/4/19 13:13
@Author  : ZCG
@Site    : 
@File    : Batch DOCX document keyword replacement.py
@Software: PyCharm
'''

from docx import Document
import os
import tqdm

def get_docx_list(dir_path):
    '''
    :param dir_path:
    :return: List of docx files in the current directory
    '''
    file_list = []
    for path, dirs, files in os.walk(dir_path):
        for file in files:
            if file.endswith(".docx") and not file.startswith("~"):  # Locate docx documents and exclude temporary files
                file_root = os.path.join(path, file)
                file_list.append(file_root)
    print("The directory found a total of {0} related files!".format(len(file_list)))
    return file_list

class ParagraphsKeyWordsReplace:
    '''
        self:paragraph
    '''
    def paragraph_keywords_replace(self,x,key,value):
        '''
        :param x:  paragraph index
        :param key: Key words to be replaced
        :param value: Replace the key words
        :return:
        '''
        keywords_list = [s for s in range(len(self.text)) if self.text.find(key, s) == s] # Retrieve the number of occurrences of the Key in this paragraph and record the starting position in the List
        # there if use: while self.text.find(key) >= 0,When {"ab":" ABC "} is encountered, it will enter an infinite loop
        while len(keywords_list)>0:             #If this paragraph contains more than one key, you need to iterate
            index_list = [] #Gets the index value for all characters in this paragraph
            for y, run in enumerate(self.runs):  # Read the index of run
                for z, char in enumerate(list(run.text)):  # Read the index of the chars in the run
                    position = {"run": y, "char": z}  # Give each character a dictionary index
                    index_list.append(position)
            # print(index_list)
            start_i = keywords_list.pop()   # Fetch the starting position containing the key from the back to the front of the list
            end_i = start_i + len(key)      # Determine where the key word ends in the paragraph
            keywords_index_list = index_list[start_i:end_i]  # Intercept the section of a list that contains keywords in a paragraph
            # print(keywords_index_list)
            # return keywords_index_list    #Returns a list of coordinates for the chars associated with keywords
            ParagraphsKeyWordsReplace.character_replace(self, keywords_index_list, value)
            # print(f"Successful replacement:{key}===>{value}")

    def character_replace(self,keywords_index_list,value):
        '''
        :param keywords_index_list: A list of indexed dictionaries containing keywords
        :param value: The new word after the replacement
        :return:
        Receive parameters and delete the characters in keywords_index_list from back to front, keeping the position of the first character to hold value
        Note: Delete from back to front; deleting front to back would shift the indexes of the characters still to be removed
        '''
        while len(keywords_index_list) > 0:
            position = keywords_index_list.pop()    #Removes the last element and returns it
            y = position["run"]
            z = position["char"]
            run = self.runs[y]
            if len(keywords_index_list) > 0:
                run.text = run.text[:z] + run.text[z + 1:]          #Delete only the character at index z
            else:
                run.text = run.text[:z] + value + run.text[z + 1:]  #Replace the first character of the keyword with value

class DocxKeyWordsReplace:
    '''
        self:docx
    '''
    def content(self,replace_dict):
        print("Please wait for a moment, the body content is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for x,paragraph in enumerate(self.paragraphs):
                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph,x,key,value)

    def tables(self,replace_dict):
        print("Please wait for a moment, the body tables is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for i,table in enumerate(self.tables):
                for j,row in enumerate(table.rows):
                    for cell in row.cells:
                        for x,paragraph in enumerate(cell.paragraphs):
                            ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph,x,key,value)

    def header_content(self,replace_dict):
        print("Please wait for a moment, the header body content is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for i,sections in enumerate(self.sections):
                for x,paragraph in enumerate(self.sections[i].header.paragraphs):
                    ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def header_tables(self,replace_dict):
        print("Please wait for a moment, the header body tables is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for table in section.header.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def footer_content(self, replace_dict):
        print("Please wait for a moment, the footer body content is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for i, sections in enumerate(self.sections):
                for x, paragraph in enumerate(self.sections[i].footer.paragraphs):
                    ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)


    def footer_tables(self, replace_dict):
        print("Please wait for a moment, the footer body tables is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for table in section.footer.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

def main():
    '''
    How to use it: Modify the values in replace_dict and file_dir
    Replace_dict: The following dictionary corresponds to the format, with key as the content to be replaced and value as the new content
    File_dir: The directory where the docx file resides. Supports subdirectories
    '''
    # Input part
    replace_dict = {
        "MG life technology (shenzhen) co., LTD":"Shenzhen YW medical technology co., LTD",
        "MG-":"YW-",
        "2017-":"2020-",
        "Z18":"Z20",

        }
    file_dir = r"D:\Working Files\SVN"  # a raw string literal cannot end with a backslash
    # Call processing part
    for i,file in enumerate(get_docx_list(file_dir),start=1):
        print(f"{i}. File in progress: {file}")
        docx = Document(file)
        DocxKeyWordsReplace.content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.tables(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.header_content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.header_tables(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.footer_content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.footer_tables(docx, replace_dict=replace_dict)
        docx.save(file)
        print("This document has been processed!\n")

if __name__ == "__main__":
    main()
    print("All complete processing!")
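
The six traversal methods above differ only in where they find their paragraphs, so one possible simplification is a single generator that yields every paragraph in turn. This is a sketch rather than part of the script: `iter_table_paragraphs` and `iter_all_paragraphs` are hypothetical names, and the demo uses `SimpleNamespace` stand-ins (with strings in place of paragraph objects) so it runs without a .docx file. With a real python-docx `Document` it relies only on the attributes the script already uses. It does not descend into nested tables, and merged cells may be yielded more than once.

```python
from types import SimpleNamespace

def iter_table_paragraphs(tables):
    """Yield the paragraphs of every cell in every table."""
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs

def iter_all_paragraphs(doc):
    """Yield body, body-table, header and footer paragraphs of a document."""
    yield from doc.paragraphs
    yield from iter_table_paragraphs(doc.tables)
    for section in doc.sections:
        for part in (section.header, section.footer):
            yield from part.paragraphs
            yield from iter_table_paragraphs(part.tables)

# Stand-in document for demonstration only.
def make_table(*cell_paragraphs):
    cells = [SimpleNamespace(paragraphs=[p]) for p in cell_paragraphs]
    return SimpleNamespace(rows=[SimpleNamespace(cells=cells)])

header = SimpleNamespace(paragraphs=["head"], tables=[make_table("head cell")])
footer = SimpleNamespace(paragraphs=["foot"], tables=[])
doc = SimpleNamespace(paragraphs=["body"],
                      tables=[make_table("cell")],
                      sections=[SimpleNamespace(header=header, footer=footer)])

found = list(iter_all_paragraphs(doc))
print(found)  # ['body', 'cell', 'head', 'head cell', 'foot']
```

A replacement routine could then loop once over `iter_all_paragraphs(docx)` instead of calling six separate methods.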
CG Zhang