Viewed   56 times

I'm trying to read .doc .docx file in php. All is working fine. But at last line I'm getting awful characters. Please help me. Here is code which is developed by someone.

    function parseWord($userDoc) 
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));   
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline)
      {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
          {
          } else {
            $outtext .= $thisline." ";
          }
      }
     $outtext = preg_replace("/[^a-zA-Z0-9s,.-nrt@/_()]/","",$outtext);
    return $outtext;
} 

$userDoc = "k.doc";

Here is screenshot.

 Answers

5

DOC files are not plain text.

Try a library such as PHPWord (old CodePlex site).

nb: This answer has been updated multiple times as PHPWord has changed hosting and functionality.

Tuesday, September 27, 2022
5

Here's what worked best for me when trying to script this (in case anyone else comes across this like I did):

$ pecl -d php_suffix=5.6 install <package>
$ pecl uninstall -r <package>

$ pecl -d php_suffix=7.0 install <package>
$ pecl uninstall -r <package>

$ pecl -d php_suffix=7.1 install <package>
$ pecl uninstall -r <package>

The -d php_suffix=<version> piece allows you to set config values at run time vs pre-setting them with pecl config-set. The uninstall -r bit does not actually uninstall it (from the docs):

vagrant@homestead:~$ pecl help uninstall
pecl uninstall [options] [channel/]<package> ...
Uninstalls one or more PEAR packages.  More than one package may be
specified at once.  Prefix with channel name to uninstall from a
channel not in your default channel (pecl.php.net)

Options:
  ...
  -r, --register-only
        do not remove files, only register the packages as not installed
  ...

The uninstall line is necessary otherwise installing it will remove any previously installed version, even if it was for a different PHP version (ex: Installing an extension for PHP 7.0 would remove the 5.6 version if the package was still registered as installed).

Monday, December 12, 2022
5

Unless you know the offset of the line, you will need to read every line up to that point. You can just throw away the old lines (that you don't want) by looping through the file with something like fgets(). (EDIT: Rather than fgets(), I would suggest @Gordon's solution)

Possibly a better solution would be to use a database, as the database engine will do the grunt work of storing the strings and allow you to (very efficiently) get a certain "line" (It wouldn't be a line but a record with an numeric ID, however it amounts to the same thing) without having to read the records before it.

Monday, October 24, 2022
1

You can use antiword command line utility to do this, I know most of you would have tried it but still I wanted to share.

  • Download antiword from here
  • Extract the antiword folder to C: and add the path C:antiword to your PATH environment variable.

Here is a sample of how to use it, handling docx and doc files:

import os, docx2txt
def get_doc_text(filepath, file):
    if file.endswith('.docx'):
       text = docx2txt.process(file)
       return text
    elif file.endswith('.doc'):
       # converting .doc to .docx
       doc_file = filepath + file
       docx_file = filepath + file + 'x'
       if not os.path.exists(docx_file):
          os.system('antiword ' + doc_file + ' > ' + docx_file)
          with open(docx_file) as f:
             text = f.read()
          os.remove(docx_file) #docx_file was just to read, so deleting
       else:
          # already a file with same name as doc exists having docx extension, 
          # which means it is a different file, so we cant read it
          print('Info : file with same name of doc exists having docx extension, so we cant read it')
          text = ''
       return text

Now call this function:

filepath = "D:\input\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)

This could be good alternate way to read .doc file in Python on Windows.

Hope it helps, Thanks.

Tuesday, December 13, 2022
4

Never used any of those, but they look interesting..

Take a look at Gearman as well.. more overhead in systems like these but you get other cool stuff :) Guess it depends on your needs ..

Friday, November 11, 2022
 
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :