
I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.

I can't just get rid of <span whatever>, as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).

Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.

Also, fixing of curly quotes, em-dashes, etc would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.



HTML Purifier will create standards-compliant markup and filter out many possible attacks (such as XSS).

For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy, which is a binding for the Tidy HTML utility.

If those don't help, I suggest you switch to FCKEditor, which has this feature built in.
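For reference, the brute-force regex idea from the question can be sketched in a few lines. The snippet below is only an illustration, written in Python for brevity: the function name `clean_word_html` and both pattern lists are assumptions made up for this example, they are far from exhaustive, and they do no XSS filtering, so this is not a substitute for HTML Purifier or Tidy.

```python
import re

# Typographic characters Word tends to insert, mapped to plain ASCII.
SMART_PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "--",  # en dash, em dash
    "\u2026": "...",                # ellipsis
}

# Illustrative, non-exhaustive patterns for Word-generated markup.
# Order matters: mso-* declarations are removed before empty style attributes.
WORD_PATTERNS = [
    re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I),  # conditional comments
    re.compile(r"</?o:[^>]*>", re.I),                       # Office namespace tags such as <o:p>
    re.compile(r"mso-[^:;\"']+:[^;\"']*;?\s*", re.I),       # mso-* style declarations
    re.compile(r"\sclass=([\"']?)Mso\w*\1", re.I),          # class="MsoNormal" and friends
    re.compile(r"\sstyle=([\"'])\s*\1", re.I),              # style attributes left empty
]


def clean_word_html(html):
    """Strip common Word-specific markup and normalise smart punctuation."""
    for pattern in WORD_PATTERNS:
        html = pattern.sub("", html)
    for smart, plain in SMART_PUNCTUATION.items():
        html = html.replace(smart, plain)
    return html
```

Note that plain span tags and non-Word style properties are left alone, so TinyMCE's own formatting should survive. Being regex-based, the sketch can also hit text that merely looks like an mso-* declaration, which is one reason a real HTML parser is the safer choice.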

Friday, September 30, 2022

Reading binary Word documents would involve writing a parser according to the published file format specifications for the DOC format. I don't think this is a feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files; they are compatible with the 2003 and 2007 versions of Word. For reading, you have to ensure that the Word documents are saved in the correct format (called Word 2003 XML-Document in Word 2007). For writing, you just have to follow the openly available XML schema. I've never used this format for writing Office documents from PHP, but I use it for reading an Excel worksheet (saved as XML-Spreadsheet 2003, naturally) and displaying its data on a web page. Since the files are plain XML, it's no problem to navigate them and figure out how to extract the data you need.

The other option, which is Word 2007 only (if the OpenXML file formats are not installed in your Word 2003), would be to resort to OpenXML. As databyss pointed out, the DOCX file format is just a ZIP archive with XML files inside. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think; it depends on how much time you're willing to invest.

Perhaps you can have a look at PHPExcel, a library that can read and write Excel 2007 files using the OpenXML standard. It should give you an idea of the work involved in reading and writing OpenXML Word documents.

Saturday, November 5, 2022
$money = array(
    'USD 50.45',
    'USD$ 50.45'
);

// remove everything except a digit "0-9", a comma ",", and a dot "."
$money = preg_replace('/[^\d,.]/', '', $money);

// replace the comma with a dot, in the number format ",12" or ",43"
$money = preg_replace('/,(\d{2})$/', '.$1', $money);



    [0] => 50.45
    [1] => 50.45
    [2] => 50.45
    [3] => 50.45
Sunday, October 9, 2022

So first off we'll need a few helper methods. We'll start off with this simple class to replace all instances of one expression with another:

internal class ReplaceVisitor : ExpressionVisitor
{
    private readonly Expression from, to;
    public ReplaceVisitor(Expression from, Expression to)
    {
        this.from = from;
        this.to = to;
    }
    public override Expression Visit(Expression node)
    {
        return node == from ? to : base.Visit(node);
    }
}

Next we'll create an extension method to use it:

public static Expression Replace(this Expression expression,
    Expression searchEx, Expression replaceEx)
{
    return new ReplaceVisitor(searchEx, replaceEx).Visit(expression);
}

Finally, we'll create a Combine method that will combine two expressions together. It will take one expression that computes an intermediate result from a value, and then another that uses both the first value and the intermediate result to determine the final result.

public static Expression<Func<TFirstParam, TResult>>
    Combine<TFirstParam, TIntermediate, TResult>(
    this Expression<Func<TFirstParam, TIntermediate>> first,
    Expression<Func<TFirstParam, TIntermediate, TResult>> second)
{
    var param = Expression.Parameter(typeof(TFirstParam), "param");

    var newFirst = first.Body.Replace(first.Parameters[0], param);
    var newSecond = second.Body.Replace(second.Parameters[0], param)
        .Replace(second.Parameters[1], newFirst);

    return Expression.Lambda<Func<TFirstParam, TResult>>(newSecond, param);
}

Next we can define the method that computes the ExampleDCDTO objects given an example object. It will be a straight extraction of what you had above, with the exception that instead of returning an IEnumerable<ExampleDCDTO> it'll need to return an expression that turns an Example into such a sequence:

public Expression<Func<Example, IEnumerable<ExampleDCDTO>>> SelectDTO()
{
    return v => db.ExampleUDCs.Where(vudc => vudc.ExampleID == v.ExampleID)
        .Select(vudc => new ExampleDCDTO
        {
            ExampleID = vudc.ExampleID,
            UDCHeadingID = vudc.UDCHeadingID,
            UDCValue = vudc.UDCValue
        });
}

Now to bring it all together we can call this SelectDTO method to generate the expression that computes the intermediate value and Combine it with another expression that uses it:

public IQueryable<ExampleDTO> SelectDTO()
{
    ExampleUDCRepository repository = new ExampleUDCRepository();
    return db.Example
            .Select(repository.SelectDTO().Combine((v, exampleUDCs) =>
                new ExampleDTO()
                {
                    ExampleID = v.ExampleID,
                    MasterGroupID = v.MasterGroupID,
                    ExampleUDCs = exampleUDCs,
                }));
}

Another option, for those using LINQKit, is to use AsExpandable instead of all of my helper methods. This approach still requires creating the SelectDTO method that returns an Expression<Func<Example, IEnumerable<ExampleDCDTO>>>, but you would instead combine the result like so:

public IQueryable<ExampleDTO> SelectDTO()
{
    ExampleUDCRepository repository = new ExampleUDCRepository();
    var generateUDCExpression = repository.SelectDTO();
    return db.Example
        .AsExpandable() // LINQKit: lets Invoke(v) below be expanded into the query
        .Select(v =>
            new ExampleDTO()
            {
                ExampleID = v.ExampleID,
                MasterGroupID = v.MasterGroupID,
                ExampleUDCs = generateUDCExpression.Invoke(v),
            });
}
Thursday, November 17, 2022

I'm not entirely sure what you want, but I guess you are trying to copy text from Word into TinyMCE. To get rid of all unwanted tags and other things like text decoration, you need to use the paste plugin. Use these settings in your init function:

plugins : "paste,...",
paste_use_dialog : false,
paste_auto_cleanup_on_paste : true,
paste_convert_headers_to_strong : false,
paste_strip_class_attributes : "all",
paste_remove_spans : true,
paste_remove_styles : true,
paste_retain_style_properties : "",

You may also use the paste_preprocess and/or paste_postprocess settings to run JavaScript on the pasted content.

Wednesday, November 16, 2022