Sunday, October 30, 2011

Official Website

MAT now has its own website : https://mat.boum.org
Oh, and btw, the first release is here !
Check it out : https://mat.boum.org/files

Thursday, September 22, 2011

I'm back !

After a one month break, I'm back on MAT development.
So, what's new ?
  • Thanks to [4ZM], who sent me patches, there are fewer bugs. Thank you !
  • I fixed bugs too.
  • The promised exiftool binding is here; it is completely functional and optional. The JPEG and PNG formats are now handled with exiftool, hurray !
  • Usual cleanup/bugfixes.
Ok, and next ?
  • Support for more formats, thanks to exiftool.
  • Fewer bugs.
  • The release of the first stable version.
  • Some preliminary counter-measures against watermarking.
Feel free to request features, share ideas, submit patches, report bugs, ...

Have a nice day

Monday, August 29, 2011

End of the GSoC

It's the end of the GSoC : it was a really nice experience, I learned a lot, met a lot of nice people on irc, and earned some money.

My project was to create a Metadata Anonymisation Toolkit (MAT), to improve the privacy of files published online. At first, I based my code heavily on hachoir (a nice, but slightly complex library), but now most of the formats that the MAT supports do not use hachoir. Despite several restructurings/refactorings/stupid ideas/re-implementations/rewrites/... the MAT is alive !
I made two big mistakes : python2.7 and pygobject. Neither of them was in debian stable/tails, so I had to rewrite those parts.

It consists of a modular API (feel free to add support for other formats !), a command line interface, and a graphical user interface (powered by pygtk).

It was my first "serious" project in python, and I was the first to be surprised by the ~3000 lines that I produced. I'm pretty proud of the "pdf processing part", and I'm sad about the setup.py/packaging part (the ugliest/dirtiest/most painful thing that I have ever touched/coded).
I'm still unhappy with my code/piece of software, so I'll continue to improve it : expect nice stuff, like an exiftool binding, watermark counter-measures, ...

Thank you mikeperry for being my mentor (even if you weren't present a lot ;),
thank you google for the amazing GSoC project,
thanks to every user who gave me feedback (and even more stuff to fix !),
and special thanks to haypo, Mc2`, Kiri, intrigeri, bertagaz, Lunar^ and all #tails/#tor-dev !

See you next year ?

Tuesday, August 23, 2011

Why am I using blacklists

Yes, that's right, I am using blacklists inside a security related tool !
As I explained in the previous post, I am using hachoir, and it can't guess fields that it doesn't know about.

So, I know every possible field that hachoir can expose to me for a given file format, and I know which ones are, or aren't, harmful for privacy.

Since the number of harmless fields is greater than the number of harmful ones, I am using a blacklist.

And I think it's the right decision.
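
For illustration, here is a minimal sketch of the blacklist idea, assuming the metadata arrives as a plain name-to-value mapping. The field names and the HARMFUL_FIELDS set are invented for the example; they are neither hachoir's real field names nor the MAT's actual list.

```python
# Hypothetical field names, chosen only for illustration.
HARMFUL_FIELDS = {"author", "gps_position", "camera_model", "creation_date"}


def strip_harmful(fields, blacklist=HARMFUL_FIELDS):
    """Return a copy of `fields` without the blacklisted entries."""
    return {
        name: value
        for name, value in fields.items()
        # Compare case-insensitively so "Author" and "author" both match.
        if name.lower() not in blacklist
    }
```

Every field that is not explicitly listed is kept, which is exactly the trade-off of a blacklist : a harmful field that nobody thought of slips through.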

Why the MAT can't clean everything

Let me explain why (currently) the MAT cannot remove every metadata field.

I am using the hachoir library for images and the mpeg audio format.

Hachoir is a Python library that lets you view and edit a binary stream field by field. In other words, Hachoir allows you to "browse" any binary stream just like you browse directories and files. A file is split into a tree of fields, where the smallest field is just one bit. There are other field types: integers, strings, bits, padding types, floats, etc.

Hachoir is a great lib, but it can't guess fields that it doesn't know about.
In a perfect world, hachoir would be perfect too, and would be able to perfectly parse any file format. In practice, it is not.

I'll take an example to explain the concern : the jpg format.
Imagine hachoir gives me this :
  • header
  • metadata
  • data
  • various_crap
  • end
I must remove the "metadata" and the "various_crap" fields.
But after this operation, exiftool still says that there is metadata inside my file !

The remaining metadata is inside the "header" field. The problem is that the header field contains both vital information (number of colours, compression method, ...) and non-vital information (metadata, crap, ...), so I can't just remove it !

That's why there is metadata remaining even after the MAT's cleaning.

So yes, I know that the MAT is not perfect, I am aware of this, and yes, I am working on this !

Saturday, August 20, 2011

Rage about zip.

Did you know that the zip format does not handle files with an atime/mtime older than 1980 ?
Yeah, me neither.
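
The limit comes from the MS-DOS date format that zip uses, where the year is stored as an offset from 1980. Python's zipfile module refuses earlier dates with a ValueError. A small sketch of how one might detect and work around this; the helper names zip_accepts and clamp_for_zip are made up for the example.

```python
import zipfile

EARLIEST_ZIP_DATE = (1980, 1, 1, 0, 0, 0)


def zip_accepts(date_time):
    """Return True if zipfile can store this (Y, M, D, h, m, s) timestamp."""
    try:
        zipfile.ZipInfo("example.txt", date_time=date_time)
    except ValueError:
        # Raised by zipfile for any year before 1980.
        return False
    return True


def clamp_for_zip(date_time):
    """Replace a pre-1980 timestamp with the earliest one zip can hold."""
    # Tuples compare field by field, so max() picks the later date.
    return max(tuple(date_time), EARLIEST_ZIP_DATE)
```

Clamping is only one possible policy; rejecting the file or warning the user would be just as valid.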

Thursday, August 18, 2011

Week 11 - Report

What I have done :
  • Support of torrent files
  • Packaging using setuptools
  • A manpage
  • Keyboard shortcuts for the GUI
  • Small improvements/bugfixes based on user reviews.
  • Defining with my mentor the threat model : google document

Things that I have learned :
  • Packaging for debian is a pain.

What should be done for the next report :
  • More bugfixes/improvements based on user reviews.

Special thanks to intrigeri and bertagaz, who will be the MAT's package maintainers for debian : good luck with that !

        Thursday, August 11, 2011

        Week 10 - Report

        What I have done :
        • Localisation (French and English for now)
        • Some kind of asynchronous processing
        • test/optimisations/bugfixes/cleanup/...

        Things that I have learned :

        What should be done for the next report :
        • Documentation

        I didn't commit a lot : I was visiting my dad.

        Friday, August 5, 2011

        Week 9 - Report

        What I have done :
        • Backport of tarfile (finally !)
        • The CLI and the GUI can display information about supported file formats.
        • The test suite no longer depends on optional dependencies.
        • Support of the openxml office format
        • Tooltips in the GUI !
        • test/bugfixes/cleanup/...

        Things that I have learned :

        • xml.sax is weird
        • xml is fun
        • being sick sucks
        What should be done for the next report :
        • ?

        Sunday, July 31, 2011

        Week 8 - Report

        What I have done :

        • Clean (and smooth) support of pdf with python-poppler and python-cairo (I'm so proud of this !)
        • Documentation/bugfixes/improvements/tests/...
        Things that I have learned :
        • python-cairo is great and powerful, can't wait for the next version (with metadata support !)
        • python-poppler is not documented, and neither is poppler. If someone asks you to work with poppler, run for your life !
        What should be done for the next report :

        • backport of tarfile support (I really don't want to do it)
        • openxml format support

        Thursday, July 28, 2011

        Week 7.5 - Report

        What I have done :
        • GUI works on tails
        • pep8 validation
        • pyflakes validation
        • pylint validation
        • bugfixes (a lot)
        • refactorisations
        • more testfiles
        • supported formats are now "fully supported"
        • FLAC support
        • Ogg support
        Things that I have learned :
        • mutagen rocks !
        • pep8, pylint, pyflakes rock too !
        What should be done for the next report :
        • backport of tarfile support (I don't want to do it, it's a pain :< )
        • Better pdf support with poppler (It's gonna be a pain too !) 
        • Microsoft office format ?

        Tuesday, July 26, 2011

        Tools

        Just a little list of tools that I used to check my code :

        Sunday, July 24, 2011

        Week 7 - Report

        What I have done :
        • GUI is done
        • Open document support
        • Zip support
        • Partial torrent support
        • Logging
        • A nice setup.py script (thank you nickm !)
        • A lot of internal cleanup/re-arrangement
        • Massive use of tempfile
        Things that I have learned :
        • Don't commit and push at 3h30AM
        • Opendocument format is nice
        What should be done for the next report :
        • Fewer bugs
        • More tests !
        • Complete support of every fileformat supported so far
        • Getting mat working on Tails

        Monday, July 18, 2011

        Week 6 - Report

        What I have done :
        • More about interface
        • Some bugfixes
        • Some logging stuff
        Things that I have learned :
        • pygobject is still not documented at all
        • mimetypes module in python is nice
        Some useful links :
        • http://www.pygtk.org/pygtk2tutorial/index.html
        • http://www.pygtk.org/docs/pygtk/index.html
        • http://git.gnome.org/browse/pygobject/tree/pygi-convert.sh
        What should be done for the next report :
        • Selection in GUI for the clean/check/... function
        • Maybe Drag and Drop
        What has not been done
        • ole2 : I haven't pushed what I had done on my main computer :/
        • gzip : err, I have no excuses for this : I'll do it after the GUI part

        Wednesday, July 13, 2011

        Week 5.5 - Report

        What I have done :
        • First mock up of the interface
        • is_clean() in the GUI
        • bmp support (was easy)
        Things that I have learned :
        • pygobject is not documented at all
        • pdf fileformat is a pain, but ole2 is a lot uglier
        What should be done for the next report :
        • ole2 (I think I'm pretty close, but I don't know how to organise my code)
        • more gui
        I'm going to be offline for a couple of days : http://www.lezartssceniques.com/index.php/affiche-generale/programmation-3-jours.html \o/

        Monday, July 11, 2011

        Week 5 - Report

        Nothing new this week : I was on holidays with my family.

        Sunday, July 3, 2011

        Week 4 - Report

        What I have done :
        • Full (ugly) support of pdf files !
        • Full support of tar.* format (in python2.7)
        • Some cosmetic changes
        • Nice handling of unhandled files
        Things that I have learned :
        • debian stable is still in python 2.6
        What should be done for the next report :
        • Support of archives with python 2.6
        • Support of zip archives
        • Sketching/thinking about the GUI

        Saturday, June 25, 2011

        Week 3 - Report

        What I have done :
        • Support (partial) for tar archive
        • Research about pdf
        • Binding to shred
        • A --backup option, and a --ugly which will implement ugly/destructive ways to anonymise a file.
        Things that I have learned :
        • git-revert
        • pdf is still a big mess
        • It is harder to code when you have forgotten your laptop
        What should be done for the next report :
        • Preliminary support of pdf
        • Support of tar.(gz|bz|whatever is supported by python)
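
A binding to shred can be as small as a subprocess call. The sketch below is an assumption about how such a binding might look, not the MAT's actual code; the function names and the default of 3 passes are made up. The -u and -n options are the ones GNU shred documents for removing the file afterwards and for setting the number of overwrite passes.

```python
import subprocess


def shred_command(path, passes=3):
    """Build the argument list for GNU shred (kept separate for testing)."""
    # -u removes the file after overwriting, -n sets the number of passes.
    # "--" ends option parsing, so a file named "-x" is not read as a flag.
    return ["shred", "-u", "-n", str(passes), "--", path]


def secure_remove(path, passes=3):
    """Overwrite then delete `path`; raises CalledProcessError on failure."""
    subprocess.check_call(shred_command(path, passes))
```

Passing the arguments as a list avoids going through a shell, so file names with spaces or quotes need no escaping.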

        Friday, June 24, 2011

        Pdf

        Some funny facts about the pdf format  :
        - The "main" specification document is itself in pdf.
        - The "main" specification document is more than 1300 pages long.
        - Each new version of the pdf format is a major one.
        - There is no (nice) python library able to do more than
        split/extract/add/remove/count pages
        - A free pdf library is on the "priority list" of the fsf.
        - The least ugly solution that I have found so far to process pdf
        is to transform every single page into a picture, treat the pictures,
        then re-assemble them into a pdf (hachoir does not handle djvu,
        and I haven't found a nice djvu-handling-python-library).
        - There are (surprisingly) metadata fields, which are easy to manipulate.
        - You can encapsulate whatever you want into pdf : pictures, text, fonts,
        blobs, ...

        Conclusion : The pdf format is not only an ugly mess for the programmer, it's a privacy disaster.
        I wonder if it was designed for steganography.

        I can't wait to dig into the .docx format.

        Tuesday, June 21, 2011

        Week 2.5 - Report

        What I have done :
        • Metadata removal works !
        • The testsuite is all green !
        • Support of jpg/png/mpeg audio
        • Preliminary pdf support (only metadata so far, not embedded blob/media/data)
        • Rewriting of the "display meta" function, which is now more friendly
        • Preliminary support for "fields in fields" (like embedded metadata, or stupidly designed file formats)
        Things that I have learned :
        • Hachoir is a pain
        • The pdf spec is 1350 pages long
        • pdfrw is a nice minimalist lib (basic pdf processing in python)
        • The KISS principle is always great
        What should be done for the next report :
        • Support of embedded meta inside pdf (maybe a pdf -> DjVu -> pdf conversion ?)
        • Support for more file formats, and tests for them
        • Preliminary research for "secure removal"

        Thursday, June 16, 2011

        Week 1.5 - Report

        What I have done :
        • A complete rewriting of the argument parser
        • Some internal cosmetic changes
        • A robust CLI
        • Reading a lot of the hachoir lib, thanks to the absence of documentation.
        Things that I have learned :
        • The subprocess module is nice.
        • Debian uses python 2.6, and that sucks.
        • argparse is way better than optparse
        What should be done for the next report :
        • Being able to remove a given meta from a file (That should already work :< ).
        • Complete/polish the test suite.
        • Organise my files into folders.
        Conclusion
        • There aren't many improvements since my last blogpost, because I was busy IRL. But I'm pretty confident about completing my "First two weeks" objectives.
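
To give an idea of what an argparse-based parser looks like, here is a small sketch. The --backup and --ugly option names are taken from the Week 3 report; the program name, the help texts and the positional "files" argument are assumptions made for the example, not the MAT's real interface.

```python
import argparse


def build_parser():
    """Build a command line parser (illustrative, not the real MAT CLI)."""
    parser = argparse.ArgumentParser(
        prog="mat",
        description="Remove metadata from files.",
    )
    # One or more files to process.
    parser.add_argument("files", nargs="+", help="files to clean")
    # Boolean switches default to False and become True when given.
    parser.add_argument("--backup", action="store_true",
                        help="keep a copy of the original file")
    parser.add_argument("--ugly", action="store_true",
                        help="allow destructive cleaning methods")
    return parser
```

Usage : `build_parser().parse_args(["--backup", "photo.jpg"])` returns a namespace whose `backup` attribute is True and whose `files` attribute is `["photo.jpg"]`. Help output and error messages come for free, which is a large part of why argparse is more pleasant than optparse.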

        Wednesday, June 8, 2011

        Week 0.5 - report

        What I have done these 3 days :
        • A nice internal object oriented structure for representing a file, with its metadata, editor, name, ... and with (I think) most of the methods.
        • A test suite (using the amazing unittest module) for the lib.
        • I am able to get all the meta of any file supported by the hachoir lib (and there are many, many, many).
        Things that I have learned :
        • Don't try to organize a project into different folders too early.
        • Shift + Delete is stupid.
        • The unittest module is great.
        • The tempfile module is simple, but nice too.
        • hachoir-meta is not able to modify metadata.
        • Don't hesitate to completely rework parts that seem ugly : never say "it works, so it's enough". Never.

        What should be done for the next report :
        • Argument parsing for the CLI.
        • The beginning of a test suite for the CLI.
        • Being able to remove a given meta from a file.
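
As a rough picture of how an object representing a file and its unittest-based test suite fit together, here is a minimal sketch. The FileRecord class and its methods are hypothetical stand-ins written for this example (only the is_clean() name appears elsewhere in these reports); the real internal structure is not shown in this post.

```python
import unittest


class FileRecord:
    """Hypothetical stand-in for the internal object representing a file."""

    def __init__(self, name, metadata=None):
        self.name = name
        # Copy the mapping so the caller's dict is never modified.
        self.metadata = dict(metadata or {})

    def is_clean(self):
        """A file is clean when it carries no metadata at all."""
        return not self.metadata

    def remove_all(self):
        """Drop every metadata entry."""
        self.metadata.clear()


class FileRecordTest(unittest.TestCase):
    def test_remove_all_leaves_a_clean_file(self):
        record = FileRecord("photo.jpg", {"author": "someone"})
        self.assertFalse(record.is_clean())
        record.remove_all()
        self.assertTrue(record.is_clean())
```

Saved as a module, the test can be run with `python -m unittest <module name>`.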

        Sunday, June 5, 2011

        Start !

        Holidays, yeah !
        Tea, calm and nap for today, I'll start coding tomorrow.

        Wednesday, May 25, 2011

        Bonding Period - Report 3

        Nothing new : I've got my exams next week, and a ton of school projects to finish.

        Sunday, May 8, 2011

        Bonding Period - Report 2

        I wasn't really productive this week, because of the end of my holidays : I'm back at school :'<
        I have continued to play around with hachoir (which does not have much documentation), and I'm re-reading "Dive into Python" <3.

        So, nothing really interesting this week, sorry.

        Sunday, May 1, 2011

        Bonding Period - Report 1

        This week, I drank some beer (mostly kwak) : I am going to be paid by google for working on a great free project, weehee \o/

        After that, I played with Hachoir (test.py (stupid DNS)) : I'm now able to see if a file contains some meta or not. But the problem is that hachoir displays fields, which can be metadata (in red), or simply data.

        Example on jpg file :
        jvoisin@dementia:~/dev/GSoC $ python test.py ~/test.jpg
        <JpegChunk path=/start_image, current_size=16, current length=2>
        <JpegChunk path=/app0, current_size=144, current length=4>
        <JpegChunk path=/exif, current_size=59872, current length=4>
        <JpegChunk path=/photoshop, current_size=73296, current length=4>
        <JpegChunk path=/chunk[0], current_size=32400, current length=4>
        <JpegChunk path=/icc, current_size=25296, current length=4>
        <JpegChunk path=/adobe, current_size=280, current length=4>
        <JpegChunk path=/quantization[0], current_size=1072, current length=4>
        <JpegChunk path=/start_frame, current_size=152, current length=4>
        [...]
        So far, Hachoir seems to be a great lib !

        I also managed to completely reinstall my laptop (debian) and my desktop (gentoo_x64) to get clean development environments.

        Finally, I played around a little bit with pygobject : it's nice, and really simple !
        I have found a pack of examples (you can get it here : (stupid DNS again)) which will surely be really useful.

        I've got some DNS problems : I'll upload the files later this week.

        Tuesday, April 26, 2011

        Timeline

        Timeline:
        • Community Bonding Period (in order of importance)
          • Playing around with pygobject
          • Playing with Hachoir
          • Learning git

        • First two weeks :
          • create the structure in the repository (directories, README, ..)
          • Create a skeleton

          • Objectives : to have a deployable working system as soon as possible (even if the list of features is ridiculously short), so that I can show you my work in an incremental way thereafter and get feedback early.
          • The lib will handle reading/writing EXIF fields (using Hachoir)
          • A set of test files (and automated unit tests) to demonstrate that the lib does the job
          • The beginning of the command line tool, which at this point must list and delete EXIF meta
          • An automated end-to-end test to show that the command line tool properly removes the EXIF


        After this first step (making the skeleton), I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so that I can fix problems quickly.
        • 3 weeks
          • adding support for (in order of importance) pdf, zip/tar/bzip (just the meta, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe...
          • For every type of meta, that involves :
            • Creating some input test files with meta data
            • Implementing the feature in the library
            • Asserting that the lib does the job with unit tests
            • Modifying the cmd line tool to support the feature (if necessary)
            • Checking that the cmd line tool can properly delete this type of meta with automated end-to-end test

        • about one day
          • Enable the command line tool to set a specific meta to a chosen value

        • about 1 day
          • Implementation of the “batch mode” in the cmdline tool, to clean a whole folder
          • Implementation of secure removal

        • about 2 days :
          • Add support for deep archive cleanup
            • Clean the content of the archives
            • Make a list of non supported format, for which we warn the user that only the container can be cleaned from meta, not the content (at first that will include rar, 7zip, ..)
            • The supported formats  will be those  supported natively by  Python ( bzip2, gzip, tar )
            • Create some test archives for each supported format containing various files with metas
            • Implement the deep cleanup for the format
            • Assert that the command line passes the end-to-end tests (that is, it can correctly clean the content of the test archives)

        • about 2 days
          • Add support for complete deletion of the original files
          • Make a nice binding for shred (should not be too hard using Python)
          • Implement the feature in the command line tool

           
        • 3 weeks
          • Implementation of the GUI tool
          • At this stage, I can use the experience from implementing the cmdline tool to implement the GUI tool, having the same features.

        • 1 week
          • Add support for more formats (might be based on requests from the community)

        • Remaining weeks
          • I want to keep those remaining weeks in case of problems, and for
            • Remaining/polishing cleanup
            • Bugfixing
            • Integration work
            • Missing features
            • Packaging
            • Final documentation

        • Every Week-end :
          • Documentation time : both end-user and design. I do not like to document my code while I’m coding it : it slows down the development process a lot, but it’s not a good thing to delay it too much : week-ends seem fine for this.
          • A blog-post, and a mail on the mailing list about what I have done in the week.
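
The "just the meta, not the content" step for tar archives in the timeline above can be sketched with the standard tarfile module. This is an illustration under assumptions, not the MAT's implementation : the function name clean_tar and the choice of neutral values (0 and empty strings) are made up, and the sketch works on in-memory bytes to stay self-contained.

```python
import io
import tarfile


def clean_tar(source_bytes):
    """Return a new tar archive (as bytes) with neutral member headers."""
    output = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(source_bytes), mode="r:*") as src, \
            tarfile.open(fileobj=output, mode="w") as dst:
        for member in src.getmembers():
            # Blank out who created the file and when.
            member.mtime = 0
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            # Extended (pax) headers can repeat the same values: drop them.
            member.pax_headers = {}
            # Only regular files carry a payload to copy.
            payload = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, payload)
    return output.getvalue()
```

Only the container headers are cleaned here. The content of each member is copied untouched, so a "deep" cleanup would still have to process every extracted file with the handler for its own format.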

        Design

        Requirement/Deliverables:
        • A command line and a GUI tool having both the following capabilities (in order of importance):
          • Listing the metadata embedded in a given file
          • A batch mode to handle a whole directory (or set of directories)
          • The ability to scan files packed in the most common archive formats
          • A nice binding for srm (Secure ReMoval) or shred (GNU utils) to properly remove the original file containing the evil metas
          • Let the user delete/modify a specific meta

        • Should run on the most common OS/architectures (And especially on Debian Squeeze, since Tails is based on it.)
        • The whole thing should be easily extensible (especially it should be easy to add support for new file formats)
        • The proper functioning of the software should be easily testable

        I’d like to do this project in Python, because I have already done some personal projects with it (for which I also used subversion) : an IRC bot tailored for logging (dustri.org/tor/depluie.tar.bz2, still under heavy WIP), a battery monitor, a simple search engine indexing FTP servers, ...

        Why is Python a good choice for implementing this project ?
        1. I am experienced with the language
        2. There are plenty of libraries to read/write metadata, among them Hachoir (https://bitbucket.org/haypo/hachoir/), which looks very promising since it supports quite a few file formats
        3. It is easy to wrap other libraries for our needs (even if they are not written in Python !)
        4. It runs on almost every OS/architecture, which is a great benefit for portability
        5. It is easy to write unit tests (thanks to the built-in unittest module)


        Proposed design:

        The proposed design has three main components : one lib, a command line tool and a GUI tool.

        The aim of the library (described in more detail in the next part) is to make the development of tools easy. Special attention will be paid to the API that it exposes. The ultimate goal is to be able to add support for a new file format in the library without changing the code of the tools.

        Meta reading/writing library :

        A library to read and write metas for various file formats. The main goal is to provide an abstraction interface (for the file format and for the underlying libraries used).
        At first it would only wrap Hachoir.
        Why hachoir :
        • Autofix: Hachoir is able to open invalid / truncated files
        • Lazy: Opening a file is very fast since no information is read from the file; data is read and/or computed when the user asks for it
        • Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports string with charset (ISO-8859-1, UTF-8, UTF-16, ...)
        • Addresses and sizes are stored in bit, so flags are stored as classic fields
        • Editor: Using Hachoir representation of data, you can edit, insert, remove data and then save in a new file.
        • Meta : Supports a very wide range of file formats


        But we could also wrap other libraries to support a particular file format, or write the support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of the format and maintaining the thing over time is extremely time consuming).
        Ideally, the underlying libraries would be optional dependencies.

        One typical use case of the lib is to ask for the metadata of a file : if the format is supported, a list (or maybe a tree) of metas is returned.

        Both the GUI and the cmdline tool will use this lib.
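
One way to picture this abstraction layer is a small registry that maps a file type to a handler, so that both tools call a single entry point. The sketch below is only an assumption about how the design could be laid out : every name in it (HANDLERS, register, backend_available, get_metadata) is hypothetical, and the JPEG handler is a placeholder that returns nothing.

```python
import importlib.util
import mimetypes

# MIME type -> function returning the metas of a file (hypothetical registry).
HANDLERS = {}


def register(mime_type):
    """Decorator that declares a handler for one MIME type."""
    def decorator(func):
        HANDLERS[mime_type] = func
        return func
    return decorator


def backend_available(module_name):
    """Return True if an optional top-level module is installed."""
    return importlib.util.find_spec(module_name) is not None


@register("image/jpeg")
def jpeg_metadata(path):
    # Placeholder: a real handler would delegate to a backend library.
    return []


def get_metadata(path):
    """Return a list of metas, or None when the format is not supported."""
    # The type is guessed from the file name only, not from its content.
    mime, _encoding = mimetypes.guess_type(path)
    handler = HANDLERS.get(mime)
    return handler(path) if handler is not None else None
```

Adding a new format then means writing one more decorated function, without touching the command line or GUI code. A check such as backend_available() lets a handler be registered only when its optional library is present.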

        The cmdline/GUI tool features:
        • List all the meta
        • Removing all the meta
        • Anonymising all the meta
        • Let the user choose which meta they want to modify
        • Support archive anonymisation
        • Secure removal
        • Cleaning whole folders recursively
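
Cleaning a whole folder recursively mostly comes down to walking the directory tree and handing each file to the library. A minimal sketch with the standard os.walk function; the name iter_files is made up for the example.

```python
import os


def iter_files(root):
    """Yield the path of every file below `root`, recursively."""
    for directory, _subdirs, filenames in os.walk(root):
        # Sort for a stable order inside each directory.
        for filename in sorted(filenames):
            yield os.path.join(directory, filename)
```

A batch mode would loop over `iter_files(folder)` and clean each path in turn, skipping (or warning about) the formats that are not supported.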


        GUI:
        Essentially, the GUI tool would offer the same features as the cmd line tool.
        I do not have significant GUI development experience, but I’m planning to fix that during the community bonding period.

        Monday, April 25, 2011

        print('hello world')

        Hello,
        I am Julien (jvoisin) Voisin, an undergraduate computer science student from France, and it's my first GSoC !

        I'm going to work for the Tor/Tails project this summer, and more specifically on a Metadata Anonymisation Toolkit.

        It is a little bit scary to work for such a big project, but I'll try to do good work, and to deliver a nice program :)