MAT now has its own website: https://mat.boum.org
Oh, and by the way, the first release is here!
Check it out: https://mat.boum.org/files
Metadata Anonymisation Toolkit
The blog of the GSoC 2011 of Julien (jvoisin) Voisin for the Tor project.
Sunday, October 30, 2011
Thursday, September 22, 2011
I'm back !
After a one month break, I'm back on MAT development.
So, what's new ?
- Thanks to [4ZM], who sent me patches, there are fewer bugs. Thank you!
- I fixed bugs too.
- The promised exiftool binding is here; it is completely functional and optional. The JPEG and PNG formats are now handled with exiftool, hurray!
- Usual cleanup/bugfixes.
- Support for more formats, thanks to exiftool.
- Fewer bugs.
- The release of the first stable version.
- Some preliminary counter-measures against watermarking.
Have a nice day
Monday, August 29, 2011
End of the GSoC
It's the end of the GSoC: it was a really nice experience. I learned a lot, met a lot of nice people on IRC, and earned some money.
My project was to create a Metadata Anonymisation Toolkit (MAT), to improve the privacy of files published online. At first, I based my code heavily on hachoir (a nice, but somewhat complex library), but now most of the formats that the MAT supports do not use hachoir. Despite several restructurings/refactorings/stupid ideas/re-implementations/rewrites/... the MAT is alive!
I made two big mistakes: python2.7 and pygobject. Neither of them was available in Debian stable/Tails, so I had to rewrite those parts.
The MAT consists of a modular API (feel free to add support for other formats!), a command line interface, and a graphical user interface (powered by pygtk).
It was my first "serious" project in Python, and I was the first to be surprised by the ~3000 lines that I produced. I'm pretty proud of the "pdf processing part", and I'm sad about the setup.py/packaging part (the ugliest/dirtiest/most painful thing that I have ever touched/coded).
I'm still unhappy with my code/piece of software, so I'll continue to improve it. Expect nice things, like an exiftool binding, watermark counter-measures, ...
Thank you mikeperry for being my mentor (even if you weren't present a lot ;),
thank you Google for the amazing GSoC project,
thanks to every user who gave me feedback (and even more things to fix!),
and special thanks to haypo, Mc2`, Kiri, intrigeri, bertagaz, Lunar^ and all of #tails/#tor-dev!
See you next year ?
Tuesday, August 23, 2011
Why am I using blacklists
Yes, that's right, I am using blacklists inside a security-related tool!
As I have already explained in the previous post, I am using hachoir, and it can't guess fields that it doesn't know about.
So, I know every possible field that hachoir can expose to me for a given file format, and I know which ones are, or aren't, harmful to privacy.
Since the number of harmless fields is greater than the number of harmful ones, I am using a blacklist.
And I think it's the right decision.
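To make the blacklist idea concrete, here is a minimal sketch in Python 3. It is not the MAT's actual code: the field names are borrowed from the hachoir JPEG output shown in an earlier post, and which of them count as harmful is an assumption made purely for illustration.

```python
# Assumption: these field names are considered harmful for privacy.
# A real blacklist would be written per file format, from the fields
# that the parser is known to expose.
JPEG_BLACKLIST = {"exif", "photoshop", "adobe"}


def split_fields(field_names, blacklist):
    """Split field names into (kept, removed) according to the blacklist."""
    kept, removed = [], []
    for name in field_names:
        if name in blacklist:
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


fields = ["start_image", "app0", "exif", "photoshop", "quantization", "start_frame"]
kept, removed = split_fields(fields, JPEG_BLACKLIST)
print("kept:", kept)
print("removed:", removed)
```

The trade-off is the usual one: a field that is missing from the blacklist survives cleaning, which is acceptable only because the set of fields the parser can expose is known in advance.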
Labels:
explanation
Why the MAT can't clean everything
Let me explain why (currently) the MAT cannot remove every metadata field.
I am using the hachoir library for images and the MPEG audio format.
Hachoir is a Python library that lets you view and edit a binary stream field by field. In other words, Hachoir allows you to "browse" any binary stream just like you browse directories and files. A file is split into a tree of fields, where the smallest field is just one bit. There are other field types: integers, strings, bits, padding types, floats, etc.
Hachoir is a great lib, but it can't guess fields that it doesn't know about.
In a perfect world, hachoir would be perfect too, and would be able to perfectly parse any file format. In reality, it is not.
I'll take an example to explain the concern: the jpg format.
Imagine that hachoir gives me this:
- header
- metadata
- data
- various_crap
- end
But after the MAT has cleaned the file, exiftool still says that there is metadata inside my file!
The remaining metadata is inside the "header" field. The problem is that the header field contains both vital information (number of colours, compression methods, ...) and non-vital information (metadata, crap, ...), so I can't just remove it!
That's why there is metadata remaining even after the MAT's cleaning.
So yes, I know that the MAT is not perfect, I am aware of this, and yes, I am working on this!
Labels:
explanation,
technical
Saturday, August 20, 2011
Rage about zip.
Did you know that the zip format does not handle files with an atime/mtime older than 1980?
Yeah, me neither.
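The limit comes from the way zip stores dates (the year is kept as an offset from 1980), and Python's standard zipfile module enforces it. The sketch below, written for Python 3, shows the rejection and one possible way to use the limit for anonymisation: rewriting an archive so that every entry carries the earliest representable date. The function names are hypothetical and this is not the MAT's implementation.

```python
import io
import zipfile

ZIP_EPOCH = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the zip format can store


def accepts_timestamp(date_time):
    """Return True if zipfile accepts this (Y, M, D, h, m, s) tuple."""
    try:
        zipfile.ZipInfo("probe", date_time=date_time)
    except ValueError:
        return False
    return True


def normalise_zip_dates(data):
    """Return a copy of the zip archive in `data` with every entry dated ZIP_EPOCH."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for item in src.infolist():
            # A fresh ZipInfo also drops comments, extra fields and permissions.
            clean = zipfile.ZipInfo(item.filename, date_time=ZIP_EPOCH)
            dst.writestr(clean, src.read(item.filename))
    return out.getvalue()


print(accepts_timestamp((1979, 12, 31, 23, 59, 58)))
print(accepts_timestamp(ZIP_EPOCH))
```

Entries are written uncompressed here to keep the sketch short; a real tool would probably preserve the original compression method.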
Thursday, August 18, 2011
Week 11 - Report
What I have done :
- Support of torrent files
- Packaging using setuptools
- A manpage
- Keyboard shortcuts for the GUI
- Small improvements/bugfixes based on user reviews.
- Defining the threat model with my mentor: google document
- Making packages for Debian is a pain.
- More bugfixes/improvements based on user reviews.
Thursday, August 11, 2011
Week 10 - Report
What I have done :
- Localisation (French and English for now)
- Some kind of asynchronous processing
- test/optimisations/bugfixes/cleanup/...
Things that I have learned :
- The gettext module is nice
- Documentation
I didn't commit a lot : I was visiting my dad.
Friday, August 5, 2011
Week 9 - Report
What I have done :
- Backport of tarfile (finally !)
- The CLI and the GUI can display information about supported file formats.
- The test suite no longer depends on optional dependencies.
- Support of the openxml office format
- Tooltips in the GUI !
- test/bugfixes/cleanup/...
Things that I have learned :
- xml.sax is weird
- xml is fun
- being sick sucks
- ?
Sunday, July 31, 2011
Week 8 - Report
What I have done :
- Clean (and smooth) support of pdf with python-poppler and python-cairo (I'm so proud of this !)
- Documentation/bugfixes/improvements/tests/...
- python-cairo is great and powerful, can't wait for the next version (with metadata support !)
- python-poppler is not documented, and neither is poppler. If someone asks you to work with poppler, run for your life!
- backport of tarfile support (I really don't want to do it)
- openxml format support
Thursday, July 28, 2011
Week 7.5 - Report
What I have done :
- GUI works on tails
- pep8 validation
- pyflakes validation
- pylint validation
- bugfixes (a lot)
- refactorisations
- more testfiles
- supported formats are now "fully supported"
- FLAC support
- Ogg support
- mutagen rocks !
- pep8, pylint, pyflakes rock too!
- backport of tarfile support (I don't want to do it, it's a pain :< )
- Better pdf support with poppler (It's gonna be a pain too !)
- Microsoft office format ?
Tuesday, July 26, 2011
Sunday, July 24, 2011
Week 7 - Report
What I have done :
- GUI is done
- Open document support
- Zip support
- Partial torrent support
- Logging
- A nice setup.py script (thank you nickm !)
- A lot of internal cleanup/re-arrangement
- Massive use of tempfile
- Don't commit and push at 3h30AM
- Opendocument format is nice
- Fewer bugs
- More tests !
- Complete support of every fileformat supported so far
- Getting mat working on Tails
Monday, July 18, 2011
Week 6 - Report
What I have done :
- More about interface
- Some bugfixes
- Some logging stuff
- pygobject is still not documented at all
- mimetypes module in python is nice
- http://www.pygtk.org/pygtk2tutorial/index.html
- http://www.pygtk.org/docs/pygtk/index.html
- http://git.gnome.org/browse/pygobject/tree/pygi-convert.sh
- Selection in GUI for the clean/check/... function
- Maybe Drag and Drop
- ole2 : I haven't pushed what I had done on my main computer :/
- gzip : err, I have no excuses for this : I'll do it after the GUI part
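The mimetypes module mentioned in this report guesses a file's type from its name alone, which is a convenient first step for choosing a format handler. A short Python 3 sketch follows; the helper name is hypothetical.

```python
import mimetypes


def guess_format(path):
    """Return the guessed MIME type for `path`, or None if it is unknown."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime


for name in ("photo.jpg", "report.pdf", "mystery.unknownext"):
    print(name, "->", guess_format(name))

# For compressed archives, the compression is reported separately.
print(mimetypes.guess_type("backup.tar.gz"))
```

Since the guess relies only on the extension, a security-minded tool would still need to check that the content really matches before trusting it.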
Wednesday, July 13, 2011
Week 5.5 - Report
What I have done :
- First mock up of the interface
- is_clean() in the GUI
- bmp support (was easy)
- pygobject is not documented at all
- pdf fileformat is a pain, but ole2 is a lot uglier
- ole2 (I think I'm pretty close, but I don't know how to organise my code)
- more gui
Monday, July 11, 2011
Sunday, July 3, 2011
Week 4 - Report
What I have done :
- Full (ugly) support of pdf files !
- Full support of tar.* format (in python2.7)
- Some cosmetic changes
- Nice handling of unhandled files
- debian stable is still in python 2.6
- Support of archives with python 2.6
- Support of zip archives
- Sketching/thinking about the GUI
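As an illustration of what cleaning a tar archive involves, the Python 3 sketch below copies an archive while resetting the per-member metadata (timestamps, owner IDs and names). It is a sketch under assumptions, not the MAT's code: the function name is hypothetical, and it only cleans the container, not the files inside it.

```python
import io
import tarfile


def clean_tar(data):
    """Return a copy of the tar archive in `data` with member metadata reset."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src.getmembers():
            member.mtime = 0
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            # Extended (pax) headers can carry their own copies of these values.
            member.pax_headers = {}
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    return out.getvalue()
```

Member names and file contents are left untouched, so a deep cleanup would still have to process each extracted file with the handler for its own format.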
Saturday, June 25, 2011
Week 3 - Report
What I have done :
- Support (partial) for tar archive
- Research about pdf
- Binding to shred
- A --backup option, and a --ugly which will implement ugly/destructive ways to anonymise a file.
- git-revert
- pdf is still a big mess
- It is harder to code when you have forgotten your laptop
- Preliminary support of pdf
- Support of tar.(gz|bz|whatever is supported by python)
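The shred binding listed above can be as thin as a subprocess call. The Python 3 sketch below assumes GNU shred is installed and uses its -n (number of passes), -z (final pass of zeros) and -u (remove the file afterwards) options. The function names and the default number of passes are assumptions.

```python
import subprocess


def shred_command(path, passes=3):
    """Build the argument list for GNU shred ("--" ends option parsing)."""
    return ["shred", "-n", str(passes), "-z", "-u", "--", path]


def secure_remove(path, passes=3):
    """Overwrite and delete `path`; raises CalledProcessError if shred fails."""
    subprocess.check_call(shred_command(path, passes))


print(shred_command("original.jpg"))
```

Passing the arguments as a list avoids the shell entirely, so odd file names cannot be interpreted as commands.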
Friday, June 24, 2011
Some funny facts about the pdf format :
- The "main" specification document is itself in pdf.
- The "main" specification document is more than 1300 pages long.
- Each new version of the pdf format is a major one.
- There is no (nice) Python library able to do more than split/extract/add/remove/count pages.
- A free pdf library is on the "priority list" of the FSF.
- The least ugly solution that I have found so far to process pdf is to transform every single page into a picture, process the pictures, then re-assemble them into a pdf (hachoir does not handle djvu, and I haven't found a nice djvu-handling Python library).
- There are (surprisingly) metadata fields, which are easy to manipulate.
- You can encapsulate whatever you want into a pdf: pictures, text, fonts, blobs, ...
Conclusion: the pdf format is not only an ugly mess for the programmer, it's a privacy disaster.
I wonder if it was designed for steganography.
I can't wait to dig into the .docx format.
Tuesday, June 21, 2011
Week 2.5 - Report
What I have done :
- Metadata removal works !
- The testsuite is all green !
- Support of jpg/png/mpeg audio
- Preliminary pdf support (only metadata so far, not embedded blob/media/data)
- Rewriting of the "display meta" function, which is now more friendly
- Preliminary support for "fields in fields" (like embedded metadata, or stupidly designed file formats)
- Hachoir is a pain
- The pdf spec is 1350 pages long
- pdfrw is a nice minimalist lib (basic pdf processing in python)
- The KISS principle is always great
- Support of embedded meta inside pdf (maybe a pdf -> DjVu -> pdf conversion ?)
- Support for more file formats, and tests for them
- Preliminary research for "secure removal"
Thursday, June 16, 2011
Week 1.5 - Report
- A complete rewriting of the argument parser
- Some internal cosmetic changes
- A robust CLI
- Reading a lot of the hachoir lib, thanks to the absence of documentation.
- The subprocess module is nice.
- Debian uses python 2.6, and that sucks.
- argparse is way better than optparse
- Being able to remove a given meta from a file (That should already work :< ).
- Complete/polish the test suite.
- Organise my files into folders.
- There aren't many improvements since my last blog post, because I was busy IRL. But I'm pretty confident about completing my "First two weeks" objectives.
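For readers who have not used argparse, the sketch below shows the kind of parser this report refers to, in Python 3. The program name, the --display flag and the short option letters are illustrative assumptions; --backup is an option named in a later report.

```python
import argparse


def build_parser():
    """Build a small command line parser (option names are illustrative)."""
    parser = argparse.ArgumentParser(
        prog="mat", description="Metadata Anonymisation Toolkit (sketch)"
    )
    parser.add_argument("files", nargs="+", help="files to process")
    parser.add_argument("-d", "--display", action="store_true",
                        help="list the metadata instead of removing it")
    parser.add_argument("-b", "--backup", action="store_true",
                        help="keep a copy of the original file")
    return parser


args = build_parser().parse_args(["--display", "photo.jpg", "song.mp3"])
print(args.display, args.backup, args.files)
```

Compared with optparse, positional arguments such as the file list are declared like any other argument, and the help text is generated for them as well.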
Wednesday, June 8, 2011
Week 0.5 - report
What I have done these 3 days :
- A nice internal object-oriented structure for representing a file, with its metadata, editor, name, ... and with (I think) most of the methods.
- A test suite (using the amazing unittest module) for the lib.
- I am able to get all the metadata of any file supported by the hachoir lib (and there are many, many, many).
- Don't try to organise a project into different folders too early.
- Shift + Delete is stupid.
- The unittest module is great.
- The tempfile module is simple, but nice too.
- hachoir-meta is not able to modify metadata.
- Don't hesitate to completely rework parts that seem ugly: never say "it works, so it's enough". Never.
What should be done for the next report :
- Argument parsing for the CLI.
- The beginning of a test suite for the CLI.
- Being able to remove a given meta from a file.
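The unittest and tempfile modules praised above combine naturally: each test gets a throwaway file, runs the cleaner on it, and checks the result. The Python 3 sketch below uses a toy stand-in for the cleaner (it drops lines starting with "#"), since the real library's API is not shown in this post.

```python
import os
import tempfile
import unittest


def strip_comment_lines(path):
    """Toy stand-in for a cleaner: drop every line starting with '#'."""
    with open(path) as handle:
        lines = [line for line in handle if not line.startswith("#")]
    with open(path, "w") as handle:
        handle.writelines(lines)


class TestCleaner(unittest.TestCase):
    def setUp(self):
        # A fresh temporary file per test keeps the tests independent.
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as handle:
            handle.write("# author: someone\ncontent\n")

    def tearDown(self):
        os.remove(self.path)

    def test_metadata_is_removed(self):
        strip_comment_lines(self.path)
        with open(self.path) as handle:
            self.assertEqual(handle.read(), "content\n")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleaner)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Working on temporary copies also means that a buggy cleaner can never damage the reference test files.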
Sunday, June 5, 2011
Wednesday, May 25, 2011
Bonding Period - Report 3
Nothing new: I've got my exams next week, and a ton of school projects to finish.
Sunday, May 8, 2011
Bonding Period - Report 2
I wasn't really productive this week, because of the end of my holidays : I'm back at school :'<
I have continued to play around with hachoir (which does not have much documentation), and I'm re-reading "Dive into Python" <3.
So, nothing really interesting this week, sorry.
Sunday, May 1, 2011
Bonding Period - Report 1
This week, I drank some beer (mostly Kwak): I am going to be paid by Google for working on a great free project, weehee \o/
After that, I played with Hachoir (test.py (stupid DNS)): I'm now able to see whether a file contains some metadata or not. But the problem is that hachoir displays fields, which can be metadata (in red), or simply data.
Example on a jpg file:
jvoisin@dementia:~/dev/GSoC $ python test.py ~/test.jpg
<JpegChunk path=/start_image, current_size=16, current length=2>
<JpegChunk path=/app0, current_size=144, current length=4>
<JpegChunk path=/exif, current_size=59872, current length=4>
<JpegChunk path=/photoshop, current_size=73296, current length=4>
<JpegChunk path=/chunk[0], current_size=32400, current length=4>
<JpegChunk path=/icc, current_size=25296, current length=4>
<JpegChunk path=/adobe, current_size=280, current length=4>
<JpegChunk path=/quantization[0], current_size=1072, current length=4>
<JpegChunk path=/start_frame, current_size=152, current length=4>
[...]
So far, Hachoir seems to be a great lib !
I also managed to completely reinstall my laptop (Debian) and my desktop (gentoo_x64) to get a clean development environment.
Finally, I played around a little bit with pygobject: it's nice, and really simple!
I have found a pack of examples (you can get it here: (stupid DNS again)), which will surely be really useful.
I've got some DNS problems: I'll upload the files later this week.
Tuesday, April 26, 2011
Timeline
Timeline:
- Community Bonding Period (in order of importance)
- Playing around with pygobject
- Playing with Hachoir
- Learning git
- First two weeks :
- create the structure in the repository (directories, README, ..)
- Create a skeleton
- Objectives: to have a deployable working system as soon as possible (even if the list of features is ridiculously short), so that I can show you my work incrementally thereafter and get feedback early.
- The lib will handle reading/writing EXIF fields (using Hachoir)
- A set of tests files (and automated unit tests) to demonstrate that the lib does the job
- The beginning of the command line tool, which at this point must list and delete EXIF metadata
- An automated end-to-end test to show that the command line tool does properly remove the EXIF
After this first step (making the skeleton), I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so that I can fix problems quickly.
- 3 weeks
- adding support for (in order of importance) pdf, zip/tar/bzip (just the meta, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe...
- For every type of meta, that involves :
- Creating some input test files with meta data
- Implementing the feature in the library
- Asserting that the lib does the job with unit tests
- Modifying the cmd line tool to support the feature (if necessary)
- Checking that the cmd line tool can properly delete this type of meta with automated end-to-end test
- about one day
- Enable the command line tool to set a specific meta to a chosen value
- about 1 day
- Implementation of the “batch mode” in the cmdline tool, to clean a whole folder
- Implementation of secure removal
- about 2 days :
- Add support for deep archive cleanup
- Clean the content of the archives
- Make a list of unsupported formats, for which we warn the user that only the container can be cleaned of metadata, not the content (at first that will include rar, 7zip, ...)
- The supported formats will be those supported natively by Python ( bzip2, gzip, tar )
- Create some test archives for each supported format containing various files with metas
- Implement the deep cleanup for the format
- Assert that the command line passes the end-to-end tests (that is, it can correctly clean the content of the test archives)
- about 2 days
- Add support for complete deletion of the original files
- Make a nice binding for shred (should not be too hard using Python)
- Implement the feature in the command line tool
- 3 weeks
- Implementation of the GUI tool
- At this stage, I can use the experience from implementing the cmdline tool to implement the GUI tool, having the same features.
- 1 week
- Add support for more formats (might be based on requests from the community)
- Remaining weeks
- I want to keep those remaining weeks in case of problems, and for
- Remaining/polishing cleanup
- Bugfixing
- Integration work
- Missing features
- Packaging
- Final documentation
- Every Week-end :
- Documentation time: both end-user and design. I do not like to document my code while I’m writing it, because it slows the development process a lot, but it’s not good to delay it too much either: weekends seem fine for this.
- A blog-post, and a mail on the mailing list about what I have done in the week.
Labels:
prensentation,
timeline
Design
Requirement/Deliverables:
- A command line and a GUI tool having both the following capabilities (in order of importance):
- Listing the metadata embedded in a given file
- A batch mode to handle a whole directory (or set of directories)
- The ability to scan files packed in the most common archive formats
- A nice binding for srm (Secure ReMoval) or shred (GNU utils) to properly remove the original file containing the evil metadata
- Let the user delete/modify a specific piece of metadata
- Should run on the most common OS/architectures (And especially on Debian Squeeze, since Tails is based on it.)
- The whole thing should be easily extensible (especially it should be easy to add support for new file formats)
- The proper functioning of the software should be easily testable
I’d like to do this project in Python, because I have already done some personal projects with it (for which I also used subversion): an IRC bot tailored for logging (dustri.org/tor/depluie.tar.bz2, still under heavy WIP), a battery monitor, a simple search engine indexing FTP servers, ...
Why is Python a good choice for implementing this project ?
- I am experienced with the language
- There are plenty of libraries to read/write metadata; among them is Hachoir (https://bitbucket.org/haypo/hachoir/), which looks very promising since it supports quite a few file formats
- It is easy to wrap other libraries for our needs (even if they are not written in Python!)
- Runs on almost every OS/architecture, which is a great benefit for portability
- It is easy to make unit tests (thanks to the built-in Unittest module)
Proposed design:
The proposed design has three main components : one lib, a command line tool and a GUI tool.
The aim of the library (described in more detail in the next part) is to make the development of tools easy. Special attention will be paid to the API that it exposes. The ultimate goal is to be able to add support for new file formats in the library without changing the whole code of the tools.
Meta reading/writing library :
A library to read and write metas for various file formats. The main goal is to provide an abstraction interface (for the file format and for the underlying libraries used).
At first it would only wrap Hachoir.
Why hachoir :
- Autofix: Hachoir is able to open invalid / truncated files
- Lazy: Opening a file is very fast since no information is read from the file; data is read and/or computed when the user asks for it
- Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports string with charset (ISO-8859-1, UTF-8, UTF-16, ...)
- Addresses and sizes are stored in bit, so flags are stored as classic fields
- Editor: Using Hachoir representation of data, you can edit, insert, remove data and then save in a new file.
- Meta: Supports a very wide range of file formats
But we could also wrap other libraries to support a particular file format, or write the support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of the format and maintaining the thing over time is extremely time-consuming).
Ideally, the underlying libraries would be optional dependencies.
One typical use case of the lib is to ask for the metadata of a file: if the format is supported, a list (or maybe a tree) of metadata is returned.
Both the GUI and the cmdline tool will use this lib.
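One way to read the design described above is as a common handler interface plus a registry, so that supporting a new format means adding a class and leaving the tools alone. The Python 3 sketch below is only an illustration of that idea: the class, function and MIME type names are all hypothetical, and only the is_clean() name appears elsewhere on this blog.

```python
from abc import ABC, abstractmethod

HANDLERS = {}  # MIME type -> handler class


def register(mime):
    """Class decorator that records a handler for the given MIME type."""
    def decorator(cls):
        HANDLERS[mime] = cls
        return cls
    return decorator


class Handler(ABC):
    """Interface that the command line and GUI tools would program against."""

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def get_meta(self):
        """Return a dict of the metadata found in the file."""

    @abstractmethod
    def remove_all(self):
        """Remove every piece of metadata from the file."""

    def is_clean(self):
        return not self.get_meta()


@register("text/x-demo")
class DemoHandler(Handler):
    """Toy handler that keeps its 'metadata' in memory."""

    def __init__(self, path):
        super().__init__(path)
        self._meta = {"author": "someone"}

    def get_meta(self):
        return dict(self._meta)

    def remove_all(self):
        self._meta.clear()


def handler_for(path, mime):
    """Return a handler instance, or None if the format is unsupported."""
    cls = HANDLERS.get(mime)
    return cls(path) if cls else None
```

A tool would call handler_for() and only ever use the three interface methods, which is also what would let the wrapped libraries stay optional.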
The cmdline/GUI tool features:
- List all the meta
- Removing all the meta
- Anonymising all the meta
- Let the user choose which metadata to modify
- Support archive anonymisation
- Secure removal
- Cleaning whole folders recursively
GUI:
Essentially, the GUI tool would offer the same features as the command line tool.
I do not have significant GUI development experience, but I’m planning to fix that during the community bonding period.
Labels:
prensentation
Monday, April 25, 2011
print('hello world')
Hello,
I am Julien (jvoisin) Voisin, an undergraduate computer science student from France, and this is my first GSoC!
I'm going to work for the Tor/Tails project this summer, and more specifically on a Metadata Anonymisation Toolkit.
It is a little bit scary to work for such a big project, but I'll try to do good work and to deliver a nice program :)
Labels:
presentation