Re: [LAU] How to handle large amount of pdf files (music sheets) for collaborative usage

To: Nils Hammerfest <list@...>
Cc: linux-audio-user <linux-audio-user@...>
Date: Saturday, April 3, 2010 - 4:29 pm

PDFsam (PDF Split-And-Merge) is a handy open-source tool:
http://www.pdfsam.org/

Its name is its feature set, for the most part. It lets you reorder
PDFs, pull pages out, add pages in, rotate pages by 90, 180, or 270
degrees, etc. It's command-line driven, but there's also a GUI console.

It's got a Windows installer, but should run anywhere Java is available.

Sounds like, on top of the scanning and organizing solution, you need to
find an OCR application that can extract metadata from each of the PDFs
at scale.

If you're planning to put these out for public consumption, you can use
Google to assist you in your scanning and indexing:

http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/

Alternatively, the open-source OCR world is getting better fast. Check out
OCRopus (http://en.wikipedia.org/wiki/OCRopus) -- it's a Linux-based
command-line OCR tool. You should be able to incorporate it into a
workflow: it'll spit out what it thinks your PDF says in an HTML-like
format (specified here: http://docs.google.com/View?docid=dfxcv4vc_67g844kf).

I could see a workflow on your end that creates four rotations of each
scanned page, attempts to OCR each rotation with OCRopus, compares the
results, and stores in your datastore the one with the highest combination
of recognized characters and recognition score. I suppose this is only
really helpful if a) your PDFs often get scanned upside-down or sideways,
and b) all your PDFs have some amount of typeset text on them.
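
To make the rotation idea a bit more concrete, here's a rough Python
sketch. It is only an illustration under some assumptions: `rotate` and
`ocr` are hypothetical callables you'd have to supply yourself (for
example, wrappers around whatever image tool and OCR engine you settle
on), `ocr` is assumed to return a (text, confidence) pair with confidence
between 0 and 1, and multiplying the alphanumeric character count by the
confidence is just one guess at how to "combine" the two numbers.

```python
from typing import Any, Callable, NamedTuple, Sequence, Tuple


class OcrResult(NamedTuple):
    rotation: int      # degrees of rotation applied before OCR
    text: str          # what the OCR engine thinks the page says
    confidence: float  # engine's recognition score, assumed in [0, 1]


def score(result: OcrResult) -> float:
    """Combine recognized-character count and recognition score.

    Assumption: a simple product of the two; tune to taste.
    """
    recognized = sum(1 for ch in result.text if ch.isalnum())
    return recognized * result.confidence


def best_rotation(
    page: Any,
    rotate: Callable[[Any, int], Any],         # hypothetical: rotate a page image
    ocr: Callable[[Any], Tuple[str, float]],   # hypothetical: run OCR on an image
    rotations: Sequence[int] = (0, 90, 180, 270),
) -> OcrResult:
    """OCR the page at each rotation and return the best-scoring result."""
    candidates = []
    for degrees in rotations:
        text, confidence = ocr(rotate(page, degrees))
        candidates.append(OcrResult(degrees, text, confidence))
    return max(candidates, key=score)
```

You'd call `best_rotation(page, rotate, ocr)` once per scanned page and
store the returned rotation and text alongside the PDF.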

Anyway, a couple ideas.

-----
Luke Peterson

On Sat, Apr 3, 2010 at 11:56 AM, Nils Hammerfest wrote:

> > Get a quality enterprise content management system. There is a

> Thanks for your information. I will have a look at this

> > 2) Do you know any free PDF editor besides "PDFEdit" (which seemed
> > fine from screenshots and descriptions, but first tries were not
> > successful)

> The problem here is that those PDFs are generated by scans. So in fact
> we are dealing with mostly images of notation here. The main tasks here
> are to add/delete pages, rotate single pages or "all uneven" etc. or
> crop borders for a section, selection, all etc. and in the end save them
> again. I think with some scripts it should be possible to do the
> decompress/compress thing from pdf->images->pdf but still this leaves
> the point of pagewise editing. I don't know for sure but I think GIMP or
> Inkscape cannot do this.
>
> Nils

Messages in current thread:
Re: [LAU] How to handle large amount of pdf files (music she..., Luke Peterson, (Sat Apr 3, 4:29 pm)
Re: [LAU] How to handle large amount of pdf files (music she..., Alexandros Diamantidis, (Sat Apr 3, 2:31 pm)
Re: [LAU] How to handle large amount of pdf files (music she..., Peter Desjardins, (Tue Mar 30, 11:56 am)