Programming Things I Wish I Knew Earlier

theodp writes "Raw intellect ain't always all it's cracked up to be, advises Ted Dziuba in his introduction to Programming Things I Wish I Knew Earlier, so don't be too stubborn to learn the things that can save you from the headaches of over-engineering. Here's some sample how-to-avoid-over-complicating-things advice: 'If Linux can do it, you shouldn't. Don't use Hadoop MapReduce until you have a solid reason why xargs won't solve your problem. Don't implement your own lockservice when Linux's advisory file locking works just fine. Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job. Modern Linux distributions are capable of a lot, and most hard problems are already solved for you. You just need to know where to look.' Any cautionary tips you'd like to share from your own experience?"
This discussion has been archived. No new comments can be posted.

  • by msobkow ( 48369 ) on Monday September 06, 2010 @02:23PM (#33490298) Homepage Journal

    The truth is that the "hard" way of doing things is often more fun, because you have the challenge of learning a new tool or API. Plus sometimes it's actually easier in the long run because you've engineered a solution for the outer bounds conditions of scalability, so if your application takes off, it can handle the load.

    I guess the real issue is that you have to engineer a "good enough" solution rather than a "worst case" solution.

  • Re:Comment your code (Score:5, Interesting)

    by russotto ( 537200 ) on Monday September 06, 2010 @03:02PM (#33490730) Journal

    Commenting code isn't enough, it's just a small part of the design and documentation process. Comments are there to tie the code to the relevant part in your design document, which really is a part of programming people should put more effort into.

    It's been said for years, but it is almost never done. When it is done, it's most often (IME) done _after the fact_ because of some requirement to produce the paperwork. Perhaps it's time to give up on it. Is there a real reason for insisting on a design document, or is it just some sort of self-flagellation on the part of programmers?

  • by Murdoch5 ( 1563847 ) on Monday September 06, 2010 @03:30PM (#33491066)

    I know these might sound odd, but hear me out. Start by trying to rewrite the basic libraries: make your own printf, strcpy, strlen, etc. Write your own versions of linked lists and tree storage structures, and above all, really start to understand how memory works.

    Another really important thing I have learned is to stay FAR FAR away from OO programming until you're really comfortable in lower-level languages. The reason is that too many students and beginners sit there trying to figure out why their variable started with value X and ended up with value Y, only to find out that their object smashed some memory earlier on.

    Basically, just grab a good C compiler (I mean a C compiler: not C++, not C#, not F#) and start to learn how all the functions you use on a daily basis work. It will give you new insight into why problems happen and how to avoid and fix them quickly, before and when they do. It's also important to get a really good handle on using a CLI over a GUI: stay away from Visual Studio and similar environments. Use GCC and CC, look at how LD works, and understand how compilers optimize and transform your code.

    The article talks about grabbing tools to do image processing and perform functions that already have working solutions. But taking the time to see how those solutions work, and why they work, will give you good insight into not only great code design but great programming methods. It might seem odd to suggest that a beginner try to rewrite strcpy or strcmp, but once you see how they really work, you'll be far less likely to make the simple mistakes that can stop your program or project from working. The same goes for a beginner figuring out how malloc works and where memory is taken from and returned to. All of these suggestions come from the way I learned to program in C and other languages.

    Feel free to throw any of these away or take any of them into your own programming adventure, but one thing is for sure: when you can figure out how the basic functions you use every day work, it will save you hours and days of troubleshooting and leave you with a greater palette of tools to use in the classroom and on the job. I welcome anyone who wants to add ideas to this post or attack it with their own viewpoints.
  • by msobkow ( 48369 ) on Monday September 06, 2010 @03:40PM (#33491166) Homepage Journal

    Actually I volunteer for the projects that are going to expose me to something new, rather than only taking on projects where I already know how the solution will work. The latter are bread-n-butter to the company, the former are the future of the company.

    For example, I've spent the past year on a Freeswitch project rather than on the older Asterisk based code. Freeswitch scales better, is better architected, and is more flexible. The downside was spending 3-6 months working with the Freeswitch team to resolve issues with the code.

    In the end, Freeswitch is where we are going; Asterisk is where we were. At the time the Asterisk code was started, Freeswitch hadn't even reached its first release, so it wasn't an option back then.

    Next up is a rework of the database IO codebase so that it becomes feasible to plug-n-play different databases. We could do it with the existing code base, but it would be very painful, kludgy, and difficult to maintain. Instead we're going to make a clean break on our next release to a new architecture for the database code. Sure it'll take longer at first -- but by the time we're on to our third database we should be well ahead of the curve and saving time.

  • Re:Comment your code (Score:2, Interesting)

    by aLEczapKA ( 452675 ) on Monday September 06, 2010 @04:01PM (#33491396)

    If you need to comment your code you did something wrong and you should refactor it.

    The inventor of C++, when asked 'How do you debug your code?', said: 'I don't. If I have to debug the code to understand it, it means the code is wrong and I rewrite it.'

    A great book on the subject is Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin.

  • by SanityInAnarchy ( 655584 ) on Monday September 06, 2010 @04:20PM (#33491622) Journal

    If you are writing a program that touches more than two persistent data stores, it is too complicated.

    Others have already mentioned cases where multiple datastores make sense. A trivial example: One database to handle user data, another to handle blobs (image conversions, etc) -- bonus if the second store can do its own conversions; a third to handle logging -- that's already three, and that's before we start considering things like RESTful services, which can function as intelligent datastores of their own...

    If Linux can do it, you shouldn't.

    Unless you're not on Linux. And, specifically:

    Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job.

    If you're doing something that truly works as a shell script, and isn't part of a larger app, I agree. However, PIL likely performs better, and it removes the shell as an issue -- if you thought SQL injection was bad, wait till you have people exploiting your shell commands. You can do it safely, but why would you bother, when you've got libraries that accept Python (or Perl, or Ruby) native arguments, rather than forcing you to deal with commandline arguments? Why do you want to check return values, when you can have these native libraries throw exceptions?

    Parallelize When You Have To, Not When You Want To

    If you don't at least think about parallelization in the planning stage, it's going to be painful later on. It's easy to build a shared-nothing, stateless architecture and run it in a single-threaded way. It's hard to build a stateful web service with huge, heavyweight sessions, and then make it run on even two application servers in the future. Possible, but awkward, to say the least.

    For example, if you are doing web crawling, and you have not saturated the pipe to the internet, then it is not worth your time to use more servers.

    ...unless, maybe, it's CPU-bound? And this is odd to mention in a section about parallelization -- wouldn't slow servers be a prime candidate for some sort of parallelization, even on a single machine, even if it's evented?

    If you have a process running and you want it to be restarted automatically if it crashes, use Upstart.

    Cool, but it looks like Upstart is becoming a Maslow's Hammer for this guy. Tools like Nagios, Monit, and God exist for a reason -- one such reason is knowing when and why your processes are dying even if they're spread across a cluster.

    NoSQL is NotWorthIt

    People who have read my other posts likely know where I stand on this, but...

    Redis, even though it's an in-memory database, has a virtual memory feature, where you can cap the amount of RAM it uses and have it spill the data over to disk. So, I threw 75GB of data at it, giving it a healthy amount of physical memory to keep hot keys in...

    So you found out an in-memory database wasn't suitable when you have far more data than physical memory? Great test, there.

    Redis was an unknown quantity...

    Maybe so, but that wasn't terribly hard to guess.

    Yes, maybe things could have been different if I used Cassandra or MongoDB...

    So maybe you should've benchmarked a NoSQL database which is actually designed to solve the problem you're trying to solve? Just a thought.

    especially if something like PostgreSQL can do the same job.

    If PostgreSQL could do the same job, the current generation of NoSQL databases wouldn't have been invented. Unless something's changed, PostgreSQL can't scale beyond a single machine for writes, unless you deliberately shard at the application layer, which would violate his rule about multiple datastores, wouldn't it?


  • Re:Comment your code (Score:3, Interesting)

    by SL Baur ( 19540 ) on Monday September 06, 2010 @05:25PM (#33492232) Homepage Journal

    Put enough comments in your code so that five years from now you (and others) can remember what you intended the code to do.

    /* This is hairy. We need to compute where the XEmacs binary was invoked
              from because temacs initialization requires it to find the lisp
              directories. The code that recomputes the path is guarded by the
              restarted flag. There are three possible paths I've found so far
              through this:

              temacs -- When running temacs for basic build stuff, the first main_1
                will be the only one invoked. It must compute the path else there
                will be a very ugly bomb in startup.el (can't find obvious location
                for doc-directory data-directory, etc.).

              temacs w/ run-temacs on the command line -- This is run to bytecompile
                all the out of date dumped lisp. It will execute both of the main_1
                calls and the second one must not touch the first computation because
                argc/argv are hosed the second time through.

              xemacs -- Only the second main_1 is executed. The invocation path must
                be computed, but this only matters when running in place or when running
                as a login shell.

              As a bonus for straightening this out, XEmacs can now be run in place
              as a login shell. This never used to work.

              As another bonus, we can now guarantee that
              (concat invocation-directory invocation-name) contains the filename
              of the XEmacs binary we are running. This can now be used in a
              definite test for out of date dumped files. -slb */

    OK. So now everyone knows how Lisp programs written with a core in C initialize themselves, right?

    And as much as people may joke about it, XEmacs was tested to ensure that it worked as a login shell prior to release.

  • Sk (Score:1, Interesting)

    by Anonymous Coward on Monday September 06, 2010 @05:58PM (#33492490)

    Next up is a rework of the database IO codebase so that it becomes feasible to plug-n-play different databases.

    Why? Pick a widely used database that works and stick with it. Less work, simpler code, easier testing, and a shorter route to maturity.

  • by laura42 ( 1893282 ) on Monday September 06, 2010 @05:59PM (#33492504) Homepage
    I agree with all of this, but also... Doctors, lawyers, and most (non-software) engineers have reasons not to present their work as being radically different from others in the same profession. Doctors who are sued for malpractice generally want to argue that what they did is the same as what any other doctor would do, lawyers want to argue that their positions are based on precedent, and civil engineers want to convince people that their suspension bridge designs are based on the same principles that make other suspension bridges safe.
  • Re:Comment your code (Score:5, Interesting)

    by Thangodin ( 177516 ) on Monday September 06, 2010 @06:17PM (#33492622) Homepage

    If you are new to coding, don't be a bedroom programmer. You are no longer writing a 10,000 line app alone in your bedroom. You may be working on a million line app with a team. Change your habits accordingly. Learn to work with other people.

    Programming is one of those things that humans are not quite smart enough to do. This means you. Check your ego at the door. In the early 90's, IBM estimated that 80% of large projects in the industry (one million lines or more) were "abandoned in disgust". This should give you some idea of what you are up against.

    Come to work knowing what you are doing. This may mean cramming in your off hours. Don't say that you don't know how to do something. Say that you do and then learn it!

    Put in comments where they are needed, and maintain them. You will forget what you were doing within three months. The harder it was to code, the more you need the comments.

    Use descriptive variable names. Try to organize your data into conceptually simple variables where possible.

    If you have to complicate a mathematical formula by breaking it into sections appropriate for inner and outer loops, put the formula in the comments. It may even be worth putting in an ASCII diagram if you are working with geometry.

    If you can't see the bug, it's because you have become blind to the code. Get someone else to take a look. The mistake may be embarrassingly obvious to a new set of eyes.

    If speed is a factor, preprocess the data. Offload runtime cycles to preprocessing.

    Maintain an up to date user manual for all tools and apps. Add to it as you add features, update it as you update the features.

    Avoid magic numbers where possible, and put any magic numbers you do use into defines, again with descriptive names.

    If you can, avoid virtual methods and pointers in streamed objects. This way you can bulk load and bulk write them. Indices are often fast enough, or can be converted to pointers after loading if need be.

    If you have lots of booleans, consider a bit array.

    Try to write reusable code. Code for the general case when possible, but...

    Normalize your data and objects. Don't waste memory and time maintaining variables you don't need. Don't repeat yourself.

    Your key indexes should be integers, never strings. Yes, I have seen databases keyed on memo fields--they were tragically slow.

    If updating an existing project, get the client to sign off on what is not to be changed or fixed, and make certain that the QA department gets this list. Otherwise bugs will creep onto the list that you are not actually required to fix, expanding the scope of the project.

    Build test harnesses whenever you can which can be turned on with a simple switch. This will make regression testing a lot easier.

  • by dbIII ( 701233 ) on Monday September 06, 2010 @08:40PM (#33493598)

    Hell, I got an N900 6 months ago, and it's already EOL'd as far as updates to the OS are concerned.

    So, where exactly did you get that information from and why do you think it is real?

  • by B1ackDragon ( 543470 ) on Monday September 06, 2010 @08:45PM (#33493630)
    Very much agreed, though I'm not a simulationist per se anymore (these days I just do all kinds of crazy things to large data sets, and need to remember what crazy things I did, and when, and as you said, what the output means!)

    Here's another tip, which I also thought about last time someone asked slashdot about scientific data organization: keep a wiki, and write down all the things you do (at least everything that isn't trivial to reproduce) there. Commands, parameters used, input and output files created, etc. I organize chronologically. Having a digital "lab notebook" can be invaluable, and it makes the problem of organizing things much easier, since everything important is indexed in the wiki and can be looked up based on the timeframe of the project.
  • Re:Comment your code (Score:5, Interesting)

    by Coryoth ( 254751 ) on Monday September 06, 2010 @09:16PM (#33493830) Homepage Journal

    But not so many that you (or others) will find it more work than it's worth to change the comments when the code changes.

    I prefer code with no comments to code with actively misleading comments, and I hate code with no comments! :)

    The trick is that if you are writing comments describing what the code is intended to do, you can write those comments in something like JML or Frama-C's ACSL. That way you can use ESC/Java2 and JUnit, or Frama-C, to check that the code does what you intended. You get two benefits: more rigorous checks on your code (including use of theorem provers from ESC/Java and Frama-C), and if your documentation ever falls out of date with the code, you'll immediately get errors flagged.

  • Re:Comment your code (Score:4, Interesting)

    by Hooya ( 518216 ) on Monday September 06, 2010 @09:40PM (#33493954) Homepage

    I have tried implementing a 'design document' process for the better part of the 10 years I've been with this group. It's never gotten done. We came close about a year ago. Here's why I still try (while knowing that it'll never get done):

    There's a reason architects use blueprints.

  • by Anonymous Coward on Monday September 06, 2010 @10:49PM (#33494318)

    You're right. I never really gave it much thought until reading your reply, but I've been doing that more and more. Practically all the code I write to change data formats, do calculations, etc. has a standard "comment" indicator (same as UNIX shells: "#" as the first character of the line), and all the data-reading routines skip over or optionally echo those comments to STDERR (verbose mode) as the data is processed. That means I can chuck all sorts of comments in there and nothing gets messed up in the rest of the data. The first few lines of the file ordinarily have the date, source of the file, labels for the columns, units, problems/limitations etc., and those get preserved and sometimes augmented by more information from programs as the data gets passed down the processing pipeline (e.g., like program arguments). It helps a lot.

    It is sloppier than having a proper data format where that sort of metadata is encoded in separate fields (e.g., netCDF files), but it's also very simple and much better than unlabeled, flat ASCII text files full of cryptic numbers. I've also pretty much standardized on tab as column delimiter and newline as record delimiter, letting me easily use standard UNIX tools like head, tail, cut, paste, sort, grep, etc. -- why reinvent the wheel for such simple operations? It's not a real database, of course, but it's amazing how versatile and quick these tools can be for batch processing massive quantities of data.

  • Re:Comment your code (Score:2, Interesting)

    by BrokenHalo ( 565198 ) on Tuesday September 07, 2010 @05:37AM (#33496284)
    Took me a bit of headscratching before I realised what was going on. Ouch.

    That's one reason why I have a tendency to be suspicious of editors that offer a WYSIWYG interface. I much prefer YAFIYGI (You Asked For It, You Got It) editors like (of course) TECO or, more recently, EMACS or (if you insist) VI.
  • Re:Comment your code (Score:3, Interesting)

    by kbielefe ( 606566 ) on Tuesday September 07, 2010 @08:49AM (#33497020)

    That's because people who don't know what they are doing don't know that they don't know what they're doing. Those types of comments should be accompanied by a clear competence or acceptance test. For example, the last such comment I wrote went something like:

    /* This might look like an unnecessary delay, but the timing has been
       carefully calibrated against a wide range of marginal real world
       conditions. If you touch this function, you must ensure it does not
       time out under these configurations... */

    In other words, "knowing what you're doing" must be definable and transferable knowledge.

  • Re:Comment your code (Score:3, Interesting)

    by Greyfox ( 87712 ) on Tuesday September 07, 2010 @09:44AM (#33497454) Homepage Journal
    That's true. The function in question was actually quite well commented, and that last one was really more of a warning that all the comments should be read before messing with it. A set of unit tests would actually have been pretty nice and would have saved us a lot of trouble over the 5 years we worked on it, but it would have been a tough sell to management.
  • Re:Comment your code (Score:3, Interesting)

    by jgrahn ( 181062 ) on Tuesday September 07, 2010 @02:37PM (#33500364)

    Commenting code isn't enough, it's just a small part of the design and documentation process. Comments are there to tie the code to the relevant part in your design document, which really is a part of programming people should put more effort into.

    It's been said for years, but it is almost never done. When it is done, it's most often (IME) done _after the fact_ because of some requirement to produce the paperwork. Perhaps it's time to give up on it. Is there a real reason for insisting on a design document, or is it just some sort of self-flagellation on the part of programmers?

    It might be flagellation, but not self-flagellation. Programmers hate writing useless documents. Management wants those documents *before* the code is written, but the programmer knows he'll rework the design because of stuff he learns while doing the coding. Management also wants the documents in some word processor format not compatible with the version control software, so it cannot be updated along with the code either.

    If I could have it my way, I'd have these things:

    • User reference documentation, like man pages. File and protocol format specifications. Standards. Those can be very helpful for the programmer too. They cover the requirements aspect of things.
    • Good checkin comments from the version control system, i.e. "I did this because of that".
    • A *brief* text about the general design and architectural decisions, glossary etc. Things like "this program revolves around a single select() loop" or "we try to make this part bloody fast because ..." or "we use this trick throughout the code to avoid table lookups". Maybe a not-too-detailed class diagram, with a note that it might be obsolete and if you want a recent one you're free to draw one.
    • A clear design with clear naming and strong typing, and/or readable unit tests.
    • Comments on the class or module level: "class Foo represents this thing, with these limitations". On the function level too, but only if it's needed.
